
Building smart Web 2.0 applications

  • Note:
    • A good starting point for finding more web sites with open APIs is ProgrammableWeb.
  • Introduction to Collective Intelligence
    • The ability to collect information and the computational power to interpret it have enabled great collaboration opportunities and a better understanding of users and customers.
    • Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn.
  • Making Recommendations
    • Use the preferences of a group of people to make recommendations to other people.
    • Collaborative filtering: A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours. It looks at other things they like and combines them to create a ranked list of suggestions.
    • Similarity score: a measure of how alike two people's preferences are, e.g. Euclidean distance or the Pearson correlation coefficient.
    • User-based collaborative filtering: Uses all the rankings from every user to create the dataset from which recommendations are made (see the sketch after this section).
    • Item-based collaborative filtering:
      • comparisons between items will not change as often as comparisons between users
      • precompute the most similar items for each item
      • build the dataset once and reuse it each time you need it
    • Item-based filtering usually outperforms user-based filtering in sparse datasets, and the two perform about equally in dense datasets.
    • User-based filtering is simpler to implement and doesn’t have the extra steps, so it is often more appropriate with smaller in-memory datasets that change very frequently.
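
As a rough illustration of the ideas above, here is a minimal user-based collaborative filtering sketch in Python. It assumes preferences are stored as a nested dict of {person: {item: rating}}, and uses the Pearson correlation as the similarity score; the function names and data layout are illustrative, not taken from any particular library.

from math import sqrt

def pearson(prefs, p1, p2):
    # Similarity score: Pearson correlation over the items both people rated.
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2sq = sum(prefs[p2][it] ** 2 for it in shared)
    psum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)
    num = psum - (sum1 * sum2 / n)
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    return num / den if den != 0 else 0

def recommend(prefs, person):
    # Score every unseen item by a similarity-weighted average of other users' ratings.
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == person:
            continue
        sim = pearson(prefs, person, other)
        if sim <= 0:
            continue
        for item, rating in prefs[other].items():
            if item not in prefs[person]:
                totals[item] = totals.get(item, 0) + rating * sim
                sim_sums[item] = sim_sums.get(item, 0) + sim
    rankings = [(totals[it] / sim_sums[it], it) for it in totals]
    return sorted(rankings, reverse=True)
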
  • Discovering Groups
    • Discover and visualize groups of things, people, or ideas that are all closely related.
    • Supervised learning methods: Techniques that use example inputs and outputs to learn how to make predictions.
    • Unsupervised learning: Find structure within a set of data where no one piece of data is the answer.
    • Hierarchical clustering
      • Hierarchical clustering builds up a hierarchy of groups by continuously merging the two most similar groups. Each of these groups starts as a single item. In each iteration this method calculates the distances between every pair of groups, and the closest ones are merged together to form a new group. This is repeated until there is only one group.
    • k-means clustering (sketched after this section)
      • K-means clustering begins with k randomly placed centroids (points in space that represent the center of the cluster), and assigns every item to the nearest one. After the assignment, the centroids are moved to the average location of all the nodes assigned to them, and the assignments are redone. This process repeats until the assignments stop changing.
    • Tanimoto coefficient: the ratio of the size of the intersection to the size of the union of two sets, i.e. |A ∩ B| / |A ∪ B|.
    • Multidimensional scaling: Take the difference between every pair of items and try to make a chart in which the distances between the items match those differences.
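
A rough k-means sketch over plain numeric vectors, assuming squared Euclidean distance for the assignment step; the data layout (rows as lists of numbers) and the iteration cap are assumptions made for illustration.

import random

def distance_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k=4, max_iters=100):
    dims = len(rows[0])
    # Start with k randomly placed centroids inside the range of the data.
    ranges = [(min(r[d] for r in rows), max(r[d] for r in rows)) for d in range(dims)]
    centroids = [[random.uniform(lo, hi) for lo, hi in ranges] for _ in range(k)]

    last_assignment = None
    for _ in range(max_iters):
        # Assign every row to the nearest centroid.
        assignment = [[] for _ in range(k)]
        for i, row in enumerate(rows):
            best = min(range(k), key=lambda c: distance_sq(row, centroids[c]))
            assignment[best].append(i)
        # Stop once the assignments no longer change.
        if assignment == last_assignment:
            break
        last_assignment = assignment
        # Move each centroid to the average position of the rows assigned to it.
        for c in range(k):
            if assignment[c]:
                centroids[c] = [sum(rows[i][d] for i in assignment[c]) / len(assignment[c])
                                for d in range(dims)]
    return assignment
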
  • Searching and Ranking
    • Search a large set of documents for a list of words, and rank results according to how relevant the documents are to those words.
    • Content-based ranking uses several possible metrics, based only on the content of the page, to determine how relevant a page is to the query.
      • Word frequency
      • Document location
      • Word distance
    • Inbound-link ranking uses the link structure of the site to determine what’s important.
      • Simple count
      • The PageRank algorithm (sketched after this section)
      • Using the link text
    • Multilayer perceptron neural network
      • Training algorithm: backpropagation
      • Input activations are set to 1 for the words that appear in the query.
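
A toy version of the PageRank iteration over an in-memory link graph. It assumes links is a dict mapping each page to the list of pages it links to, and uses the conventional 0.85 damping factor; this is a sketch of the idea, not a production implementation.

def pagerank(links, iterations=20, damping=0.85):
    pages = set(links) | {p for targets in links.values() for p in targets}
    ranks = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # A page's rank is a minimum value plus contributions from every page
            # linking to it, each diluted by how many outbound links that page has.
            incoming = [src for src, targets in links.items() if page in targets]
            pr = (1 - damping) + damping * sum(ranks[src] / len(links[src]) for src in incoming)
            new_ranks[page] = pr
        ranks = new_ranks
    return ranks
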
  • Optimization
    • Find the best solution to a problem by trying many different solutions and scoring them to determine their quality.
    • Typically used in cases where there are too many possible solutions to try them all.
    • Random searching
      • Serve as a baseline so you can see if the other algorithms are doing a good job.
    • Hill climbing
      • Start with a random solution and look at the set of neighboring solutions for those that are better (have a lower cost function).
    • Simulated annealing
      • Simulated annealing is willing to accept a worse solution near the beginning of the process. As the process goes on, the algorithm becomes less and less likely to accept a worse solution, until at the end it will only accept a better one (see the sketch after this section).
    • Genetic algorithms
      • The top solutions in the current population are added to the new population as they are. The rest of the new population consists of completely new solutions created by modifying the best ones, either through mutation (randomly changing one solution) or crossover (combining pieces of two).
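
A condensed simulated-annealing loop, assuming the solution is a vector of integers, domain gives a (low, high) bound per position, and cost is a user-supplied scoring function; the starting temperature, cooling rate, and step size are arbitrary illustrative values.

import math
import random

def annealing(domain, cost, T=10000.0, cool=0.95, step=1):
    # Start from a random solution inside the allowed ranges.
    sol = [random.randint(lo, hi) for lo, hi in domain]
    while T > 0.1:
        # Change one randomly chosen parameter by a small random amount.
        i = random.randrange(len(domain))
        delta = random.randint(-step, step)
        neighbor = sol[:]
        neighbor[i] = max(domain[i][0], min(domain[i][1], neighbor[i] + delta))

        old_cost, new_cost = cost(sol), cost(neighbor)
        # Always accept a better solution; accept a worse one with a probability
        # that shrinks as the temperature drops.
        if new_cost < old_cost or random.random() < math.exp(-(new_cost - old_cost) / T):
            sol = neighbor
        T *= cool
    return sol
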
  • Document Filtering
    • Classify documents based on their contents, e.g. filtering spam email.
    • Calculate the probability that a word appears in a particular category by dividing the number of documents in that category that contain the word by the total number of documents in that category.
    • Naive Bayesian classifier (see the sketch after this section)
      • Assumes that the probability of one word in the document being in a specific category is unrelated to the probability of the other words being in that category.
      • Pr(Category | Document) = Pr(Document | Category) x Pr(Category) / Pr(Document)
    • Choosing a category
      • In many applications the categories can’t be considered equal, and in some applications it’s better for the classifier to admit that it doesn’t know the answer than to decide that the answer is the category with a marginally higher probability.
      • Set up a minimum threshold for each category. I.e., for a new item to be classified into a particular category, its probability must be a specified amount larger than the probability for any other category.
    • The Fisher method
      • This method performs better when there are equal numbers of documents in each category.
      • Multiply all the probabilities together, take the natural log, and then multiply the result by –2.
      • If the probabilities were independent and random, the result of this calculation would fit a chi-squared distribution.
      • You can specify the lower bounds for each classification.
    • The multilayer perceptron network from the searching-and-ranking section can be adapted to the problems in this chapter by using the features as inputs and having outputs that represent each of the possible classifications.
    • Likewise, support vector machines can be applied to the problems in this chapter.
    • The reason Bayesian classifiers are often used for document classification is that they require far less computing power than other methods do.
    • On the other hand, neural networks and support-vector machines have one big advantage over the classifiers presented in this chapter: they can capture more complex relationships between the input features.
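
A bare-bones naive Bayesian classifier along the lines described above. Documents are assumed to be plain strings and features are lowercased words; the 0.5 smoothing toward an assumed probability is one common choice for handling unseen words, not the only option.

import re
from collections import defaultdict

class NaiveBayes:
    def __init__(self):
        self.feature_counts = defaultdict(lambda: defaultdict(int))  # word -> category -> docs containing it
        self.category_counts = defaultdict(int)                      # category -> number of docs

    def features(self, doc):
        return set(re.findall(r'[a-z]+', doc.lower()))

    def train(self, doc, category):
        self.category_counts[category] += 1
        for word in self.features(doc):
            self.feature_counts[word][category] += 1

    def word_prob(self, word, category):
        # Docs in the category containing the word / docs in the category,
        # smoothed toward 0.5 so unseen words are not assigned probability 0.
        if self.category_counts[category] == 0:
            return 0.5
        basic = self.feature_counts[word][category] / self.category_counts[category]
        count = sum(self.feature_counts[word].values())
        return (0.5 + count * basic) / (1.0 + count)

    def classify(self, doc):
        total_docs = sum(self.category_counts.values())
        best, best_score = None, 0.0
        for category in self.category_counts:
            # Pr(Document | Category) x Pr(Category), treating the words as independent.
            score = self.category_counts[category] / total_docs
            for word in self.features(doc):
                score *= self.word_prob(word, category)
            if score > best_score:
                best, best_score = category, score
        return best
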
  • Modeling with Decision Trees
    • You can understand the reasoning process of a decision tree just by looking at it, and you can even convert it to a simple series of if-then statements.
    • CART (Classification and Regression Trees)
    • Calculate how mixed a set is using Gini impurity or entropy (both sketched after this section).
    • Gini impurity: the expected error rate if one of the results from a set is randomly applied to one of the items in the set.
    • Entropy: the amount of disorder in a set
    • Pruning involves checking pairs of nodes that have a common parent to see if merging them would increase the entropy by less than a specified threshold. If so, the leaves are merged into a single node with all the possible outcomes.
    • Another advantage of decision trees is their ability to deal with missing data.
    • When you have a tree with numerical outcomes, you can use variance as a scoring function instead of entropy or Gini impurity.
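
The two impurity measures, sketched for rows whose last element is the class label; this row layout is an assumption made for illustration.

from math import log2
from collections import Counter

def outcome_counts(rows):
    return Counter(row[-1] for row in rows)

def gini_impurity(rows):
    # Expected error rate if a randomly drawn outcome is assigned to a randomly drawn row.
    total = len(rows)
    counts = outcome_counts(rows)
    return 1 - sum((c / total) ** 2 for c in counts.values())

def entropy(rows):
    # Sum of -p * log2(p) over all outcomes: how disordered the set is.
    total = len(rows)
    counts = outcome_counts(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())
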
  • Building Price Models
    • Make numerical predictions based on examples seen before, and even display probability distributions for the predictions to help the user interpret how a prediction is made.
    • An important part of making numerical predictions is determining which variables are important and in what combinations.
    • k-Nearest Neighbors
      • Choosing the correct number of neighbors can be done manually for different datasets, or it can be done with optimization.
    • Weighted neighbors: weight each neighbor's contribution by its distance, using an inverse, subtraction, or Gaussian function (see the sketch after this section).
    • Scaling dimensions using optimization
    • An advantage of optimizing the variable scales in this way is that you immediately see which variables are important and how important they are.
    • Estimating the probability density
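
A small weighted k-nearest-neighbors estimator using a Gaussian weighting function; the Euclidean distance, the sigma value, and the (feature_vector, price) data layout are illustrative assumptions.

import math

def euclidean(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def gaussian_weight(dist, sigma=10.0):
    # Weight falls off smoothly with distance and never quite reaches zero.
    return math.exp(-dist ** 2 / (2 * sigma ** 2))

def weighted_knn(data, query, k=5):
    # data is a list of (feature_vector, price) pairs.
    neighbors = sorted(data, key=lambda item: euclidean(item[0], query))[:k]
    total_weight = total = 0.0
    for vec, price in neighbors:
        w = gaussian_weight(euclidean(vec, query))
        total += w * price
        total_weight += w
    return total / total_weight if total_weight else 0.0
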
  • Advanced Classification: Kernel Methods and SVMs
    • Deals with problems that involve nonlinearity and interplay between the variables.
    • Decision trees are often a poor way to determine the class in problems with multiple numerical inputs that don’t exhibit simple relationships.
    • Basic linear classification
      • It works by finding the average of all the data in each class and constructing a point that represents the center of the class. It can then classify new points by determining to which center point they are closest.
    • I t’s a good idea to put all the data on a common scale so that differences are comparable in every variable.
    • Kernel methods
      • The kernel trick involves replacing the dot-product function with a new function that returns what the dot-product would have been if the data had first been transformed to a higher dimensional space using some mapping function.
      • Radial-basis function
      • So, instead of calculating the dot-product between the point you’re trying to classify and the average point for a class, you calculate the dot-product or the radial-basis function between the point and every other point in the class, and then average them (see the sketch after this section).
    • Support-vector machines
      • maximum-margin hyperplane: the dividing hyperplane that stays as far as possible from the points closest to it
      • The algorithm for training a support-vector machine is mathematically involved and computationally intensive.
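
A sketch of the kernel-trick idea: instead of measuring the distance to each class's average point in the original space, average the radial-basis value between the query point and every point in each class and pick whichever class comes out larger. The gamma parameter is illustrative, and the offset correction used in a full derivation is omitted for brevity.

import math

def rbf(v1, v2, gamma=20.0):
    # Radial-basis function: the dot-product the points would have after being
    # mapped into a higher-dimensional space.
    dist_sq = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-gamma * dist_sq)

def nonlinear_classify(point, class0, class1, gamma=20.0):
    # Average the kernel value between the point and every member of each class,
    # then pick the class the point is "closer" to in the transformed space.
    avg0 = sum(rbf(point, row, gamma) for row in class0) / len(class0)
    avg1 = sum(rbf(point, row, gamma) for row in class1) / len(class1)
    return 0 if avg0 > avg1 else 1
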
  • Finding Independent Features
    • This chapter will look at ways to extract the important underlying features from sets of data that are not labeled with specific outcomes.
    • Feature extraction: tries to find new data rows that can be used in combination to reconstruct rows of the original dataset.
    • non-negative matrix factorization (NMF), sketched after this section
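
A compact NMF sketch using the standard multiplicative update rules, written with NumPy; it assumes data is a non-negative NumPy array and factors it as approximately weights × features. The iteration count and the small epsilon guard are illustrative choices.

import numpy as np

def nmf(data, n_features=10, iterations=50):
    rows, cols = data.shape
    # Start from random non-negative factor matrices.
    weights = np.random.rand(rows, n_features)
    features = np.random.rand(n_features, cols)
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative while reducing
        # the reconstruction error ||data - weights @ features||.
        features *= (weights.T @ data) / (weights.T @ weights @ features + 1e-9)
        weights *= (data @ features.T) / (weights @ features @ features.T + 1e-9)
    return weights, features
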
  • Evolving Intelligence
    • Genetic programming
    • Genetic programming is a machine-learning technique inspired by the theory of biological evolution (a minimal tree representation is sketched after this section).
    • Genetic algorithms are an optimization technique that use the idea of evolutionary pressure to choose the best result. With any form of optimization, you have already selected an algorithm or metric and you’re simply trying to find the best parameters for it.
    • It is possible to force the programs to remain simple, but in many cases this will make it more difficult to find a good solution. A better way to deal with this issue is to allow the programs to evolve to a good solution and then remove and simplify unnecessary portions of the tree. You can do this manually, and in some cases you can do it automatically using a pruning algorithm.
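
A minimal sketch of how programs can be represented as trees and mutated, which is the core of genetic programming; the node classes, the operator set, and the mutation routine here are illustrative only.

import random
from copy import deepcopy

class Node:
    """A function node applies an operator to the results of its children."""
    def __init__(self, op, children):
        self.op, self.children = op, children
    def evaluate(self, inputs):
        vals = [c.evaluate(inputs) for c in self.children]
        return {'add': vals[0] + vals[1], 'sub': vals[0] - vals[1], 'mul': vals[0] * vals[1]}[self.op]

class ParamNode:
    """A leaf that returns one of the program's inputs."""
    def __init__(self, idx):
        self.idx = idx
    def evaluate(self, inputs):
        return inputs[self.idx]

class ConstNode:
    """A leaf that returns a constant."""
    def __init__(self, value):
        self.value = value
    def evaluate(self, inputs):
        return self.value

def mutate(tree, prob=0.1, n_params=2):
    # With some probability, replace a subtree with a random leaf;
    # otherwise recurse into the children.
    if random.random() < prob:
        return random.choice([ParamNode(random.randrange(n_params)), ConstNode(random.randint(0, 10))])
    tree = deepcopy(tree)
    if isinstance(tree, Node):
        tree.children = [mutate(c, prob, n_params) for c in tree.children]
    return tree

# Example: the program (inputs[0] + 5) * inputs[1]
program = Node('mul', [Node('add', [ParamNode(0), ConstNode(5)]), ParamNode(1)])
print(program.evaluate([2, 3]))   # 21
mutated = mutate(program)
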
