Hacking Netflix Prize: Calculating 316 million movie correla...
KNN is one of the most popular CF algorithms. Central to KNN is how to define the neighborhood relationship, or the distance between two objects. For this purpose, Pearson Correlation is a pretty good measure and is used frequently (see my previous post for my basic KNN approach). In the Netflix dataset, there are 17770 movies, so we need to calculate 17770*17770, or about 316 million, Pearson correlations. Not a trivial task. In this post, I'll describe the tricks I used to optimize my Pearson Correlation calculation, which cut my running time from 2.5 hours to less than 2 minutes. It won't help your RMSE directly, but it may help indirectly by allowing you to explore the KNN parameter space faster. And although I used Pearson Correlation, the methods described in this post can be applied to many other neighborhood measures too.
http://dmnewbie.blogspot.com/2009/06/calculating-316-million...
tags: knn ml
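For reference, the quantity that entry computes for each movie pair can be sketched as below. This is a plain, unoptimized Pearson correlation over the users who rated both movies, not the linked post's optimized method. The dict-of-ratings layout and the name `pearson` are assumptions made for illustration.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation between two movies over users who rated both.

    ratings_a, ratings_b: dicts mapping user id -> rating (hypothetical layout).
    Returns 0.0 when fewer than two users overlap or either variance is zero.
    """
    common = ratings_a.keys() & ratings_b.keys()
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Toy data: users 1, 2 and 3 rated both movies.
movie_a = {1: 5, 2: 3, 3: 4, 4: 1}
movie_b = {1: 4, 2: 2, 3: 3, 5: 5}
print(pearson(movie_a, movie_b))
```

Running this naively for every pair of 17770 movies is what makes the task expensive, which is the cost the post sets out to reduce.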
Eureqa | Cornell Computational Synthesis Laboratory
Eureqa is a software tool for detecting equations and hidden mathematical relationships in your data. Its primary goal is to identify the simplest mathematical formulas which could describe the underlying mechanisms that produced the data.
http://ccsl.mae.cornell.edu/eureqa
tags: data visualization viz ai machinelearning ml
Netflix prize tribute: Recommendation algorithm in Python | ...
In honor of the prize barrier being broken, I put together a little implementation of an early leader's approach to the problem. They experimented with several different approaches, but the one I use the most is the original probabilistic matrix factorization (PMF) approach. To see all the details, including how it performs on the full Netflix problem, see Russ and Andrei's paper.
http://blog.smellthedata.com/2009/06/netflix-prize-tribute-r...
tags: python netflix ml machinelearning
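As a rough illustration of the matrix factorization idea that entry mentions, here is a minimal sketch: user and item factor vectors fitted by stochastic gradient descent on observed ratings, with an L2 penalty standing in for Gaussian priors. It is not the code from the linked post. The function names, hyperparameters, and toy ratings are all assumptions.

```python
import random

def predict(U, V, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def rmse(ratings, U, V):
    """Root mean squared error over (user, item, rating) triples."""
    total = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

def train_pmf(ratings, n_users, n_items, rank=2, lr=0.01, reg=0.05,
              epochs=500, seed=0):
    """Fit factors by SGD; all hyperparameter defaults are illustrative."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on squared error plus L2 penalty.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

# Toy (user, item, rating) triples, made up for this sketch.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
U, V = train_pmf(ratings, n_users=3, n_items=3)
print(rmse(ratings, U, V))
```

On real data one would typically subtract the global mean rating first and hold out a validation set. Both are omitted here to keep the sketch short.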
Latent Dirichlet allocation - Wikipedia, the free encycloped...
In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups which explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA was first presented as a graphical model for topic discovery and was developed by David Blei, Andrew Ng, and Michael Jordan in 2003.
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
tags: ml
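The generative story in that excerpt (each document is a mixture of topics, each word is drawn from one topic) can be sketched as follows. This only samples from the model and does no inference. The two toy topics, the `alpha` values, and the function names are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Sample one document under the LDA generative process.

    topics: list of dicts mapping word -> probability (hypothetical layout).
    Returns the document's topic mixture and its list of words.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word, then pick a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two made-up topics over a tiny vocabulary.
topics = [
    {"movie": 0.5, "rating": 0.3, "user": 0.2},
    {"model": 0.4, "topic": 0.4, "word": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```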
Machine Learning (Theory)
http://hunch.net/
tags: ml