The collective conclusion from 2006’s Netflix Prize competition was this: You can’t beat massive model ensembles. Ten years on, I’m still convinced the larger lesson was completely overlooked: that sometimes, when modeling human behavior, there are better KPIs than RMSE.
In 2011, I began work on my own SaaS recommendation service (called Selloscope). As part of that work, I evaluated the performance of a Slope One model (similar at the time to Netflix’s Cinematch algorithm) against a spiffied-up object-to-object collaborative filter. The latter doesn’t produce an error score, so I devised a different goalpost: if I remove random ratings from each user’s profile, how well does the model fill in those gaps?
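For the curious, the basic shape of that evaluation looks something like the sketch below. This is a rough illustration, not Selloscope’s actual code; names like `train_and_recommend` are placeholders for whatever model you’re testing. The idea: hide a handful of each user’s rated items, retrain on the censored data, and count how many of the hidden items climb back into that user’s top recommendations.

```python
import random

def gap_fill_rate(ratings_by_user, train_and_recommend, k_hidden=5, top_n=100):
    """Hide a few rated items per user, retrain on the censored data, and
    count how many hidden items reappear in that user's top-N list.
    ratings_by_user maps user -> {item: rating}; train_and_recommend is any
    callable that accepts the censored ratings and returns user -> ranked items."""
    censored, hidden = {}, {}
    for user, items in ratings_by_user.items():
        if len(items) < 2:                      # nothing we can safely hide
            censored[user] = dict(items)
            continue
        held_out = set(random.sample(list(items), min(k_hidden, len(items) - 1)))
        hidden[user] = held_out
        censored[user] = {i: r for i, r in items.items() if i not in held_out}

    recommendations = train_and_recommend(censored)
    hits = total = 0
    for user, held_out in hidden.items():
        top = set(recommendations.get(user, [])[:top_n])
        hits += len(held_out & top)
        total += len(held_out)
    return hits / total if total else 0.0
```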
I’m still proud of the work we did on Selloscope, and I still believe the larger lesson is waiting to be learned. Selloscope is long gone, so here’s a re-post of that comparison. Wish we’d made it 🙂
———————————————————-
In October 2006, the Netflix Prize brought recommendation technology into the public consciousness by offering a prize of US$1,000,000 to anyone who could improve Netflix’ own Cinematch algorithm. The contest was seemingly simple: Improve the ability to predict users’ ratings of films by 10%. Over 40,000 teams from 186 countries participated in this colossal experiment in open innovation to create a better theory of collaborative filtering and recommender systems. The stakes were high and the competition was fierce. Nearly three years later, Netflix awarded the million-dollar grand prize to the team (“BellKor’s Pragmatic Chaos”) which had achieved a 10.05% improvement over Cinematch — largely through the combination of algorithms from multiple teams and complex data fitting techniques.
At about that same time, social science researchers Matthew Salganik, Peter Dodds, and Duncan Watts were making remarkable advances in the theory of recommendation technology through their behavioral studies of cultural markets, while Netflix competitors struggled to produce just a 10% improvement. This work on cultural markets demonstrated that social influence (marketing, media attention, critical acclaim, keeping up with the Joneses) makes it very difficult to tell the most-popular goods apart from the almost-popular goods. Yet this groundbreaking work on social influence went seemingly unnoticed by the field of Netflix competitors. While recent advances such as PureSVD have begun to outperform existing models, Netflix continues to frustrate its subscribers to this day by recommending an unending series of movies they probably won’t hate, but simply have little interest in seeing.
The Netflix Prize was pioneering. It was bold. It seemed right. It brought thousands of us together to make a tiny little piece of the world into something just a little bit better. Netflix deserves every bit of the good press it received during the Prize run. In hindsight, however, if the Prize was intended to advance the theory of recommendation technology, it was fundamentally flawed in two very significant ways:
Firstly, the Netflix datasets were so large that processing them in any novel way required either highly optimized data processing techniques or significant hardware that hobbyists were unlikely to have access to. It came as little surprise that the Prize winners hailed from corporate labs like AT&T Labs and Yahoo! Research; few others had access to that kind of hardware. In effect, the Netflix Prize was subsidized by other large corporations that possessed the specialized skillsets and the big iron needed to compete.
Secondly, and more importantly, Netflix’ choice to use root mean square error (RMSE) to evaluate performance fundamentally constrained Prize participants to building increasingly complex optimizations of existing recommendation technologies. Concerns were raised that the Prize’s target of a 10% improvement in RMSE might not even be noticeable to users. Recent work on cultural affinity had also shown that if your goal is to predict the best items for each user, those very items are the hardest to estimate correctly and inherently carry the most prediction error. With the Prize already underway, however, it was too late for major rule changes.
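For readers who haven’t worked with it, RMSE is simply the square root of the average squared gap between predicted and actual ratings. A minimal sketch:

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean square error over paired lists of predicted and actual ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```

On a one-to-five-star scale, shaving a few hundredths of a point off that number is exactly the kind of improvement a subscriber may never notice.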
All this raises the question: Could Netflix have designed the Prize objective so that improvements in recommendations truly benefited its customers? For many who work in the field of personalization and recommender technology, and especially for those of us who have built Selloscope’s recommendation technology from the ground up, the answer is a resounding “Yes.”
Evaluating how relevant your recommendations are must involve comparing actual scores to predicted scores in some way. RMSE is one way to do that. Another approach is to see how highly each user’s favorite movies rank in the overall list of recommendations… that is, what percentage of a user’s top-rated items can your recommendation system correctly predict? This is a common-sense approach: if a user’s favorite movies aren’t in the top ten, or top hundred, or even top thousand recommendations, what kind of recommendations will you be making to your customers? In effect, the recommendation systems currently on the market present the user with a list of the best movies they don’t really want to see.
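Here’s roughly what that metric looks like in code (again a sketch with placeholder inputs, not a production implementation): for each user, take their top-rated items and ask what fraction of them appear in the first N recommendations the model produces.

```python
def top_n_hit_rate(favorites_by_user, ranked_recs_by_user, n=100):
    """Average fraction of each user's favorite items that appear in
    their top-n recommendations. Inputs are placeholders:
    favorites_by_user maps user -> set of top-rated items,
    ranked_recs_by_user maps user -> list of items, best first."""
    scores = []
    for user, favorites in favorites_by_user.items():
        if not favorites:
            continue
        top = set(ranked_recs_by_user.get(user, [])[:n])
        scores.append(len(favorites & top) / len(favorites))
    return sum(scores) / len(scores) if scores else 0.0
```

The numbers in the comparison below are of exactly this shape: the share of users’ favorites that make it into the top 10, 100, and 1000 recommendations.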
Selloscope has been working with a number of beta clients to develop and refine its recommendation technology. With the Netflix Prize dataset no longer publicly available due to privacy concerns, we analyzed a set of ratings of 25,000 board games from thousands of users, nearly identical in structure to the Prize dataset. As a benchmark, we compared a Slope One collaborative filter (a very basic implementation of Netflix’ Cinematch model) to an out-of-the-box Selloscope model. How well does a standard collaborative filtering approach rank users’ top games compared to Selloscope’s cultural affinity model?
| Users’ favorites found in… | Slope One | Selloscope |
|---|---|---|
| Top 10 Recommendations | 0% | 6% |
| Top 100 Recommendations | 2% | 33% |
| Top 1000 Recommendations | 6% | 68% |
From the table above, the results are clear: using Slope One, only two percent of users’ favorites show up in the top 100 recommendations. Selloscope returns 33% of users’ favorites in its top 100 recommendations, better than a 16x improvement. And we’re just getting warmed up.