The collective conclusion from the 2006 Netflix Prize competition was this: You can't beat massive model ensembles. Ten years on, I'm still convinced the larger lesson was completely overlooked: sometimes, when modeling human behavior, there are better KPIs than RMSE.
In 2011, I began work on my own SaaS recommendation service (called Selloscope). As part of that work, I evaluated the performance of a Slope One model (similar at the time to Netflix's Cinematch algorithm) against a spiffed-up object-to-object collaborative filter. The latter doesn't produce an error score, so I devised a different goalpost: if I remove random items from each user's profile, how well does the model fill in those gaps?
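To make that goalpost concrete, here's a minimal sketch of the protocol in Python. This is my reconstruction, not Selloscope's actual code: hide a few randomly chosen items per user, ask the model for top-N recommendations from what's left, and score the fraction of hidden items that come back. The `cooccurrence_recommender` is just a bare-bones stand-in so the harness runs end to end.

```python
import random
from collections import Counter

def holdout_recall(profiles, recommend, k=2, top_n=10, seed=42):
    """Leave-k-out check: hide k random items per user, then measure
    what fraction of the hidden items the recommender fills back in."""
    rng = random.Random(seed)
    hits = hidden_total = 0
    for user, items in profiles.items():
        if len(items) <= k:                  # too little history to hold out
            continue
        hidden = set(rng.sample(sorted(items), k))
        recs = set(recommend(user, items - hidden, top_n))
        hits += len(hidden & recs)
        hidden_total += len(hidden)
    return hits / hidden_total if hidden_total else 0.0

def cooccurrence_recommender(profiles):
    """A bare-bones object-to-object filter to plug into the harness:
    rank candidates by how often they co-occur with the user's items."""
    def recommend(user, visible, top_n):
        scores = Counter()
        for other_user, other_items in profiles.items():
            if other_user == user:           # don't peek at the held-out items
                continue
            if visible & other_items:
                for item in other_items - visible:
                    scores[item] += 1
        return [item for item, _ in scores.most_common(top_n)]
    return recommend

profiles = {
    "u1": {"dune", "hyperion", "neuromancer", "foundation"},
    "u2": {"dune", "hyperion", "snow crash"},
    "u3": {"neuromancer", "snow crash", "foundation"},
}
print(holdout_recall(profiles, cooccurrence_recommender(profiles)))
```

Because anything with a `recommend` interface can be dropped into the harness, the Slope One model and the object-to-object filter compete on the same recall-style score rather than on RMSE.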