WillsEducation
AI & ML · 19 min read · Published 5 April 2026

Netflix

Netflix's recommender: the $1B algorithm hiding in plain sight.

Recommendation Systems · Personalisation · A/B Testing
The story

A technical walkthrough of the ranking, re-ranking, and artwork-personalisation stack that drives 80% of what you watch, plus the experimentation discipline that separates Netflix from every other streaming service.

What you’ll learn

  • 01 Two-tower retrieval vs. sequential ranking: when each belongs in a recsys
  • 02 Thompson sampling for artwork personalisation without killing creative brand
  • 03 Why Netflix runs 250+ A/B tests a year and how experiment velocity compounds

The full breakdown

5 sections · 19 min read

Chapter 01

Why 80% of what you watch is chosen by an algorithm

Netflix has stated publicly that more than 80% of viewing hours come from algorithmic recommendations rather than active search. That single number reframes Netflix as a recommendations company that happens to license content, not the other way around. Every dollar spent on a show that nobody watches is a dollar wasted; every minute spent picking what to watch is a minute the user might spend cancelling instead. The recommender is a customer-retention engine in a way no other Netflix system is.

  • 80% of viewing hours: from recommendations, not search
  • $1B+ annual value: attributed to the recommender
  • 250+ live A/B tests: running simultaneously

The economic value of getting this right is hard to overstate. Reed Hastings has cited internal estimates of $1B+ in annual subscriber value attributable to the recommender. That number predates Netflix's pivot to ads and live sport, both of which raise the stakes further, because both rely on accurate prediction of viewing intent.

Chapter 02

The two-stage stack: retrieval, then ranking

Modern Netflix recommendations follow the standard two-stage pattern that powers most large-scale recsys: retrieval first, then ranking. Retrieval has to evaluate ~20,000 titles per user in milliseconds; ranking has the luxury of evaluating maybe 200 candidates with much heavier features.

                     | Retrieval                      | Ranking
Candidates evaluated | ~20,000 titles                 | ~200 titles
Latency budget       | < 50 ms                        | < 150 ms
Model class          | Two-tower embeddings           | Multi-objective deep model
Optimised for        | Recall (don't miss good stuff) | Precision (rank what's left)
Features used        | ~50 (light)                    | ~500 (heavy)

Retrieval at Netflix is a classic two-tower deep learning model. One tower embeds the user (their watch history, completion patterns, hover-vs-click signals, time-of-day patterns). The other tower embeds the title (genre, cast, runtime, language, but also collaborative filtering signals from similar users). At serve time, retrieval is just a nearest-neighbour lookup in the embedding space: fast, parallelisable, and cheap.
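To make the serve-time step concrete, here is a minimal sketch of the nearest-neighbour lookup. It assumes both towers have already been trained and simply stands in random vectors for their outputs; the function names, the 64-dimensional embeddings, and the brute-force dot product are illustrative choices, not Netflix's implementation (a production system would typically use an approximate nearest-neighbour index).

```python
import numpy as np

def l2_normalise(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_top_k(user_vec, title_vecs, k):
    """Return indices of the k titles scoring highest against the user, best first."""
    scores = title_vecs @ user_vec               # one dot product per title
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k without a full sort
    return top[np.argsort(-scores[top])]         # sort only the k survivors

rng = np.random.default_rng(0)
# Random stand-ins for the two towers' outputs (assumption: 64-dim embeddings).
title_vecs = l2_normalise(rng.normal(size=(20_000, 64)))
user_vec = l2_normalise(rng.normal(size=64))

candidates = retrieve_top_k(user_vec, title_vecs, k=200)
```

The 200 indices in `candidates` are what the heavier ranking stage would then score.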

Ranking is where things get interesting. Netflix uses a multi-objective ranker that has to optimise for completion (will the user finish this), retention (does watching this make them less likely to cancel next month), and diversity (don't show me four cooking shows in a row even if I'd technically watch them). The trade-offs between these objectives are tuned via online experiments, not chosen by hand.
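One simple way to picture a multi-objective ranker is a weighted blend of per-objective predictions followed by a greedy re-rank that discounts genres already on the row. The sketch below is only that picture: the `Candidate` fields, the weights, and the repeat penalty are hypothetical values chosen for illustration, and in practice such trade-offs would be tuned through online experiments rather than set by hand.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    genre: str
    p_complete: float      # hypothetical model output: chance the user finishes it
    retention_gain: float  # hypothetical model output: effect on staying subscribed

def rank(candidates, w_complete=0.6, w_retention=0.4, repeat_penalty=0.3, slots=3):
    """Greedy re-rank: best blended score first, discounted per genre already shown."""
    remaining = list(candidates)
    shown_genres = {}
    ranked = []
    while remaining and len(ranked) < slots:
        def score(c):
            base = w_complete * c.p_complete + w_retention * c.retention_gain
            return base - repeat_penalty * shown_genres.get(c.genre, 0)
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
        shown_genres[best.genre] = shown_genres.get(best.genre, 0) + 1
    return ranked

pool = [
    Candidate("Bake Off A", "cooking", 0.90, 0.50),
    Candidate("Bake Off B", "cooking", 0.85, 0.50),
    Candidate("Heist Drama", "thriller", 0.70, 0.60),
    Candidate("Bake Off C", "cooking", 0.80, 0.50),
]
row = [c.title for c in rank(pool)]
# row == ["Bake Off A", "Heist Drama", "Bake Off B"]
```

With `repeat_penalty=0.0` the same pool yields three cooking shows in a row, which is exactly the failure the diversity objective exists to prevent.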

Chapter 03

Personalised artwork: Thompson sampling at the thumbnail

The most under-appreciated piece of the Netflix stack is artwork personalisation. The same show ships with 5-10 candidate images. The system picks one per user based on what they've engaged with historically: a romance fan sees the romantic subplot, a comedy fan sees the funny side, a thriller fan sees the suspense.

This is implemented as a multi-armed bandit (specifically Thompson sampling). Each artwork variant is an arm; the reward is play rate. The bandit is contextual: it conditions on user features, so the same artwork can be optimal for one user and sub-optimal for another. This sounds simple, but the operational complexity is enormous. Who picks the candidate set? How do you stop the bandit from converging too quickly and missing slow-burn artwork? How do you handle a brand-new title with no signal at all?
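A stripped-down version of the idea can be written as Beta-Bernoulli Thompson sampling with one posterior per (context, artwork) pair. This is a sketch under strong assumptions: context is reduced to a single discrete label such as "romance_fan", the class and arm names are hypothetical, and the play rates in the simulation are made up. A real contextual bandit would condition on many user features through a model rather than a lookup table.

```python
import random
from collections import defaultdict

class ArtworkBandit:
    """Beta-Bernoulli Thompson sampling, one posterior per (context, artwork) pair."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        self.rng = random.Random(seed)
        # [alpha, beta] = [1 + plays, 1 + skips]; Beta(1, 1) is a uniform prior.
        self.posterior = defaultdict(lambda: [1.0, 1.0])

    def choose(self, context):
        """Sample a plausible play rate for every arm and show the highest draw."""
        draws = {
            arm: self.rng.betavariate(*self.posterior[(context, arm)])
            for arm in self.arms
        }
        return max(draws, key=draws.get)

    def update(self, context, arm, played):
        """Record one impression: a play increments alpha, a skip increments beta."""
        self.posterior[(context, arm)][0 if played else 1] += 1.0

# Simulated environment with invented play rates, purely for illustration.
true_rate = {
    ("romance_fan", "couple_still"): 0.12, ("romance_fan", "chase_still"): 0.04,
    ("thriller_fan", "couple_still"): 0.03, ("thriller_fan", "chase_still"): 0.11,
}
bandit = ArtworkBandit(["couple_still", "chase_still"], seed=7)
env = random.Random(11)
for _ in range(4000):
    ctx = env.choice(["romance_fan", "thriller_fan"])
    arm = bandit.choose(ctx)
    bandit.update(ctx, arm, played=env.random() < true_rate[(ctx, arm)])
```

Because each choice is a random draw from the posterior, uncertain arms still get shown occasionally, which is what keeps the bandit exploring instead of locking onto an early leader.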

Netflix's answer to the cold-start problem is a separate "creative quality" model that scores candidate images using historical engagement on similar titles before any real-user exposure. New artwork enters the bandit with informed priors instead of uniform priors, saving weeks of low-confidence exploration.
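In Beta-Bernoulli terms, an informed prior is just a set of pseudo-counts: the predicted play rate sets the ratio of alpha to beta, and the number of pseudo-impressions sets how strongly the prior resists early noise. The sketch below assumes a hypothetical creative-quality score of 10% worth 50 pseudo-impressions; both numbers and the function names are illustrative.

```python
def informed_prior(predicted_play_rate, pseudo_impressions):
    """Turn a model's play-rate guess into Beta(alpha, beta) pseudo-counts."""
    if not 0.0 < predicted_play_rate < 1.0:
        raise ValueError("predicted_play_rate must be strictly between 0 and 1")
    alpha = predicted_play_rate * pseudo_impressions
    beta = (1.0 - predicted_play_rate) * pseudo_impressions
    return alpha, beta

def posterior_mean(alpha, beta, plays, skips):
    """Mean of the Beta posterior after observing real plays and skips."""
    return (alpha + plays) / (alpha + beta + plays + skips)

uniform = (1.0, 1.0)
informed = informed_prior(predicted_play_rate=0.10, pseudo_impressions=50)

# Suppose a new image gets a lucky 3 plays in its first 10 impressions.
mean_uniform = posterior_mean(*uniform, plays=3, skips=7)    # 4 / 12, about 0.33
mean_informed = posterior_mean(*informed, plays=3, skips=7)  # 8 / 60, about 0.13
```

The uniform prior swings to roughly 33% on ten impressions of noise, while the informed prior moves only modestly from its 10% starting point.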

Chapter 04

The experimentation discipline: 250+ A/B tests a year

Most companies run A/B tests. Very few run them with the discipline Netflix has built. Every meaningful change to the product (UI, recommendations, even artwork variants) ships through their experimentation platform. They run 250+ live tests at any given time, with overlap matrices that allow many changes to be evaluated simultaneously without statistical interference.
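A common building block for running many tests on the same users is deterministic, per-experiment hashing: salting the hash with the experiment name makes assignments in one test effectively independent of assignments in another. The sketch below illustrates that general technique only; the experiment names are hypothetical and nothing here describes Netflix's actual platform or its overlap matrices.

```python
import hashlib
from collections import Counter

def bucket(user_id, experiment, n_buckets=100):
    """Deterministically map (user, experiment) to a bucket in [0, n_buckets)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def variant(user_id, experiment, treatment_share=0.5, n_buckets=100):
    """The same user always lands in the same arm of a given experiment."""
    cutoff = treatment_share * n_buckets
    return "treatment" if bucket(user_id, experiment, n_buckets) < cutoff else "control"

# Joint assignment across two hypothetical experiments for 10,000 users.
cells = Counter(
    (variant(u, "new_row_layout"), variant(u, "artwork_v2"))
    for u in range(10_000)
)
```

Each of the four cells in `cells` should hold roughly a quarter of the users, so the effect of one test can be read without being confounded by the other.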

How an A/B idea actually dies at Netflix

Of every 100 ideas that enter the pipeline, only ~7 reach a full launch; guardrail metrics kill more wins than the primary metric ever does.

Two operational details are worth copying. First: the team has invested heavily in guardrail metrics, that is, secondary metrics that flag when a winning A/B test has a hidden cost (a recommender change that boosts engagement but hurts long-term retention, for example). The guardrails kill more ideas than the primary metric ever does.
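The guardrail logic can be sketched as a launch gate: the primary metric must be clearly positive, and no guardrail may show a drop larger than its agreed tolerance. The `MetricReadout` type, metric names, tolerances, and readout numbers below are all hypothetical, and the rule assumes every metric is "higher is better" with confidence intervals computed elsewhere.

```python
from dataclasses import dataclass

@dataclass
class MetricReadout:
    name: str
    delta: float    # treatment minus control, in the metric's own units
    ci_low: float   # lower bound of the confidence interval on delta
    ci_high: float  # upper bound of the confidence interval on delta

def launch_decision(primary, guardrails, tolerances):
    """Return (decision, reasons) for one experiment readout."""
    if primary.ci_low <= 0:
        return "no_launch", ["primary metric not clearly positive"]
    # Block if we cannot rule out a guardrail drop beyond its tolerance.
    breached = [g.name for g in guardrails if g.ci_low < -tolerances[g.name]]
    return ("blocked", breached) if breached else ("launch", [])

primary = MetricReadout("hours_viewed", delta=0.8, ci_low=0.3, ci_high=1.3)
guardrails = [
    MetricReadout("retention_30d_pp", delta=-0.4, ci_low=-0.7, ci_high=-0.1),
    MetricReadout("catalogue_diversity", delta=0.0, ci_low=-0.1, ci_high=0.1),
]
tolerances = {"retention_30d_pp": 0.2, "catalogue_diversity": 0.5}

decision = launch_decision(primary, guardrails, tolerances)
# decision == ("blocked", ["retention_30d_pp"])
```

Here the test wins on engagement yet is still blocked, because the retention guardrail cannot rule out a drop beyond its tolerance.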

Second: Netflix runs hold-out groups for measuring causal impact across systems. A small slice of users is held out of every new feature for a quarter; the difference between this group and the main population is the team's read on whether all the experimentation in aggregate is actually moving the business. This is rare in industry, where most A/B platforms only measure the immediate impact of each test in isolation.
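Reading out a hold-out group comes down to comparing one rate between two populations. The sketch below uses a normal-approximation confidence interval on the difference in retention rates; the function name and the user counts are invented for illustration, and a real readout would also need to account for how users were assigned and for seasonality over the quarter.

```python
import math

def holdout_effect(exposed_retained, exposed_n, holdout_retained, holdout_n, z=1.96):
    """Difference in retention (exposed minus hold-out) with an approximate 95% CI."""
    p_exposed = exposed_retained / exposed_n
    p_holdout = holdout_retained / holdout_n
    diff = p_exposed - p_holdout
    std_err = math.sqrt(
        p_exposed * (1 - p_exposed) / exposed_n
        + p_holdout * (1 - p_holdout) / holdout_n
    )
    return diff, diff - z * std_err, diff + z * std_err

# Invented numbers: 96% retention with all features, 95% in the hold-out.
diff, low, high = holdout_effect(
    exposed_retained=912_000, exposed_n=950_000,
    holdout_retained=47_500, holdout_n=50_000,
)
```

If the whole interval sits above zero, as it does for these made-up counts, the quarter's launches in aggregate improved retention; an interval straddling zero would mean the individual test wins did not add up.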

Chapter 05

What you can copy from this playbook

Two ideas translate directly to almost any data product. First, structure your recommender as retrieval-then-ranking from day one, even if the catalogue is small enough that you could brute-force it. The pattern forces you to separate "can the model see this candidate at all" from "how should we rank what we can see," and that separation is what lets you scale later.

Second, treat your experimentation platform as a product, not a tool. Netflix's edge is not the algorithms (many of which they've published); it's the feedback loop. If your team needs to file a ticket and wait two weeks to ship an A/B test, you don't have an experimentation culture; you have an experimentation backlog. The orgs that win on personalisation are the ones where a PM can scope, ship, and read out a test in a single sprint.

"We don't build for the average user. We build for the user who's about to cancel, and the recommender is what changes their mind."
Paraphrased Netflix product principle

Mentor commentary

If you want to understand modern data science as a business function, Netflix is the single best teaching example on the planet.

Emma Roberts

Data Science Advisor, ex-Deliveroo

Alumni outcome

Ananya Krishnan

Excel Analyst, Mid-size SaaS → Data Analyst, Flipkart

The Netflix case study on experimentation velocity rewired how I think about A/B tests. I built my final dashboard around the same guardrail metric patterns they use internally.
