Chapter 01
Why 80% of what you watch is chosen by an algorithm
Netflix has stated publicly that more than 80% of viewing hours come from algorithmic recommendations rather than active search. That single number reframes Netflix as a recommendations company that happens to license content, not the other way around. Every dollar spent on a show that nobody watches is a dollar wasted; every minute spent picking what to watch is a minute the user might spend cancelling instead. The recommender is a customer-retention engine in a way no other Netflix system is.
80% of viewing hours come from recommendations, not search
$1B+ in annual value attributed to the recommender
250+ live A/B tests running simultaneously
The economic value of getting this right is hard to overstate. Netflix executives have cited internal estimates of $1B+ in annual subscriber value attributable to the recommender. That number predates Netflix's pivot to ads and live sport, both of which raise the stakes further, because both rely on accurate prediction of viewing intent.
Chapter 02
The two-stage stack: retrieval, then ranking
Modern Netflix recommendations follow the standard two-stage pattern that powers most large-scale recommender systems: retrieval first, then ranking. Retrieval has to evaluate ~20,000 titles per user in milliseconds; ranking has the luxury of evaluating roughly 200 candidates with much heavier features.
| | Retrieval | Ranking |
|---|---|---|
| Candidates evaluated | ~20,000 titles | ~200 titles |
| Latency budget | < 50ms | < 150ms |
| Model class | Two-tower embeddings | Multi-objective deep model |
| Optimised for | Recall (don't miss good stuff) | Precision (rank what's left) |
| Features used | ~50 (light) | ~500 (heavy) |
Retrieval at Netflix is a classic two-tower deep learning model. One tower embeds the user (their watch history, completion patterns, hover-vs-click signals, time-of-day patterns). The other tower embeds the title (genre, cast, runtime, language, but also collaborative filtering signals from similar users). At serve time, retrieval is just a nearest-neighbour lookup in the embedding space, which makes it fast, parallelisable, and cheap.
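To make the serve-time step concrete, here is a minimal sketch of retrieval as a nearest-neighbour lookup. It assumes the two towers have already produced embeddings offline; the title names, the 3-dimensional vectors, and the `retrieve` helper are all hypothetical, and the brute-force scan stands in for the approximate index a real system would likely use.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_vec, title_vecs, k):
    """Return the k title ids whose embeddings score highest against user_vec.

    Brute-force scan for illustration only; at catalogue scale this would
    normally be replaced by an approximate nearest-neighbour index.
    """
    scored = ((dot(user_vec, vec), title_id) for title_id, vec in title_vecs.items())
    return [title_id for _, title_id in heapq.nlargest(k, scored)]

# Hypothetical embeddings, as if exported from the title tower.
title_vecs = {
    "baking_show": [0.9, 0.1, 0.0],
    "heist_thriller": [0.1, 0.9, 0.2],
    "crime_doc": [0.0, 0.7, 0.6],
    "romcom": [0.5, 0.0, 0.1],
}
user_vec = [0.1, 0.8, 0.5]  # hypothetical user who leans towards thrillers and crime

candidates = retrieve(user_vec, title_vecs, k=2)
print(candidates)  # ['crime_doc', 'heist_thriller']
```

The point of the design is that all the expensive work (training both towers, embedding every title) happens offline, so the online cost is one user embedding plus a similarity search.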
Ranking is where things get interesting. Netflix uses a multi-objective ranker that has to optimise for completion (will the user finish this), retention (does watching this make them less likely to cancel next month), and diversity (don't show me four cooking shows in a row even if I'd technically watch them). The trade-offs between these objectives are tuned via online experiments, not chosen by hand.
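One common way to combine objectives like these is a weighted score plus a greedy diversity penalty. The sketch below is an illustration under assumptions, not Netflix's ranker: the field names (`p_complete`, `retention_lift`), the weights, and the penalty value are all hypothetical, and in practice the weights would come out of online experiments rather than being typed in.

```python
def blended_score(candidate, weights):
    """Weighted sum of the per-objective model predictions."""
    return (weights["p_complete"] * candidate["p_complete"]
            + weights["retention_lift"] * candidate["retention_lift"])

def rank(candidates, weights, diversity_penalty):
    """Greedy re-rank: repeatedly pick the best remaining title, discounting
    each candidate by diversity_penalty per already-picked title of its genre."""
    remaining = list(candidates)
    ranked, genre_counts = [], {}
    while remaining:
        best = max(
            remaining,
            key=lambda c: blended_score(c, weights)
            - diversity_penalty * genre_counts.get(c["genre"], 0),
        )
        remaining.remove(best)
        ranked.append(best["title"])
        genre_counts[best["genre"]] = genre_counts.get(best["genre"], 0) + 1
    return ranked

# Hypothetical model outputs for four retrieved candidates.
candidates = [
    {"title": "cook_a", "genre": "cooking", "p_complete": 0.90, "retention_lift": 0.2},
    {"title": "cook_b", "genre": "cooking", "p_complete": 0.85, "retention_lift": 0.2},
    {"title": "thriller", "genre": "thriller", "p_complete": 0.70, "retention_lift": 0.3},
    {"title": "doc", "genre": "documentary", "p_complete": 0.60, "retention_lift": 0.5},
]
weights = {"p_complete": 0.6, "retention_lift": 0.4}

order = rank(candidates, weights, diversity_penalty=0.2)
print(order)  # ['cook_a', 'doc', 'thriller', 'cook_b']
```

With the penalty set to zero, the two cooking shows would take the top two slots; the penalty pushes the second one to the bottom even though its raw score is higher than the documentary's.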
Chapter 03
Personalised artwork: Thompson sampling at the thumbnail
The most under-appreciated piece of the Netflix stack is artwork personalisation. The same show ships with 5 to 10 candidate images. The system picks one per user based on what they've engaged with historically: a romance fan sees the romantic subplot, a comedy fan sees the funny side, a thriller fan sees the suspense.
This is implemented as a multi-armed bandit (specifically Thompson sampling). Each artwork variant is an arm; the reward is play rate. The bandit is contextual: it conditions on user features, so the same artwork can be optimal for one user and sub-optimal for another. This sounds simple, but the operational complexity is enormous. Who picks the candidate set? How do you stop the bandit from converging too quickly and missing slow-burn artwork? How do you handle a brand-new title with no signal at all?
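As a rough illustration of the mechanism, here is a Beta-Bernoulli Thompson sampler that keeps separate arm statistics per user segment. That is the simplest possible form of "contextual"; a production system would more likely condition on a learned feature vector. The class name, segment labels, artwork names, and simulated play rates are all hypothetical.

```python
import random

class ArtworkBandit:
    """Beta-Bernoulli Thompson sampling with one set of arms per user segment."""

    def __init__(self, artworks, rng=None):
        self.artworks = list(artworks)
        self.rng = rng or random.Random()
        self.stats = {}  # (segment, artwork) -> [alpha, beta] of the Beta posterior

    def _params(self, segment, artwork):
        # Beta(1, 1) is a uniform prior: no opinion before any impressions.
        return self.stats.setdefault((segment, artwork), [1.0, 1.0])

    def choose(self, segment):
        """Sample a plausible play rate for each arm and show the highest draw."""
        def draw(artwork):
            alpha, beta = self._params(segment, artwork)
            return self.rng.betavariate(alpha, beta)
        return max(self.artworks, key=draw)

    def update(self, segment, artwork, played):
        """Record one impression: a play increments alpha, a skip increments beta."""
        params = self._params(segment, artwork)
        params[0 if played else 1] += 1.0

# Simulated ground truth (assumed for the demo): each segment prefers a different image.
true_play_rate = {
    "romance_fan": {"couple": 0.30, "explosion": 0.05},
    "thriller_fan": {"couple": 0.05, "explosion": 0.30},
}
rng = random.Random(7)
bandit = ArtworkBandit(["couple", "explosion"], rng=rng)
shown = {segment: {"couple": 0, "explosion": 0} for segment in true_play_rate}

for _ in range(2000):
    for segment, rates in true_play_rate.items():
        artwork = bandit.choose(segment)
        played = rng.random() < rates[artwork]
        bandit.update(segment, artwork, played)
        shown[segment][artwork] += 1

most_shown = {seg: max(counts, key=counts.get) for seg, counts in shown.items()}
print(most_shown)
```

Because each choice is a random draw from the posterior rather than a greedy pick, an arm with few impressions still gets shown occasionally, which is what keeps the bandit from locking in too early.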
Netflix's answer to the cold-start problem is a separate "creative quality" model that scores candidate images using historical engagement on similar titles before any real-user exposure. New artwork enters the bandit with informed priors instead of uniform priors, saving weeks of low-confidence exploration.
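A sketch of what "informed priors" can mean in a Beta-Bernoulli setting: convert the quality model's predicted play rate into pseudo-observations. The `informed_prior` helper, the predicted rate of 0.2, and the weight of 50 pseudo-impressions are assumptions for illustration; how Netflix actually maps model scores to priors is not described in this article.

```python
def informed_prior(predicted_play_rate, pseudo_impressions):
    """Turn a predicted play rate into Beta(alpha, beta) parameters.

    pseudo_impressions controls how strongly the prior resists early real data:
    it acts like that many impressions observed before launch.
    """
    if not 0.0 < predicted_play_rate < 1.0:
        raise ValueError("predicted_play_rate must be strictly between 0 and 1")
    alpha = predicted_play_rate * pseudo_impressions
    beta = (1.0 - predicted_play_rate) * pseudo_impressions
    return alpha, beta

def posterior_mean(alpha, beta, plays, impressions):
    """Expected play rate after observing real impressions on top of the prior."""
    return (alpha + plays) / (alpha + beta + impressions)

# Hypothetical new image: the quality model predicts a 20% play rate.
alpha, beta = informed_prior(0.2, pseudo_impressions=50)

# Suppose the first 10 real impressions produce no plays at all.
informed = posterior_mean(alpha, beta, plays=0, impressions=10)
uniform = posterior_mean(1.0, 1.0, plays=0, impressions=10)
print(round(informed, 3), round(uniform, 3))  # 0.167 0.083
```

After an unlucky first ten impressions, the informed arm still looks like a 17% performer, while the uniform-prior arm has already dropped to about 8% and would be shown far less often.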
Chapter 04
The experimentation discipline: 250+ live A/B tests at once
Most companies run A/B tests. Very few run them with the discipline Netflix has built. Every meaningful change to the product (UI, recommendations, even artwork variants) ships through their experimentation platform. They run 250+ live tests at any given time, with overlap matrices that allow many changes to be evaluated simultaneously without statistical interference.
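One widely used building block for running many tests on the same users is deterministic, per-experiment hashing, so that a user's arm in one test tells you nothing about their arm in another. The sketch below shows that generic technique; it is an assumption about how overlap can be handled, not a description of Netflix's platform, and the experiment names are made up.

```python
import hashlib
from collections import Counter

def assign(user_id, experiment_id, variants=("control", "treatment")):
    """Deterministically map a user to a variant, salted by the experiment id."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# Cross-tabulate two hypothetical concurrent experiments over 10,000 users.
users = range(10_000)
cells = Counter(
    (assign(u, "new_row_layout"), assign(u, "ranker_v2")) for u in users
)
shares = {cell: count / len(users) for cell, count in cells.items()}
for cell in sorted(shares):
    print(cell, round(shares[cell], 3))
```

If the assignments are independent, each of the four cells holds roughly a quarter of users, so the effect of one test averages out evenly across both arms of the other. Tests that touch the same surface may still interact, which is a separate problem that hashing alone does not solve.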
Figure: How an A/B idea actually dies at Netflix. Of every 100 ideas that enter the pipeline, only ~7 reach a full launch; guardrail metrics kill more wins than the primary metric ever does.
Two operational details are worth copying. First, the team has invested heavily in guardrail metrics: secondary metrics that flag when a winning A/B test has a hidden cost (a recommender change that boosts engagement but hurts long-term retention, for example). The guardrails kill more ideas than the primary metric ever does.
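The guardrail idea reduces to a simple launch rule: a primary-metric win is necessary but not sufficient. Here is a minimal sketch with hypothetical metric names and thresholds; a real platform would compare confidence intervals rather than point estimates.

```python
def launch_decision(primary_lift, guardrail_deltas, max_regression):
    """Ship only if the primary metric improves and no guardrail regresses too far.

    guardrail_deltas: metric name -> relative change (negative means worse)
    max_regression:   metric name -> most negative change that is tolerated
    """
    if primary_lift <= 0:
        return "reject: no primary win"
    breached = sorted(
        name for name, delta in guardrail_deltas.items()
        if delta < max_regression[name]
    )
    if breached:
        return "reject: guardrail breached (" + ", ".join(breached) + ")"
    return "launch"

# Hypothetical readout: engagement is up 3%, but 30-day retention dropped 1.2%.
decision = launch_decision(
    primary_lift=0.03,
    guardrail_deltas={"retention_30d": -0.012, "playback_start_success": -0.001},
    max_regression={"retention_30d": -0.005, "playback_start_success": -0.005},
)
print(decision)  # reject: guardrail breached (retention_30d)
```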
Second, Netflix runs hold-out groups for measuring causal impact across systems. A small slice of users is held out of every new feature for a full quarter; the difference between this group and the main population is the team's read on whether all the experimentation in aggregate is actually moving the business. This is rare in industry: most A/B platforms only measure the immediate impact of each test in isolation.
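The hold-out readout is, at its simplest, a difference between two retention rates with an uncertainty band. The sketch below uses a normal-approximation 95% interval; the group sizes and retention counts are invented for illustration, and a real analysis would also need to check that the two groups were comparable at the start of the quarter.

```python
import math

def holdout_effect(exposed_retained, exposed_total, holdout_retained, holdout_total):
    """Difference in retention rate (exposed minus hold-out) with a 95% interval."""
    p_exposed = exposed_retained / exposed_total
    p_holdout = holdout_retained / holdout_total
    diff = p_exposed - p_holdout
    std_err = math.sqrt(
        p_exposed * (1 - p_exposed) / exposed_total
        + p_holdout * (1 - p_holdout) / holdout_total
    )
    margin = 1.96 * std_err  # normal approximation, two-sided 95%
    return diff, (diff - margin, diff + margin)

# Hypothetical quarter: 93.0% retention with all launches, 92.0% in the hold-out.
diff, (low, high) = holdout_effect(930_000, 1_000_000, 9_200, 10_000)
print(f"lift = {diff:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

Note how the small hold-out group dominates the width of the interval. That is the trade-off in sizing it: every user held out is a user not getting the improvements, but too few and the aggregate read becomes too noisy to act on.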
Chapter 05
What you can copy from this playbook
Two ideas translate directly to almost any data product. First, structure your recommender as retrieval-then-ranking from day one, even if the catalogue is small enough that you could brute-force it. The pattern forces you to separate "can the model see this candidate at all" from "how should we rank what we can see," and that separation is what lets you scale later.
Second, treat your experimentation platform as a product, not a tool. Netflix's edge is not the algorithms (many of which they've published); it's the feedback loop. If your team needs to file a ticket and wait two weeks to ship an A/B test, you don't have an experimentation culture; you have an experimentation backlog. The orgs that win on personalisation are the ones where a PM can scope, ship, and read out a test in a single sprint.
“We don't build for the average user. We build for the user who's about to cancel, and the recommender is what changes their mind.”