Wills Education

Chapter 01

Why 80% of what you watch is chosen by an algorithm

Netflix has stated publicly that more than 80% of viewing hours come from algorithmic recommendations rather than active search. That single number reframes Netflix as a recommendations company that happens to license content, not the other way around. Every dollar spent on a show that nobody watches is a dollar wasted; every minute spent picking what to watch is a minute the user might spend cancelling instead. The recommender is a customer-retention engine in a way no other Netflix system is.

80% of viewing hours: from recommendations, not search

$1B+ annual value: attributed to the recommender

250+ live A/B tests: running simultaneously

The economic value of getting this right is hard to overstate. Reed Hastings has cited internal estimates of $1B+ in annual subscriber value attributable to the recommender. That number predates Netflix's pivot to ads and live sport, both of which raise the stakes further, because both rely on accurate prediction of viewing intent.

Chapter 02

The two-stage stack: retrieval, then ranking

Modern Netflix recommendations follow the standard two-stage pattern that powers most large-scale recsys: retrieval first, then ranking. Retrieval has to evaluate ~20,000 titles per user in milliseconds; ranking has the luxury of evaluating maybe 200 candidates with much heavier features.

	Retrieval	Ranking
Candidates evaluated	~20,000 titles	~200 titles
Latency budget	< 50ms	< 150ms
Model class	Two-tower embeddings	Multi-objective deep model
Optimised for	Recall (don't miss good stuff)	Precision (rank what's left)
Features used	~50 (light)	~500 (heavy)

Retrieval at Netflix is a classic two-tower deep learning model. One tower embeds the user (their watch history, completion patterns, hover-vs-click signals, time-of-day patterns). The other tower embeds the title (genre, cast, runtime, language, plus collaborative filtering signals from similar users). At serve time, retrieval is just a nearest-neighbour lookup in the embedding space: fast, parallelisable, and cheap.
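To make the serve-time lookup concrete, here is a minimal sketch in plain Python. It assumes both towers have already produced embeddings, and it scans every title by brute force; a production system would use an approximate nearest-neighbour index instead. The title ids, the three-dimensional vectors, and the function names are all hypothetical.

```python
import heapq
import math

def normalise(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def retrieve(user_vec, title_vecs, k):
    """Return the ids of the k titles whose embeddings are closest to the user."""
    u = normalise(user_vec)
    scored = (
        (sum(a * b for a, b in zip(u, normalise(vec))), title_id)
        for title_id, vec in title_vecs.items()
    )
    return [title_id for _, title_id in heapq.nlargest(k, scored)]

# Hypothetical 3-dimensional embeddings; real towers output far more dimensions.
titles = {
    "cooking_show": [0.9, 0.1, 0.0],
    "heist_thriller": [0.0, 0.2, 0.9],
    "rom_com": [0.1, 0.9, 0.1],
    "baking_contest": [0.8, 0.3, 0.0],
}
user = [1.0, 0.2, 0.0]  # stands in for the user tower's output
top = retrieve(user, titles, k=2)
print(top)  # ['cooking_show', 'baking_contest']
```

Because the title vectors do not depend on the user, they can be computed offline and indexed once, which is what keeps this stage cheap.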

Ranking is where things get interesting. Netflix uses a multi-objective ranker that has to optimise for completion (will the user finish this), retention (does watching this make them less likely to cancel next month), and diversity (don't show me four cooking shows in a row even if I'd technically watch them). The trade-offs between these objectives are tuned via online experiments, not chosen by hand.
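One way to picture the trade-off is a weighted score with a diversity penalty applied greedily. The sketch below is an illustration under stated assumptions, not Netflix's ranker: the weights, the penalty, the titles, and the two predicted scores per title are invented, and in the setup described above those weights would be tuned by online experiments rather than hard-coded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    title: str
    genre: str
    p_complete: float  # predicted probability the user finishes the title
    p_retain: float    # predicted retention benefit, scaled to 0..1

def rank(candidates, w_complete=0.6, w_retain=0.4, genre_penalty=0.3):
    """Greedy re-rank: take the best remaining title, discounting repeated genres."""
    remaining = list(candidates)
    genre_counts = {}
    ordered = []
    while remaining:
        def score(c):
            base = w_complete * c.p_complete + w_retain * c.p_retain
            return base - genre_penalty * genre_counts.get(c.genre, 0)
        best = max(remaining, key=score)
        ordered.append(best.title)
        genre_counts[best.genre] = genre_counts.get(best.genre, 0) + 1
        remaining.remove(best)
    return ordered

# Hypothetical candidates that survived retrieval.
slate = rank([
    Candidate("Bake Off", "cooking", 0.90, 0.50),
    Candidate("Chef Wars", "cooking", 0.85, 0.50),
    Candidate("Night Heist", "thriller", 0.70, 0.60),
    Candidate("Knife Skills", "cooking", 0.80, 0.40),
])
print(slate)  # ['Bake Off', 'Night Heist', 'Chef Wars', 'Knife Skills']
```

The thriller jumps to second place even though two cooking shows have a higher raw score, which is the "no four cooking shows in a row" behaviour in miniature.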

Chapter 03

Personalised artwork: Thompson sampling at the thumbnail

The most under-appreciated piece of the Netflix stack is artwork personalisation. The same show ships with 5-10 candidate images. The system picks one per user based on what they've engaged with historically: a romance fan sees the romantic subplot, a comedy fan sees the funny side, a thriller fan sees the suspense.

This is implemented as a multi-armed bandit (specifically, Thompson sampling). Each artwork variant is an arm; the reward is play rate. The bandit is contextual: it conditions on user features, so the same artwork can be optimal for one user and sub-optimal for another. This sounds simple, but the operational complexity is enormous. Who picks the candidate set? How do you stop the bandit from converging too quickly and missing slow-burn artwork? How do you handle a brand-new title with no signal at all?
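A toy version of the idea, assuming a Beta-Bernoulli model and a deliberately crude notion of context: one posterior per (user segment, artwork) pair. A real contextual bandit would condition on richer user features. The segment names, artwork names, and simulated play rates are hypothetical.

```python
import random

class ArtworkBandit:
    """Beta-Bernoulli Thompson sampling with one posterior per (segment, artwork)."""

    def __init__(self, arms, seed=0):
        self.arms = list(arms)
        self.rng = random.Random(seed)
        self.stats = {}  # (segment, arm) -> [plays, non_plays]

    def _counts(self, segment, arm):
        return self.stats.setdefault((segment, arm), [0, 0])

    def choose(self, segment):
        """Sample a play rate from each arm's posterior and show the highest draw."""
        def draw(arm):
            plays, skips = self._counts(segment, arm)
            return self.rng.betavariate(1 + plays, 1 + skips)  # uniform Beta(1, 1) prior
        return max(self.arms, key=draw)

    def update(self, segment, arm, played):
        """Record one impression: index 0 counts plays, index 1 counts non-plays."""
        self._counts(segment, arm)[0 if played else 1] += 1

# Hypothetical ground truth, used only to simulate user behaviour.
TRUE_PLAY_RATE = {
    ("romance_fan", "romantic_still"): 0.30,
    ("romance_fan", "comedy_still"): 0.10,
    ("comedy_fan", "romantic_still"): 0.10,
    ("comedy_fan", "comedy_still"): 0.30,
}

bandit = ArtworkBandit(["romantic_still", "comedy_still"])
env = random.Random(1)
shown = {}
for _ in range(3000):
    for segment in ("romance_fan", "comedy_fan"):
        arm = bandit.choose(segment)
        played = env.random() < TRUE_PLAY_RATE[(segment, arm)]
        bandit.update(segment, arm, played)
        shown[(segment, arm)] = shown.get((segment, arm), 0) + 1
```

Sampling from the posterior, rather than always picking the current best estimate, is what keeps some exploration alive: an arm with few impressions has a wide posterior and still wins the draw occasionally.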

Netflix's answer to the cold-start problem is a separate "creative quality" model that scores candidate images using historical engagement on similar titles before any real-user exposure. New artwork enters the bandit with informed priors instead of uniform priors, saving weeks of low-confidence exploration.
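The effect of an informed prior can be shown with a little Beta arithmetic. In this sketch the quality model's output is treated as a predicted play rate and converted into pseudo-counts; the function names, the 0.2 prediction, and the weight of 50 pseudo-impressions are assumptions chosen for illustration.

```python
def informed_prior(predicted_play_rate, pseudo_impressions):
    """Turn a model's play-rate guess into Beta(alpha, beta) pseudo-counts."""
    alpha = predicted_play_rate * pseudo_impressions
    beta = (1.0 - predicted_play_rate) * pseudo_impressions
    return alpha, beta

def posterior_mean(prior, plays, impressions):
    """Mean of the Beta posterior after observing plays out of impressions."""
    alpha, beta = prior
    return (alpha + plays) / (alpha + beta + impressions)

uniform = (1.0, 1.0)
informed = informed_prior(predicted_play_rate=0.2, pseudo_impressions=50)  # Beta(10, 40)

# A lucky start: 3 plays in the first 5 impressions.
uniform_estimate = posterior_mean(uniform, plays=3, impressions=5)    # 4/7, about 0.571
informed_estimate = posterior_mean(informed, plays=3, impressions=5)  # 13/55, about 0.236
```

With a uniform prior, five impressions are enough to swing the estimate wildly; with the informed prior the estimate stays near the model's guess until real evidence accumulates. The pseudo-impression count controls how much real traffic it takes to override the model.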

Chapter 04

The experimentation discipline: 250+ live A/B tests at once

Most companies run A/B tests. Very few run them with the discipline Netflix has built. Every meaningful change to the product (UI, recommendations, even artwork variants) ships through the experimentation platform. Netflix runs 250+ live tests at any given time, with overlap matrices that allow many changes to be evaluated simultaneously without statistical interference.
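The article does not describe how the overlap matrices work. One common way to run many tests on the same users is layered, salted hashing, sketched below as an assumption rather than a description of Netflix's platform. The layer names and the 50/50 split are hypothetical.

```python
import hashlib

def bucket(user_id, salt, buckets=100):
    """Deterministically map a user to a bucket; a different salt reshuffles everyone."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def assign(user_id, layer, treatment_pct=50):
    """Assign the user to control or treatment within one experiment layer."""
    return "treatment" if bucket(user_id, layer) < treatment_pct else "control"

# Two hypothetical layers running at once on the same users.
users = [f"user-{i}" for i in range(10_000)]
cells = {}
for u in users:
    key = (assign(u, "ranker_v2"), assign(u, "artwork_test"))
    cells[key] = cells.get(key, 0) + 1
```

Because each layer uses its own salt, assignment in one test tells you nothing about assignment in another, so the four combinations end up roughly equal in size and each test's treatment group contains a balanced mix of the other test's arms.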

How an A/B idea actually dies at Netflix

Of every 100 ideas that enter the pipeline, only ~7 reach a full launch. Guardrail metrics kill more wins than the primary metric ever does.

Two operational details are worth copying. First, the team has invested heavily in guardrail metrics: secondary metrics that flag when a winning A/B test has a hidden cost (a recommender change that boosts engagement but hurts long-term retention, for example). The guardrails kill more ideas than the primary metric ever does.
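A guardrail check can be as simple as a veto rule layered on top of the primary metric. The sketch below is a simplified assumption: it compares point estimates against fixed tolerances, whereas a real platform would work with confidence intervals. The metric names and thresholds are hypothetical.

```python
def launch_decision(primary_lift, guardrail_deltas, tolerances):
    """Ship only if the primary metric improves and no guardrail regresses past its tolerance.

    Deltas are oriented so that negative always means "worse".
    """
    breached = sorted(
        name for name, delta in guardrail_deltas.items()
        if delta < -tolerances[name]
    )
    if primary_lift <= 0:
        return "reject: no primary win", breached
    if breached:
        return "reject: guardrail breached", breached
    return "ship", breached

# A test that wins on engagement but quietly hurts retention.
decision, breached = launch_decision(
    primary_lift=0.02,
    guardrail_deltas={"retention_30d": -0.005, "playback_quality": 0.0},
    tolerances={"retention_30d": 0.001, "playback_quality": 0.002},
)
print(decision, breached)  # reject: guardrail breached ['retention_30d']
```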

Second, Netflix runs hold-out groups to measure causal impact across systems. A small slice of users is held out of every new feature for a full quarter; the difference between this group and the main population is the team's read on whether all the experimentation in aggregate is actually moving the business. This is rare in industry: most A/B platforms only measure the immediate impact of each test in isolation.
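A hold-out readout reduces to two pieces: stable membership for the quarter, and a comparison of the two populations at the end. The sketch below assumes a 5% hold-out and a normal-approximation interval; the percentage, the quarter label, and the retention counts are invented for illustration.

```python
import hashlib
import math

def in_holdout(user_id, quarter, holdout_pct=5):
    """Deterministic membership: stable all quarter, reshuffled when the quarter changes."""
    digest = hashlib.sha256(f"holdout:{quarter}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < holdout_pct

def retention_lift(retained_main, n_main, retained_hold, n_hold):
    """Difference in retention rates with a normal-approximation 95% interval."""
    p_main, p_hold = retained_main / n_main, retained_hold / n_hold
    se = math.sqrt(p_main * (1 - p_main) / n_main + p_hold * (1 - p_hold) / n_hold)
    diff = p_main - p_hold
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# Invented end-of-quarter counts: 96% retention with all features, 94% in the hold-out.
diff, (low, high) = retention_lift(91_200, 95_000, 4_700, 5_000)
holdout_share = sum(in_holdout(f"user-{i}", "Q1") for i in range(10_000)) / 10_000
```

If the interval excludes zero, the quarter's launches moved retention in aggregate. The hold-out is small, so the interval is wide; that is the price of exposing as few users as possible to the older experience.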

Chapter 05

What you can copy from this playbook

Two ideas translate directly to almost any data product. First, structure your recommender as retrieval-then-ranking from day one, even if the catalogue is small enough that you could brute-force it. The pattern forces you to separate "can the model see this candidate at all" from "how should we rank what we can see," and that separation is what lets you scale later.

Second, treat your experimentation platform as a product, not a tool. Netflix's edge is not the algorithms (many of which they've published); it's the feedback loop. If your team needs to file a ticket and wait two weeks to ship an A/B test, you don't have an experimentation culture; you have an experimentation backlog. The orgs that win on personalisation are the ones where a PM can scope, ship, and read out a test in a single sprint.

“We don't build for the average user. We build for the user who's about to cancel, and the recommender is what changes their mind.”

Paraphrased Netflix product principle

Chapter 06

The Multi-Objective Future: Live Sports & Ad Matching

With Netflix's strategic pivot into live broadcasting (such as NFL Christmas Day games and WWE Raw) and ad-supported subscription tiers, the recommender must solve new real-time challenges. Recommending live events that have no playback history requires leaning on real-time cohort graphs and rapid search-interest signals from users.

	Traditional Ad Stitching	Algorithmic Ad Personalization
Targeting logic	Static broadcast blocks	Real-time viewer profile embeddings
Latency constraints	None (pre-scheduled)	Sub-50ms during live streams
Completion rates	Moderate	High due to context relevance

Netflix's recommender: the $1B algorithm hiding in plain sight.
