What are the main A/B test prioritization frameworks?

The four main frameworks are ICE (Impact, Confidence, Ease), PIE (Potential, Importance, Ease), RICE (Reach, Impact, Confidence, Effort), and PXL (CXL’s framework that replaces gut scores with concrete questions). All four reward high expected return and low cost, weighted by genuine confidence.

What is PXL and why use it over ICE?

PXL is CXL’s framework that swaps subjective 1–10 scores for concrete, mostly yes/no questions (above the fold? addresses a research finding? high-traffic page? big change?). This makes scores consistent across people and explicitly rewards research-backed tests, fighting the confidence inflation that makes ICE rankings noisy.

How should I score confidence?

By the strength and convergence of evidence behind the idea — multiple research methods agreeing, a prior related win, or an established principle — not by how much you like it. Confidence is the score that most drives win rate, so it must reflect evidence, not enthusiasm.

How do I handle a senior stakeholder's pet idea?

Use a transparent, evidence-weighted backlog. The score orders the queue, so you can decline a pet idea without declining the person, and anyone can raise an idea’s rank by bringing research. This is the standard defense against HiPPO-driven roadmaps.

How often should the test backlog be re-prioritized?

On a regular cadence (weekly or bi-weekly). Re-score with new research and finished-test learnings, retire ideas a test disproved, and keep the next well-powered test queued so the program never idles.

RGM-204 · CRO & Experimentation · Module 4 of 6

Prioritization frameworks

With more ideas than traffic, prioritization, not creativity, separates high-performing programs from busy ones. This module compares ICE, PIE, RICE, and PXL; shows why confidence must be scored by evidence (and how PXL fights bias); explains how to estimate impact and effort honestly, including traffic cost; and turns the backlog into a transparent, HiPPO-proof, living queue.

What you will learn

01 Why prioritization separates programs
02 The major frameworks
03 PXL: prioritization that fights bias
04 Estimating impact honestly
05 Confidence: research-backed vs gut
06 Estimating effort
07 Prioritization and stakeholder politics
08 Backlog cadence
09 Where prioritization fails
10 Your prioritization checklist

Why prioritization separates programs

With more test ideas than traffic to run them, prioritization — not creativity — is what separates high-performing programs from busy ones. A scoring framework forces every idea to justify its slot by expected impact, confidence, and effort, so your scarce experiment capacity goes to the bets most likely to pay off and teach. Without it, the loudest stakeholder’s pet idea wins, traffic is squandered on low-leverage tweaks, and the program’s results stall while everyone stays busy.

RGM tool · Score your backlog objectively with the PXL Prioritization Scorer (or compare with the ICE and RICE calculators).

The hard constraint is real: most sites can only run a handful of well-powered tests at a time, so every test you run displaces several others. Prioritization is how you make that tradeoff deliberately instead of by politics or recency. It also depersonalizes the backlog: an idea earns its place by score, not by who proposed it, which makes the framework as much a political tool as an analytical one.

The success of your testing program is the sum of two factors: the number of tests you run, and the percentage that win.

Peep Laja, founder of CXL — CXL

The major frameworks

Four frameworks dominate. ICE (Impact, Confidence, Ease) — fast, simple, subjective. PIE (Potential, Importance, Ease) — similar, page-focused. RICE (Reach, Impact, Confidence, Effort) — adds reach, popular in product. PXL (CXL’s framework) — replaces gut scores with concrete, mostly yes/no questions to fight bias. They share a logic: reward high expected return and low cost, weighted by how confident you genuinely are. Pick one, use it consistently, and don’t pretend the numbers are more precise than they are.

ICE — fast and subjective

Score Impact, Confidence, and Ease from 1 to 10 and rank by their product or average. It is quick, but the scores are gut feel, so two people can score the same idea very differently.

THE MOVE · Use ICE for speed and a rough cut; pair it with research so ‘Confidence’ isn’t pure optimism. Try the ICE calculator.
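As a minimal sketch of ICE ranking, the snippet below scores a hypothetical two-idea backlog both ways. The class name, idea names, and scores are invented for illustration. It also shows one design consequence worth knowing: the product punishes a single very low score much harder than the average does, so the two rankings can disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IceIdea:
    name: str
    impact: int      # 1-10, gut-scored
    confidence: int  # 1-10, gut-scored
    ease: int        # 1-10, gut-scored


def ice_product(idea: IceIdea) -> int:
    """ICE as a product: one low score drags the whole idea down."""
    return idea.impact * idea.confidence * idea.ease


def ice_average(idea: IceIdea) -> float:
    """ICE as an average: a low score can be offset by two high ones."""
    return (idea.impact + idea.confidence + idea.ease) / 3


# Hypothetical backlog; the scores are made up for illustration.
backlog = [
    IceIdea("Redesign checkout flow", impact=10, confidence=9, ease=1),
    IceIdea("Rewrite pricing headline", impact=5, confidence=5, ease=5),
]

by_product = sorted(backlog, key=ice_product, reverse=True)
by_average = sorted(backlog, key=ice_average, reverse=True)

print([idea.name for idea in by_product])  # headline first (125 vs 90)
print([idea.name for idea in by_average])  # checkout first (6.67 vs 5.0)
```

Whichever variant you pick, use it for every idea; mixing the two makes scores incomparable.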

PIE — page-focused

Potential (improvement room), Importance (traffic/value), Ease (implementation). Similar to ICE, oriented to which page to optimize.

THE MOVE · Good for choosing which page or template to attack first; still subjective, so anchor it in analytics.

RICE — adds reach

Reach × Impact × Confidence ÷ Effort. The explicit Reach term stops you over-investing in changes few users see.

THE MOVE · Strong when ideas differ a lot in how many users they touch. Use the RICE calculator.
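The RICE formula can be sketched in a few lines. The units below are assumptions, not part of the formula itself: reach as users per month, impact as a relative multiplier, confidence as a fraction between 0 and 1, and effort in person-weeks. The two ideas are hypothetical.

```python
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """Reach x Impact x Confidence / Effort, with assumed units (see lead-in)."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence is expected as a fraction between 0 and 1")
    return reach * impact * confidence / effort


# Hypothetical ideas: same confidence and effort, very different reach.
sitewide_banner = rice_score(reach=20_000, impact=1.0, confidence=0.8, effort=2)
settings_page_redesign = rice_score(reach=1_500, impact=3.0, confidence=0.8, effort=2)

# The low-reach idea loses despite a 3x larger per-user impact estimate.
print(round(sitewide_banner), round(settings_page_redesign))
```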

PXL — bias-resistant

CXL’s framework replaces 1–10 gut scores with concrete, mostly binary questions (is it above the fold? does it address a research-found issue? how big is the change?), making scores more objective and consistent.

THE MOVE · Use PXL when you want scores that hold up across people and tie directly to research — the most defensible of the four.

PXL: prioritization that fights bias

PXL, CXL’s framework, exists because ICE’s 1–10 scores are wishful thinking dressed as math — everyone rates their own idea’s impact a 9. PXL instead asks concrete, mostly yes/no questions: Is the change above the fold? Is it noticeable in five seconds? Is it on a high-traffic page? Does it address an issue found in research? Is it a big change or a tweak? Summing objective answers yields scores that are consistent across people and explicitly reward research-backed, high-visibility, high-traffic tests.
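A PXL-style score can be sketched as a sum over yes/no answers. The question keys below paraphrase the questions listed above, and the weights are placeholders chosen for illustration (with the research question weighted higher to reflect the framework's emphasis); they are not CXL's published weights.

```python
# Placeholder weights, NOT CXL's actual values; tune to your own program.
PXL_WEIGHTS = {
    "above_the_fold": 1,
    "noticeable_in_five_seconds": 1,
    "high_traffic_page": 1,
    "addresses_research_finding": 2,
    "big_change": 1,
}


def pxl_score(answers: dict[str, bool]) -> int:
    """Sum the weights of every question answered 'yes'."""
    unknown = set(answers) - set(PXL_WEIGHTS)
    if unknown:
        raise KeyError(f"unknown questions: {sorted(unknown)}")
    # Unanswered questions count as 'no'.
    return sum(w for q, w in PXL_WEIGHTS.items() if answers.get(q, False))


# Hypothetical ideas.
researched_hero_rewrite = pxl_score({
    "above_the_fold": True,
    "noticeable_in_five_seconds": True,
    "high_traffic_page": True,
    "addresses_research_finding": True,
    "big_change": False,
})
footer_link_tweak = pxl_score({"high_traffic_page": True})

print(researched_hero_rewrite, footer_link_tweak)
```

Because each answer is a checkable fact rather than a 1–10 opinion, two scorers given the same idea should arrive at the same total.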

The deeper point PXL encodes: confidence should come from research, not enthusiasm. Its scoring literally gives more points to ideas backed by user research, analytics, or prior tests — operationalizing the truth that win rate is driven by evidence. If you only adopt one thing from this module, make it ‘score confidence by the strength of the evidence behind the idea, not by how much you like it.’

RGM EXPERT TRICK

Make ‘confidence’ a function of evidence, not enthusiasm

In every gut-scored framework, ‘Confidence’ quietly becomes ‘how much the proposer likes it’ — so every idea scores high and the ranking is meaningless. Confidence inflation is what makes ICE backlogs useless.

I redefine the confidence score by a strict evidence ladder: backed by multiple research sources or a prior winning test = high; one signal = medium; ‘I think’ = low, full stop. The score can’t be argued up with passion, only with evidence.

Suddenly the backlog reorders itself around what we actually have reason to believe — which is exactly the input that drives win rate.

WHY IT’S RARE · Everyone games the confidence score with optimism. Pinning it to a concrete evidence ladder is what turns a prioritization framework from theater into a real filter.
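The evidence ladder described above maps directly onto a small function. This is a sketch under the stated rules only; the function and parameter names are hypothetical.

```python
def confidence_from_evidence(research_sources: int, prior_winning_test: bool = False) -> str:
    """Map evidence to a confidence level; enthusiasm is not an input."""
    if research_sources < 0:
        raise ValueError("research_sources cannot be negative")
    if prior_winning_test or research_sources >= 2:
        return "high"    # multiple research sources or a prior winning test
    if research_sources == 1:
        return "medium"  # one signal
    return "low"         # 'I think', full stop


print(confidence_from_evidence(0))                           # low
print(confidence_from_evidence(1))                           # medium
print(confidence_from_evidence(3))                           # high
print(confidence_from_evidence(0, prior_winning_test=True))  # high
```

The only way to change the output is to change the evidence inputs, which is the point of the ladder.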

Estimating impact honestly

Impact estimates are where prioritization gets fooled. The honest approach ties expected impact to funnel leverage (traffic × drop-off of the step you’re affecting) and to realistic effect sizes (most wins are single-digit percentage lifts, not the doubling in the case studies). Estimating impact in absolute terms — ‘this could add ~X conversions/month’ — is more useful than a 1–10 guess, because it forces you to confront how few users a change actually reaches.

The discipline is humility about effect size. Teams routinely score a button tweak ‘impact 9’ while the math says even a generous lift on that step adds a rounding error. Anchoring impact to reach and to the modest effect sizes real programs see prevents the most common prioritization error: over-rating low-leverage changes because they’re easy to imagine winning big.
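To make the absolute-terms estimate concrete, here is a rough sketch. It assumes a simple model that is not spelled out above: added conversions equal the step's baseline conversions times an assumed relative lift. All traffic figures, rates, and function names are hypothetical.

```python
def lost_at_step(visitors_at_step: float, step_conversion_rate: float) -> float:
    """Funnel leverage: how many visitors drop off at this step each month."""
    if not 0 <= step_conversion_rate <= 1:
        raise ValueError("step_conversion_rate must be between 0 and 1")
    return visitors_at_step * (1 - step_conversion_rate)


def added_conversions_per_month(
    visitors_at_step: float, step_conversion_rate: float, relative_lift: float
) -> float:
    """Baseline conversions at the step multiplied by an assumed relative lift."""
    if not 0 <= step_conversion_rate <= 1:
        raise ValueError("step_conversion_rate must be between 0 and 1")
    return visitors_at_step * step_conversion_rate * relative_lift


# Same assumed 5% relative lift, very different reach (hypothetical numbers).
button_tweak = added_conversions_per_month(2_000, 0.10, 0.05)
checkout_fix = added_conversions_per_month(40_000, 0.30, 0.05)

print(round(button_tweak), round(checkout_fix))  # roughly 10 vs 600 per month
```

Holding the lift constant isolates the effect of reach, which is usually the term a 1–10 impact score hides.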

Confidence: research-backed vs gut

Confidence is the score that should do the most work and usually does the least. A test idea’s confidence should rise with the strength and convergence of evidence behind it: multiple research methods pointing the same way (analytics + session recordings + survey), a prior related win, or a well-established principle beat a lone hunch. Low-confidence ideas aren’t forbidden — they’re just cheaper bets you run when capacity allows, not the ones you stake prime traffic on.

Estimating effort

Effort (or its inverse, ease) keeps prioritization honest about cost: design, engineering, and QA time, plus opportunity cost on shared traffic. A high-impact idea that needs a quarter of engineering time may rank below a medium-impact idea shippable this week, because the second one teaches you something now. Estimate effort with the people who’ll build it, not optimistically, and remember the scarcest resource is often not engineering hours but the traffic the test will consume.

RGM EXPERT TRICK

Price tests in traffic, not just engineering hours

Effort scores almost always mean ‘dev time,’ and ignore the resource that’s usually scarcer: the traffic a test occupies while it runs. A six-week test on your highest-traffic page has an enormous opportunity cost no ‘ease’ score captures.

So I add traffic-cost to the effort side of the ledger: how much of our limited testing capacity, for how long, does this consume? A quick-to-build test that needs a huge sample to reach adequate power is not actually ‘easy.’

Counting traffic as a cost reshuffles the backlog toward tests that are cheap in both senses — fast to build and fast to power.

WHY IT’S RARE · Teams optimize for build effort and forget that running the test spends their rarest asset. Pricing tests in traffic is what stops one slow test from blocking five fast ones.
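One way to put a number on traffic cost is to estimate how many weeks a test needs to reach its sample size. The sketch below uses the standard normal-approximation sample-size formula for comparing two proportions, which is an assumption on my part rather than a method this module prescribes; the default significance level (0.05, two-sided), power (0.80), and traffic figures are likewise illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(
    baseline_rate: float, relative_lift: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate visitors needed per variant to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_power(n_per_variant: int, weekly_visitors: int, variants: int = 2) -> int:
    """Weeks of page traffic the test occupies before it is adequately powered."""
    return ceil(n_per_variant * variants / weekly_visitors)


# Hypothetical page: 5% baseline conversion, 10,000 eligible visitors per week.
n_small_lift = sample_size_per_variant(0.05, 0.10)  # hoping to detect a 10% lift
n_big_lift = sample_size_per_variant(0.05, 0.20)    # hoping to detect a 20% lift

print(weeks_to_power(n_small_lift, 10_000), weeks_to_power(n_big_lift, 10_000))
```

Adding the resulting week count to the effort side of the score makes a quick-to-build but slow-to-power test visibly expensive.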

Prioritization and stakeholder politics

A scoring framework is also a political instrument: it converts ‘the VP wants the homepage hero changed’ into ‘here’s where that idea ranks, and why.’ A transparent, evidence-weighted backlog lets you say no to pet projects without saying no to a person — the score does. It also gives stakeholders a legitimate path in: bring evidence and the idea rises. Used well, prioritization defuses the single biggest threat to a program’s integrity, the HiPPO (Highest-Paid Person’s Opinion).

Claim: HiPPO-driven decisions (shipping the highest-paid person’s opinion without a test) are a documented antipattern in experimentation culture; a transparent prioritization framework is the standard defense.
Source: Kohavi et al. / CXL on experimentation culture.
Context: Let the framework, not the org chart, order the backlog, and give the HiPPO a fair path: bring evidence, climb the ranking.

RGM EXPERT TRICK

Reserve ~20% of capacity for bold, low-confidence swings

A purely score-driven backlog quietly biases toward safe, incremental, high-confidence tests — and those produce small, incremental wins. The big breakthroughs come from bolder ideas that, by definition, score lower on confidence.

So I ring-fence roughly a fifth of testing capacity as an innovation budget for high-uncertainty, high-upside swings that wouldn’t survive the normal ranking. The other 80% runs the disciplined, research-backed queue.

Pure prioritization optimizes you into a local maximum; a deliberate exploration budget is how you occasionally find a higher hill.

WHY IT’S RARE · Everyone’s backlog drifts toward safe incremental tests because they score well. Carving out an explicit budget for bold bets is how programs keep finding step-changes, not just nudges.
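A rough sketch of how the roughly 80/20 split could be applied when filling test slots. The dictionary fields (`score`, `bold`, `upside`), the idea names, and the selection rule for the exploration slots (highest upside among bold ideas that missed the core queue) are all assumptions made for illustration.

```python
def build_queue(ideas: list[dict], total_slots: int, exploration_share: float = 0.2):
    """Fill most slots by score, then ring-fence the rest for bold ideas."""
    explore_slots = round(total_slots * exploration_share)
    core_slots = total_slots - explore_slots

    ranked = sorted(ideas, key=lambda idea: idea["score"], reverse=True)
    core = ranked[:core_slots]

    # Bold ideas that did not make the core queue compete on upside instead.
    leftovers = ranked[core_slots:]
    bold = sorted(
        (idea for idea in leftovers if idea["bold"]),
        key=lambda idea: idea["upside"],
        reverse=True,
    )
    return core, bold[:explore_slots]


# Hypothetical backlog.
ideas = [
    {"name": "A", "score": 90, "bold": False, "upside": 1},
    {"name": "B", "score": 80, "bold": False, "upside": 1},
    {"name": "C", "score": 70, "bold": False, "upside": 2},
    {"name": "D", "score": 60, "bold": False, "upside": 1},
    {"name": "E", "score": 50, "bold": False, "upside": 1},
    {"name": "F", "score": 20, "bold": True, "upside": 9},
    {"name": "G", "score": 15, "bold": True, "upside": 5},
]

core, exploration = build_queue(ideas, total_slots=5)
print([i["name"] for i in core], [i["name"] for i in exploration])
```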

Backlog cadence

Prioritization isn’t a one-time ranking — it’s a living backlog on a cadence. New research generates new hypotheses; finished tests (especially losses) update the scores of related ideas; and reach/effort estimates change as the site does. Run a regular backlog grooming (often weekly or bi-weekly) where you re-score with the latest evidence, retire ideas a test just disproved, and always have the next well-powered test queued so the program never idles waiting for someone to decide.

Which prioritization framework should I use?

Any consistently applied one beats none. Use ICE or PIE for speed, RICE when reach varies a lot, and PXL when you want bias-resistant, research-tied scores. The framework matters less than honestly scoring confidence by evidence and pricing tests in traffic.

How do I stop the HiPPO from hijacking the roadmap?

Use a transparent, evidence-weighted backlog. The score, not the person, orders the queue, and anyone (including the HiPPO) can raise an idea’s rank by bringing research. It lets you decline a pet idea without declining the person.

How often should I re-prioritize?

On a regular cadence (weekly or bi-weekly grooming). Re-score with new research and finished-test learnings, retire disproven ideas, and keep the next powered test queued so the program never idles.

Where prioritization fails

Prioritization fails when confidence is scored by enthusiasm rather than evidence, impact is over-rated on low-reach changes, effort ignores traffic cost, the framework is abandoned the moment a senior stakeholder pushes, scores are treated as precise truth, or the backlog goes stale and stops absorbing new learning. Each failure lets low-leverage or untested ideas consume the traffic that high-leverage tests need.

Confidence by enthusiasm

Everyone scores their idea’s confidence high, so the ranking is noise.

THE MOVE · Tie confidence to a strict evidence ladder; passion can’t raise the score, only evidence can.

Over-rated impact

Low-reach tweaks get ‘impact 9’ that the funnel math contradicts.

THE MOVE · Anchor impact to reach × drop-off and realistic single-digit effect sizes.

Effort ignores traffic

‘Easy to build’ tests can be expensive to power on scarce traffic.

THE MOVE · Add traffic-cost to effort; a quick build that needs a huge sample isn’t easy.

HiPPO override

Abandoning the framework for a senior stakeholder’s pet idea destroys its legitimacy.

THE MOVE · Let the score order the backlog; give stakeholders a path in via evidence.

Stale backlog

A ranking that never updates stops absorbing research and learnings.

THE MOVE · Groom on a cadence: re-score with new evidence, retire disproven ideas, queue the next powered test.

Your prioritization checklist

Prioritization is a repeatable discipline. Tick what is genuinely true of your backlog.

CASE-method test


Ten questions, CASE method (Context · Analysis · Strategy · Execution). Pass at 90% to unlock this module’s completion passcode — retake as many times as you like.