A Misleading A/B Test: A Case Study
An experimentation case study showing why a statistically significant conversion lift was not enough to ship globally.
The test won.
The decision was still no.
The treatment increased conversion from 14.80% to 15.56%, a +5.1% relative uplift. A simple read would have shipped Variant B, but deeper analysis surfaced baseline imbalance, time decay, segment-level harm, failed guardrails, and lower-quality conversions.
CUPED reduced the lift
Raw uplift was +5.1%, but CUPED-adjusted uplift fell to +3.4%.
The effect faded after 48 hours
The result was carried by days 1 and 2, then became unreliable by day 10.
Returning users were harmed
Bayesian analysis showed a 99.3% probability that treatment hurt returning users.
Some segment wins were false positives
14 segment cuts were tested. 9 looked significant before FDR, but only 5 survived correction.
Guardrails failed
Refund rate increased and revenue per user fell despite higher conversion.
Global rollout added little upside
Global rollout added only $1,500 more net revenue than restricted rollout, with $33,700 more refund cost.
Four layers of analysis, applied in sequence.
Topline experiment read
Calculated conversion rate, absolute uplift, relative uplift, z-test, p-value, and 95% confidence interval.
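The topline read can be sketched as a two-proportion z-test. Only the 14.80% and 15.56% rates and the +5.1% relative uplift come from the case study; the 50,000-per-arm sample sizes below are assumed for illustration.

```python
from statistics import NormalDist
from math import sqrt

def topline_read(conv_c, n_c, conv_t, n_t):
    """Two-proportion z-test with a 95% CI on the absolute uplift."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    abs_uplift = p_t - p_c
    rel_uplift = abs_uplift / p_c
    # Pooled proportion for the test statistic
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = abs_uplift / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    ci = (abs_uplift - 1.96 * se, abs_uplift + 1.96 * se)
    return abs_uplift, rel_uplift, z, p_value, ci

# Assumed sample sizes chosen so the rates match 14.80% vs 15.56%.
abs_u, rel_u, z, p, ci = topline_read(7400, 50_000, 7780, 50_000)
```

Under these assumed sample sizes the lift is statistically significant, which is exactly the "simple read" that would have shipped Variant B.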
CUPED adjustment
Controlled for pre-experiment activity to reduce bias and estimate a cleaner treatment effect.
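A minimal CUPED sketch on synthetic user-level data. The real covariate and counts are not in this summary; the `simulate` helper, the 0.05 slope, and the 0.1 covariate imbalance are illustrative assumptions that reproduce the failure mode (a raw lift inflated by pre-experiment imbalance).

```python
import random

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), theta = cov(x, y) / var(x)."""
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

random.seed(0)

def simulate(n, x_shift, true_effect):
    # Pre-experiment activity x drives conversion propensity (slope assumed).
    xs = [random.gauss(1.0 + x_shift, 0.5) for _ in range(n)]
    ys = [1.0 if random.random() < 0.10 + 0.05 * xi + true_effect else 0.0
          for xi in xs]
    return xs, ys

x_c, y_c = simulate(20_000, 0.0, 0.00)   # control
x_t, y_t = simulate(20_000, 0.1, 0.01)   # treatment, imbalanced covariate
y_adj = cuped_adjust(y_c + y_t, x_c + x_t)

raw_lift = sum(y_t) / len(y_t) - sum(y_c) / len(y_c)
adj_lift = sum(y_adj[20_000:]) / 20_000 - sum(y_adj[:20_000]) / 20_000
# The covariate imbalance inflates raw_lift; CUPED pulls the estimate back
# toward the true treatment effect.
```

The design point: because the adjustment subtracts theta times the covariate gap between arms, any arm that started with more-active users has its raw lift deflated accordingly.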
Segment and FDR analysis
Tested user, browser, country, device, and traffic segments, then applied Benjamini-Hochberg correction.
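The Benjamini-Hochberg step can be sketched as follows. The 14 p-values below are illustrative placeholders (not the repo's actual segment p-values), constructed so that 9 clear 0.05 naively but only 5 survive FDR control, mirroring the counts above.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# 14 illustrative segment p-values.
p_vals = [0.0001, 0.030, 0.001, 0.20, 0.002, 0.035, 0.005,
          0.30, 0.010, 0.040, 0.50, 0.045, 0.70, 0.90]
naive_hits = [i for i, p in enumerate(p_vals) if p < 0.05]  # 9 "wins"
fdr_hits = benjamini_hochberg(p_vals)                       # 5 survive
```

The step-up rule rejects everything up to the largest rank r with p(r) <= (r/m) * alpha, which is what separates the 5 robust segment wins from the 4 likely false positives.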
Guardrails and Bayesian risk
Checked refund rate, revenue per user, revenue per converter, posterior risk, and rollout impact.
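The returning-user risk check can be sketched with independent Beta-Binomial posteriors and Monte Carlo sampling. The counts below are hypothetical; the source reports only the 99.3% posterior probability of harm.

```python
import random

def prob_treatment_worse(conv_c, n_c, conv_t, n_t, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_treatment < p_control) under
    independent Beta(1 + successes, 1 + failures) posteriors."""
    rng = random.Random(seed)
    worse = 0
    for _ in range(draws):
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        worse += p_t < p_c
    return worse / draws

# Hypothetical returning-user counts: 16.0% control vs 14.8% treatment.
risk = prob_treatment_worse(1600, 10_000, 1480, 10_000)
# A posterior risk this high is a strong argument against exposing
# returning users, regardless of the topline win.
```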
What changed the read.
Pre-experiment imbalance inflated the raw relative uplift by 1.7 percentage points (+5.1% raw vs +3.4% CUPED-adjusted).
Effect driven by days 1 and 2. Signal degraded by day 10.
99.3% posterior probability that treatment hurt returning users.
Higher conversion did not translate to higher-quality conversions.
Do not ship globally.
Variant B created more conversions, but lower-quality conversions. Refunds rose, revenue per converter fell, and the returning-user risk was too high for a global rollout.
Most of the upside, much less risk.
Restricted rollout keeps most of the upside while reducing refund cost and avoiding a worse experience for returning users.
Run a cleaner experiment before scaling.
Run for 4 full weeks
Short runs favor early-adopter effects and inflate uplift.
Use CUPED-adjusted uplift as success metric
Raw lift is vulnerable to pre-experiment imbalance.
Set the threshold at +2% adjusted uplift
The current +3.4% CUPED uplift clears the bar, but barely.
Add refund rate as a decision metric
Quality guardrails should be checked before a conversion lift is trusted.
Keep a holdout for returning users
Returning users showed harm. Monitor them separately before expanding.
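The "4 full weeks" and "+2% adjusted uplift" recommendations imply a concrete sample-size requirement. A sketch, assuming the 14.8% control baseline plus conventional 80% power and alpha = 0.05 (none of the sizing inputs beyond the baseline come from the source):

```python
from statistics import NormalDist
from math import ceil

def required_n_per_arm(p_base, rel_mde, alpha=0.05, power=0.80):
    """Two-proportion sample size to detect a relative lift of rel_mde."""
    p_alt = p_base * (1 + rel_mde)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil((z_a + z_b) ** 2 * var / (p_base - p_alt) ** 2)

# Detecting a +2% relative lift on a 14.8% baseline needs roughly
# 228,000 users per arm under these assumptions.
n = required_n_per_arm(0.148, 0.02)
```

A small minimum detectable effect on a modest baseline requires a large sample, which is the quantitative reason a longer, fixed-duration run is needed before scaling.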
What I owned.
I structured the experiment analysis, checked topline significance, applied CUPED adjustment, analyzed segment heterogeneity, corrected for multiple testing, reviewed Bayesian risk, evaluated guardrail metrics, and translated the result into a rollout decision.
Explore the rest of the work.
The repository includes topline analysis, CUPED adjustment, segment testing with FDR correction, Bayesian risk evaluation, guardrail checks, and rollout impact modeling.