Case Study · Experimentation

A/B Test
Misleading
Case Study

An experimentation case study showing why a statistically significant conversion lift was not enough to ship globally.

Focus
Experimentation · Guardrail metrics · Decision risk
Stack
Python · SQL · CUPED · Bayesian analysis
Type
Portfolio case study · synthetic experiment data
Experiment overview
Control
14.80%
Treatment
15.56%
+5.1%
Raw uplift
+3.4%
CUPED-adjusted
⚠ Guardrail failed
Refund ↑ · RPU ↓
Segment split
New ↑ · Returning ↓
+5.1%
Topline conversion lift
+3.4%
CUPED-adjusted uplift
99.3%
Probability treatment hurts returning users
Do not ship
Global rollout decision
The problem

The test won.
The decision was still no.

The treatment increased conversion from 14.80% to 15.56%, a +5.1% relative uplift. A simple read would have shipped Variant B. But the deeper analysis surfaced baseline imbalance, time decay, segment harm, failed guardrails, and lower-quality conversions.

What looked wrong
Topline: Conversion lift looked statistically significant
Quality: Refunds rose and revenue per user fell
Segments: New users won, returning users lost
Decision: Global rollout created unnecessary risk
Findings
Six checks changed the decision.
01

CUPED reduced the lift

Raw uplift was +5.1%, but CUPED-adjusted uplift fell to +3.4%.

02

The effect faded after 48 hours

The result was carried by days 1 and 2, then became unreliable by day 10.

03

Returning users were harmed

Bayesian analysis showed a 99.3% probability that treatment hurt returning users.

04

Some segment wins were false positives

14 segment cuts were tested. 9 looked significant before FDR, but only 5 survived correction.

05

Guardrails failed

Refund rate increased and revenue per user fell despite higher conversion.

06

Global rollout added little upside

Global rollout added only $1,500 more net revenue than the restricted rollout, with $33,700 more in refund cost.

Method
Testing beyond the headline.

Four layers of analysis, applied in sequence.

I

Topline experiment read

Calculated conversion rate, absolute uplift, relative uplift, z-test, p-value, and 95% confidence interval.
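As a sketch, the topline read can be reproduced with a pooled two-proportion z-test. The 50,000-per-arm sample sizes below are hypothetical, chosen only so the rates match the case study's 14.80% and 15.56%:

```python
import math

def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Pooled two-proportion z-test plus a 95% CI for the absolute uplift."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pool
    # Two-sided p-value via the normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # CI for the absolute uplift uses the unpooled standard error
    se_diff = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    ci = (p_t - p_c - 1.96 * se_diff, p_t - p_c + 1.96 * se_diff)
    return z, p_value, ci

# Hypothetical counts: 7,400/50,000 = 14.80% vs 7,780/50,000 = 15.56%
z, p, ci = two_proportion_ztest(7_400, 50_000, 7_780, 50_000)
```

At these (assumed) sample sizes the test comes back clearly significant, which is exactly why the topline read alone was misleading.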

II

CUPED adjustment

Controlled for pre-experiment activity to reduce bias and estimate a cleaner treatment effect.
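A minimal CUPED sketch on synthetic data (the covariate and outcome here are illustrative, not the case-study data): the adjustment subtracts the component of the in-experiment outcome explained by pre-experiment activity.

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: remove the part of outcome y explained by a
    pre-experiment covariate x, with theta = cov(x, y) / var(x)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Synthetic check: the adjusted outcome keeps the same mean but has
# lower variance, so the uplift estimate is less noisy and less
# sensitive to baseline imbalance between arms.
rng = np.random.default_rng(7)
pre = rng.normal(10, 2, 5_000)               # pre-experiment activity
post = 0.6 * pre + rng.normal(0, 1, 5_000)   # correlated outcome
adjusted = cuped_adjust(post, pre)
```

Because the mean is unchanged, CUPED corrects the *estimate* of the treatment effect without redefining the metric itself.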

III

Segment and FDR analysis

Tested user, browser, country, device, and traffic segments, then applied Benjamini-Hochberg correction.
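The correction step can be sketched as follows. The 14 p-values are illustrative stand-ins, chosen so that 9 fall below 0.05 raw but only 5 survive Benjamini-Hochberg, mirroring the segment finding:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag which p-values survive Benjamini-Hochberg FDR control at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= k * alpha / m
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Everything ranked at or below k_max survives
    survive = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            survive[i] = True
    return survive

# Illustrative p-values for 14 segment cuts (not the real ones)
p_vals = [0.001, 0.003, 0.006, 0.010, 0.017,
          0.030, 0.035, 0.040, 0.045,
          0.10, 0.20, 0.30, 0.40, 0.50]
kept = benjamini_hochberg(p_vals)
```

The gap between the raw count and the surviving count is the false-positive exposure that uncorrected segment fishing creates.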

IV

Guardrails and Bayesian risk

Checked refund rate, revenue per user, revenue per converter, posterior risk, and rollout impact.
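The posterior-risk check can be sketched with a Beta-Binomial model. The returning-user counts below are hypothetical, sized only to produce a deficit of similar magnitude to the case study's:

```python
import numpy as np

def prob_treatment_worse(conv_c, n_c, conv_t, n_t, draws=200_000, seed=0):
    """Posterior P(treatment rate < control rate) under Beta(1, 1) priors,
    estimated by Monte Carlo draws from the two posteriors."""
    rng = np.random.default_rng(seed)
    post_c = rng.beta(1 + conv_c, 1 + n_c - conv_c, draws)
    post_t = rng.beta(1 + conv_t, 1 + n_t - conv_t, draws)
    return float((post_t < post_c).mean())

# Hypothetical returning-user counts: 16.0% control vs 14.8% treatment
risk = prob_treatment_worse(1_600, 10_000, 1_480, 10_000)
```

Framing the segment harm as "probability the treatment is worse" is what makes the risk legible to a ship / no-ship decision.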

Evidence

What changed
the read.

Topline vs CUPED uplift
+5.1%
Raw uplift
+3.4%
CUPED-adjusted
Raw
+5.1%
CUPED
+3.4%

Pre-experiment imbalance inflated the raw signal by 1.7 percentage points.

Uplift stability over time
[Chart: uplift by day, D1-D10, falling from +5% toward 0%, flagged unreliable]

Effect driven by days 1 and 2. Signal degraded by day 10.

Segment divergence
New users
+8.2%
Returning
−4.1%
Safari mobile
excluded

99.3% posterior probability that treatment hurt returning users.

Guardrail failure
Refund rate
3.1% → 4.9%
↑ worse
Revenue / user
$6.72 → $6.41
↓ worse

Higher conversion did not translate to higher-quality conversions.

Decision

Do not ship
globally.

Variant B created more conversions, but lower-quality conversions. Refunds rose, revenue per converter fell, and the returning-user risk was too high for a global rollout.

Rollout decision by segment
New users on Chrome
Roll out carefully
Returning users
Keep control
Mobile Safari
Exclude until fixed
Organic and direct traffic
Keep monitoring
Paid social
Strongest candidate
Business impact

Most of the upside,
much less risk.

Restricted rollout keeps most of the upside while reducing refund cost and avoiding a worse experience for returning users.

Ship globally: +$104,200 net revenue
New users only: +$102,700 net revenue
Difference: +$1,500 extra (global)
Extra refund cost: +$33,700 (global)
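The trade-off reduces to simple arithmetic on the figures above:

```python
def rollout_tradeoff(net_global, net_restricted, extra_refund_cost_global):
    """Extra upside of a global rollout, and the extra refund
    exposure bought per dollar of that upside."""
    extra_upside = net_global - net_restricted
    return extra_upside, extra_refund_cost_global / extra_upside

upside, cost_per_dollar = rollout_tradeoff(104_200, 102_700, 33_700)
# Roughly $22 of extra refund exposure per $1 of extra net revenue
```

At that ratio, the restricted rollout is the dominant choice even before counting the returning-user harm.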
Next test

Run a cleaner experiment
before scaling.

1

Run for 4 full weeks

Short runs favor early-adopter effects and inflate uplift.

2

Use CUPED-adjusted uplift as success metric

Raw lift is vulnerable to pre-experiment imbalance.

3

Set the threshold at +2% adjusted uplift

The current +3.4% CUPED uplift clears the bar, but barely.

4

Add refund rate as a decision metric

Quality guardrails should be checked before reading the conversion result.

5

Keep a holdout for returning users

Returning users showed harm. Monitor them separately before expanding.

My role

What I owned.

I structured the experiment analysis, checked topline significance, applied CUPED adjustment, analyzed segment heterogeneity, corrected for multiple testing, reviewed Bayesian risk, evaluated guardrail metrics, and translated the result into a rollout decision.

What this shows
Evaluate A/B tests beyond statistical significance.
Use guardrail metrics and segmentation to protect decision quality.
Turn experimentation results into safer product rollout recommendations.
Next.

Explore the rest of the work.

The repository includes topline analysis, CUPED adjustment, segment testing with FDR correction, Bayesian risk evaluation, guardrail checks, and rollout impact modeling.

Maïssa Bounar © 2026
