A Misleading A/B Test: A Case Study
An experimentation case study showing why a statistically significant conversion lift was not enough to ship globally.
The test won.
The decision was still no.
The treatment increased conversion from 14.80% to 15.56%, a +5.1% relative uplift. A simple read would have shipped Variant B, but deeper analysis surfaced baseline imbalance, time decay, segment-level harm, failed guardrails, and lower-quality conversions.
CUPED reduced the lift
Raw uplift was +5.1%, but CUPED-adjusted uplift fell to +3.4%.
The effect faded after 48 hours
The result was carried by days 1 and 2, then became unreliable by day 10.
Returning users were harmed
Bayesian analysis showed a 99.3% probability that treatment hurt returning users.
Some segment wins were false positives
14 segment cuts were tested. 9 looked significant before FDR, but only 5 survived correction.
Guardrails failed
Refund rate increased and revenue per user fell despite higher conversion.
Global rollout added little upside
Global rollout added only $1,500 more net revenue than restricted rollout, with $33,700 more refund cost.
Four layers of analysis, applied in sequence.
Topline experiment read
Calculated conversion rate, absolute uplift, relative uplift, z-test, p-value, and 95% confidence interval.
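The topline read can be sketched as a two-proportion z-test. Only the 14.80% and 15.56% rates and the +5.1% relative uplift come from the case study; the 50,000-per-arm sample sizes below are assumed for illustration.

```python
from statistics import NormalDist
from math import sqrt

def topline_read(conv_c, n_c, conv_t, n_t):
    """Two-proportion z-test with a 95% CI on the absolute uplift."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    abs_uplift = p_t - p_c
    rel_uplift = abs_uplift / p_c
    # Pooled proportion for the test statistic
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = abs_uplift / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    ci = (abs_uplift - 1.96 * se, abs_uplift + 1.96 * se)
    return abs_uplift, rel_uplift, z, p_value, ci

# Assumed sample sizes chosen so the rates match 14.80% vs 15.56%.
abs_u, rel_u, z, p, ci = topline_read(7400, 50_000, 7780, 50_000)
```

Under these assumed sample sizes the lift is statistically significant, which is exactly the "simple read" that would have shipped Variant B.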
CUPED adjustment
Controlled for pre-experiment activity to reduce bias and estimate a cleaner treatment effect.
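A minimal CUPED sketch on synthetic user-level data. The real covariate and counts are not in this summary; the `simulate` helper, the 0.05 slope, and the 0.1 covariate imbalance are illustrative assumptions that reproduce the failure mode (a raw lift inflated by pre-experiment imbalance).

```python
import random

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), theta = cov(x, y) / var(x)."""
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

random.seed(0)

def simulate(n, x_shift, true_effect):
    # Pre-experiment activity x drives conversion propensity (slope assumed).
    xs = [random.gauss(1.0 + x_shift, 0.5) for _ in range(n)]
    ys = [1.0 if random.random() < 0.10 + 0.05 * xi + true_effect else 0.0
          for xi in xs]
    return xs, ys

x_c, y_c = simulate(20_000, 0.0, 0.00)   # control
x_t, y_t = simulate(20_000, 0.1, 0.01)   # treatment, imbalanced covariate
y_adj = cuped_adjust(y_c + y_t, x_c + x_t)

raw_lift = sum(y_t) / len(y_t) - sum(y_c) / len(y_c)
adj_lift = sum(y_adj[20_000:]) / 20_000 - sum(y_adj[:20_000]) / 20_000
# The covariate imbalance inflates raw_lift; CUPED pulls the estimate back
# toward the true treatment effect.
```

The design point: because the adjustment subtracts theta times the covariate gap between arms, any arm that started with more-active users has its raw lift deflated accordingly.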
Segment and FDR analysis
Tested user, browser, country, device, and traffic segments, then applied Benjamini-Hochberg correction.
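The Benjamini-Hochberg step can be sketched as follows. The 14 p-values below are illustrative placeholders (not the repo's actual segment p-values), constructed so that 9 clear 0.05 naively but only 5 survive FDR control, mirroring the counts above.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# 14 illustrative segment p-values.
p_vals = [0.0001, 0.030, 0.001, 0.20, 0.002, 0.035, 0.005,
          0.30, 0.010, 0.040, 0.50, 0.045, 0.70, 0.90]
naive_hits = [i for i, p in enumerate(p_vals) if p < 0.05]  # 9 "wins"
fdr_hits = benjamini_hochberg(p_vals)                       # 5 survive
```

The step-up rule rejects everything up to the largest rank r with p(r) <= (r/m) * alpha, which is what separates the 5 robust segment wins from the 4 likely false positives.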
Guardrails and Bayesian risk
Checked refund rate, revenue per user, revenue per converter, posterior risk, and rollout impact.
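The returning-user risk check can be sketched with independent Beta-Binomial posteriors and Monte Carlo sampling. The counts below are hypothetical; the source reports only the 99.3% posterior probability of harm.

```python
import random

def prob_treatment_worse(conv_c, n_c, conv_t, n_t, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_treatment < p_control) under
    independent Beta(1 + successes, 1 + failures) posteriors."""
    rng = random.Random(seed)
    worse = 0
    for _ in range(draws):
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        worse += p_t < p_c
    return worse / draws

# Hypothetical returning-user counts: 16.0% control vs 14.8% treatment.
risk = prob_treatment_worse(1600, 10_000, 1480, 10_000)
# A posterior risk this high is a strong argument against exposing
# returning users, regardless of the topline win.
```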
What changed the read.
Pre-experiment imbalance inflated the raw relative uplift by 1.7 percentage points (+5.1% raw vs +3.4% CUPED-adjusted).
Effect driven by days 1 and 2. Signal degraded by day 10.
99.3% posterior probability that treatment hurt returning users.
Higher conversion did not translate to higher-quality conversions.
Do not ship globally.
Variant B created more conversions, but lower-quality conversions. Refunds rose, revenue per converter fell, and the returning-user risk was too high for a global rollout.
Most of the upside, much less risk.
Restricted rollout keeps most of the upside while reducing refund cost and avoiding a worse experience for returning users.
Run a cleaner experiment before scaling.
Run for 4 full weeks
Short runs favor early-adopter effects and inflate uplift.
Use CUPED-adjusted uplift as success metric
Raw lift is vulnerable to pre-experiment imbalance.
Set the threshold at +2% adjusted uplift
The current +3.4% CUPED uplift clears the bar, but barely.
Add refund rate as a decision metric
Quality guardrails should be checked before a conversion lift is trusted.
Keep a holdout for returning users
Returning users showed harm. Monitor them separately before expanding.
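The "4 full weeks" and "+2% adjusted uplift" recommendations imply a concrete sample-size requirement. A sketch, assuming the 14.8% control baseline plus conventional 80% power and alpha = 0.05 (none of the sizing inputs beyond the baseline come from the source):

```python
from statistics import NormalDist
from math import ceil

def required_n_per_arm(p_base, rel_mde, alpha=0.05, power=0.80):
    """Two-proportion sample size to detect a relative lift of rel_mde."""
    p_alt = p_base * (1 + rel_mde)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil((z_a + z_b) ** 2 * var / (p_base - p_alt) ** 2)

# Detecting a +2% relative lift on a 14.8% baseline needs roughly
# 228,000 users per arm under these assumptions.
n = required_n_per_arm(0.148, 0.02)
```

A small minimum detectable effect on a modest baseline requires a large sample, which is the quantitative reason a longer, fixed-duration run is needed before scaling.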
What I owned.
I structured the experiment analysis, checked topline significance, applied CUPED adjustment, analyzed segment heterogeneity, corrected for multiple testing, reviewed Bayesian risk, evaluated guardrail metrics, and translated the result into a rollout decision.
Explore the rest of the work.
The repository includes topline analysis, CUPED adjustment, segment testing with FDR correction, Bayesian risk evaluation, guardrail checks, and rollout impact modeling.