A/B Testing

A/B test design, statistical analysis, sample size calculation, experiment prioritization, and results interpretation.

Workflow

1. Hypothesis Generation

Format: If we [change], then [metric] will [improve/decrease] by [amount], because [rationale].

Example: If we shorten the signup form from 5 fields to 3, then signup completion rate will increase by 15%, because friction reduction at high-intent moments increases conversion.

2. Prioritization

ICE framework (quick):

| Factor (score 1-10) | Definition |
|---------------------|------------|
| Impact | How much will it move the metric? |
| Confidence | How sure are we it'll work? |
| Ease | How fast/cheap to implement? |
| ICE Score | (I + C + E) / 3 |

RICE framework (more rigorous):

| Factor | Definition |
|--------|------------|
| Reach | How many users affected per quarter? |
| Impact | Expected effect size (0.25, 0.5, 1, 2, 3) |
| Confidence | % sure (100%, 80%, 50%) |
| Effort | Person-weeks to implement |
| RICE Score | (R × I × C) / E |
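
To make the scoring concrete, here is a minimal Python sketch that ranks a hypothetical backlog by RICE score; the idea names and numbers are invented for illustration.

```python
# Hypothetical backlog items: reach is users/quarter, effort is person-weeks.
backlog = [
    {"name": "Shorten signup form",   "reach": 40000, "impact": 1.0,  "confidence": 0.8, "effort": 2},
    {"name": "New pricing page hero", "reach": 15000, "impact": 2.0,  "confidence": 0.5, "effort": 4},
    {"name": "Exit-intent modal",     "reach": 60000, "impact": 0.25, "confidence": 0.8, "effort": 1},
]

for item in backlog:
    item["rice"] = item["reach"] * item["impact"] * item["confidence"] / item["effort"]

# Highest RICE score = test first
for item in sorted(backlog, key=lambda x: x["rice"], reverse=True):
    print(f"{item['name']}: RICE = {item['rice']:,.0f}")
```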

3. Sample Size Calculation

Formula:

n = (Z_α/2 × √(2p̄(1-p̄)) + Z_β × √(p₁(1-p₁) + p₂(1-p₂)))² / (p₂ - p₁)²

Where:
  p₁ = baseline conversion rate
  p₂ = expected conversion rate (baseline × (1 + MDE))
  p̄  = (p₁ + p₂) / 2
  Z_α/2 = 1.96 (for 95% confidence)
  Z_β   = 0.84 (for 80% power)
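
As a sanity check, here is a small Python sketch of the same formula (the function name is ours; it uses scipy's normal quantiles instead of the hard-coded Z values):

```python
import math
from scipy.stats import norm

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Two-sided, two-proportion sample size; mde is relative (0.10 = 10% lift)."""
    p1 = baseline
    p2 = baseline * (1 + mde)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

print(sample_size_per_variant(0.05, 0.10))  # ≈31,000; compare the 5% / 10% row below
```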

Quick reference table:

| Baseline rate | MDE (relative) | Sample per variant |
|---------------|----------------|--------------------|
| 2% | 10% | 78,000 |
| 2% | 20% | 20,000 |
| 5% | 10% | 30,000 |
| 5% | 20% | 7,700 |
| 10% | 10% | 14,300 |
| 10% | 20% | 3,700 |
| 20% | 10% | 6,300 |
| 20% | 20% | 1,600 |

Test duration:

Days needed = (Sample per variant × 2) / Daily traffic to test page

Minimum: 7 days (capture day-of-week effects). Maximum: 4 weeks (avoid novelty decay).
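
The same arithmetic as a small helper with the 7-day floor applied (the helper name and traffic figure are illustrative):

```python
import math

def test_duration_days(sample_per_variant, daily_traffic, n_variants=2, min_days=7):
    days = math.ceil(sample_per_variant * n_variants / daily_traffic)
    return max(days, min_days)

# 7,700 per variant (5% baseline, 20% MDE) at 2,000 daily visitors to the test page
print(test_duration_days(7700, 2000))  # 8 days
```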

4. Test Design

Rules:

  • One hypothesis per test
  • Randomly assign users, not sessions, to avoid flickering (see the bucketing sketch at the end of this section)
  • Use the same metric definition for control and variant
  • Define primary metric AND guardrail metrics before launch
  • Don't peek at results before reaching sample size

Guardrail metrics (always monitor):

  • Page load time (variant shouldn't be slower)
  • Error rate
  • Revenue per user (don't increase signups but tank revenue)
  • Bounce rate
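
One common way to assign users rather than sessions is deterministic hashing of the user ID, so a returning user always lands in the same bucket and never sees the page flicker between variants. This is a generic sketch, not the API of any particular experimentation platform:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "variant")) -> str:
    """Deterministic assignment: the same user + experiment always maps to the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user_123", "signup_form_v2"))  # stable across sessions and devices
```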

5. Statistical Analysis

Frequentist approach (standard):

```python
import numpy as np
from scipy import stats

# Observed results
control = {'visitors': 5000, 'conversions': 250}  # 5.0%
variant = {'visitors': 5000, 'conversions': 295}  # 5.9%

p1 = control['conversions'] / control['visitors']
p2 = variant['conversions'] / variant['visitors']
p_pool = (control['conversions'] + variant['conversions']) / (control['visitors'] + variant['visitors'])

# Two-proportion z-test with pooled standard error, two-sided p-value
se = np.sqrt(p_pool * (1 - p_pool) * (1/control['visitors'] + 1/variant['visitors']))
z = (p2 - p1) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# Relative lift, plus a 95% CI on the absolute difference expressed relative to control
lift = (p2 - p1) / p1 * 100
ci_95 = 1.96 * np.sqrt(p1*(1-p1)/control['visitors'] + p2*(1-p2)/variant['visitors'])

print(f"Control: {p1:.3%}")
print(f"Variant: {p2:.3%}")
print(f"Lift: {lift:.1f}%")
print(f"95% CI: [{(p2-p1-ci_95)/p1*100:.1f}%, {(p2-p1+ci_95)/p1*100:.1f}%]")
print(f"p-value: {p_value:.4f}")
print(f"Significant: {'Yes' if p_value < 0.05 else 'No'}")
```

Bayesian approach (when you want probability of being better):

```python
from scipy.stats import beta

# Beta posteriors with a uniform Beta(1, 1) prior: Beta(conversions + 1, non-conversions + 1)
a_alpha = control['conversions'] + 1
a_beta = control['visitors'] - control['conversions'] + 1
b_alpha = variant['conversions'] + 1
b_beta = variant['visitors'] - variant['conversions'] + 1

# Monte Carlo simulation: sample both posteriors and count how often the variant wins
samples_a = beta.rvs(a_alpha, a_beta, size=100000)
samples_b = beta.rvs(b_alpha, b_beta, size=100000)

prob_b_better = (samples_b > samples_a).mean()
print(f"P(variant > control): {prob_b_better:.1%}")
```

6. Ship / No-Ship Decision

| Scenario | Decision |
|----------|----------|
| p < 0.05 AND lift > MDE AND guardrails OK | Ship |
| p < 0.05 AND lift > 0 but < MDE | Ship if no cost, otherwise iterate |
| p > 0.05 AND lift direction positive | Inconclusive: extend or iterate |
| p < 0.05 AND lift negative | Kill variant |
| Guardrail metric degraded | Kill variant regardless of primary metric |
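
The rubric above as a small helper function (the name and branch order are ours and simply mirror the table; lift and MDE are relative values):

```python
def ship_decision(p_value, lift, mde, guardrails_ok, alpha=0.05):
    """Map a test result onto the ship / kill / iterate rubric."""
    if not guardrails_ok:
        return "Kill variant (guardrail degraded)"
    if p_value < alpha:
        if lift >= mde:
            return "Ship"
        if lift > 0:
            return "Ship if no cost, otherwise iterate"
        return "Kill variant"
    return "Inconclusive: extend or iterate" if lift > 0 else "Inconclusive: iterate"

print(ship_decision(p_value=0.03, lift=0.18, mde=0.15, guardrails_ok=True))  # Ship
```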

7. Documentation Template

## Test: [Name]
**Hypothesis:** If we [change], then [metric] will [change] by [amount]
**Primary metric:** [metric name]
**Guardrails:** [metric 1, metric 2]
**Sample size:** [X per variant]
**Duration:** [start] to [end]

### Results
| Metric | Control | Variant | Lift | p-value | Sig? |
|--------|---------|---------|------|---------|------|
| Primary | X% | Y% | +Z% | 0.XX | Y/N |

### Decision: Ship / Kill / Iterate
**Reasoning:** [Why]
**Next test:** [What we learned and what to try next]

Common Mistakes

  • Stopping early because results "look significant" (peeking inflates false positives; see the A/A simulation after this list)
  • Running too many variants (splits traffic, takes forever to reach significance)
  • Testing tiny changes on low-traffic pages (will never reach significance)
  • Not segmenting results (variant might win overall but lose on mobile)
  • Ignoring practical significance (statistically significant 0.1% lift isn't worth shipping)
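
To see why the first mistake matters, here is a small A/A simulation sketch: both arms share the same true conversion rate, yet running a z-test after every batch and stopping at the first p < 0.05 produces false positives well above the nominal 5%. Batch size, rate, and number of looks are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeks_falsely(true_rate=0.05, batch=500, n_looks=20, alpha=0.05):
    """Return True if an A/A test ever 'looks significant' when checked after every batch."""
    n = conv_a = conv_b = 0
    for _ in range(n_looks):
        conv_a += rng.binomial(batch, true_rate)
        conv_b += rng.binomial(batch, true_rate)
        n += batch
        p_pool = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
        if se > 0:
            z = (conv_b / n - conv_a / n) / se
            if 2 * (1 - stats.norm.cdf(abs(z))) < alpha:
                return True
    return False

runs = 2000
rate = sum(peeks_falsely() for _ in range(runs)) / runs
print(f"False-positive rate with peeking: {rate:.1%}")  # well above the nominal 5%
```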