Causal Inference Playbook

All techniques from Uber's causal inference and mediation modeling blogs, with interactive examples applied at Uber and Intuit.

Techniques Extracted from the Articles

Technique | Context | Article
CUPED (Variance Reduction) | Randomized Experiments | Causal Inference at Uber
Propensity Score Matching | Observational | Both articles
Inverse Probability Weighting (IPW) | Observational / Experiments | Causal Inference at Uber
Regression Discontinuity Design (RDD) | Observational (quasi-experiment) | Causal Inference at Uber
Difference-in-Differences (DiD) | Observational (panel) | Causal Inference at Uber
Synthetic Control | Observational (aggregate) | Causal Inference at Uber
Instrumental Variables (IV) | Observational (endogeneity) | Causal Inference at Uber
Mediation Analysis (ACME / ADE) | Mechanism decomposition | Mediation Modeling at Uber
Uplift Modeling / HTE | Heterogeneous effects | Causal Inference at Uber
Bayesian Structural Time Series (BSTS) | Time series counterfactual | Causal Inference at Uber
CACE (Complier Average Causal Effect) | Non-compliance in experiments | Causal Inference at Uber

🔬 Experimental Methods

When you can randomize: CUPED for variance reduction, CACE for non-compliance, Uplift for heterogeneous treatment effects.

🔭 Observational Methods

When you cannot randomize: PSM, IPW, RDD, DiD, Synthetic Control, IV; each handles a different source of confounding.

⚙️ Mechanism Methods

When you need to know why: Mediation analysis decomposes total effects into direct and indirect pathways.

📈 Time Series Methods

When treatment is aggregate: BSTS with synthetic controls builds counterfactual time series to estimate impact.

Experimental

CUPED β€” Controlled-experiment Using Pre-Experiment Data

Reduce variance in A/B tests by leveraging pre-experiment data, so you can detect smaller effects faster.

Core Formula
Ŷ_cv = Y − θ(X − E[X])
where θ = Cov(Y, X) / Var(X), and X is a pre-experiment covariate

Statistical Perspective

CUPED is a control variate method borrowed from Monte Carlo simulation. The key insight: if you have a covariate X correlated with your outcome Y, you can construct an adjusted outcome Ŷ = Y − θ(X − E[X]) that has lower variance than Y while preserving the same expectation (since E[X − E[X]] = 0, the adjustment is mean-zero).

The optimal θ = Cov(Y,X)/Var(X) minimizes Var(Ŷ). The resulting variance reduction is exactly ρ², where ρ is the Pearson correlation between X and Y. This is equivalent to regressing Y on X and using the residuals: the treatment effect estimate from comparing CUPED-adjusted means is identical to the coefficient on the treatment indicator in a regression of Y on [Treatment, X].
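
The regression equivalence is easy to verify numerically. A minimal sketch on simulated data (the data-generating numbers are illustrative, not from the articles):

```python
# Check: CUPED-adjusted mean difference vs. the treatment coefficient
# from OLS of Y on [Treatment, X] (simulated, illustrative numbers)
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(0, 1, n)                      # pre-experiment covariate
t = rng.binomial(1, 0.5, n)                  # random assignment
y = 2.0 * x + 0.5 * t + rng.normal(0, 1, n)  # true effect = 0.5

theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
y_adj = y - theta * (x - x.mean())
cuped_ate = y_adj[t == 1].mean() - y_adj[t == 0].mean()

# OLS of y on [1, treatment, x]; coefficient on treatment
X = np.column_stack([np.ones(n), t, x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(cuped_ate, beta[1])  # near-identical estimates of the 0.5 effect
```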

Why it works for causal inference: X is measured pre-randomization, so it's independent of treatment assignment. The adjustment doesn't introduce bias; it only removes predictable variation in Y, leaving the treatment signal cleaner.

Intuitive Perspective

Imagine you're comparing test scores between two classrooms after a new teaching method. Some students are naturally A students, others are C students. This natural variation makes it hard to see a small teaching improvement.

CUPED's trick: Before the experiment, you already know each student's GPA (the pre-experiment covariate). Instead of looking at raw test scores, you look at how much better or worse each student did compared to what you'd predict from their GPA. An A student scoring 92 is "normal"; an A student scoring 88 is "unusually low." A C student scoring 78 is "surprisingly good."

By stripping out the predictable part ("of course the A student scored high"), you're left with only the surprise, and the treatment effect shows up much more clearly in those surprises. The better you can predict baseline performance (higher correlation), the more noise you strip away.

The punchline: CUPED doesn't change what you measure or who you test. It just says "don't be impressed that good students scored high; tell me if they scored higher than expected."

Uber Dispatch Algorithm A/B Test

Business Context: Uber's marketplace team is testing a new dispatch algorithm ("Smart Match v2") that considers driver-rider personality compatibility scores alongside ETA. The primary metric is ride completion rate. The challenge: ride completion varies enormously across rider segments: a power commuter completes 98% of rides, while a casual weekend user might cancel 30%. This heterogeneity inflates variance and makes it hard to detect the expected ~0.3pp improvement.

Data Setup: Pre-experiment covariate X = each rider's rolling 60-day ride completion rate (measured before randomization). Post-experiment outcome Y = completion rate during the 2-week test. The correlation between X and Y is ρ ≈ 0.82.

Methodology: For each rider, compute Ŷ_cv = Y − θ(X − X̄), where θ = Cov(Y,X)/Var(X). This strips out the "predictable" part of each rider's behavior, leaving only the experiment-driven variation. With ρ = 0.82, variance reduction = ρ² = 67%.

Results: Raw analysis: ATE = 0.31pp, p = 0.18 (not significant at n = 400K after 2 weeks). CUPED-adjusted: ATE = 0.29pp, SE drops from 0.23 to 0.13, p = 0.026. The experiment reaches significance without extending the test or increasing traffic allocation.

Takeaway: CUPED didn't change the point estimate; it shrank the confidence interval. Without it, the team would have needed to run the test 3x longer or ship based on a directional-but-insignificant result. This is critical in Uber's marketplace where longer experiments risk marketplace contamination.

Intuit W-2 Import Flow Redesign

Business Context: TurboTax is testing a redesigned W-2 import flow that uses OCR + auto-fill instead of manual entry. The primary metric is filing completion rate. Challenge: filing completion is heavily driven by user tax complexity: a simple single-W-2 filer completes at 85%, while a multi-income household with investments completes at 45%. This creates massive variance in the outcome metric.

Data Setup: Pre-experiment covariate X = each user's prior-year filing completion status (binary: completed vs. abandoned) plus number of forms filed last year. These are measured before the experiment, so they're immune to treatment contamination. ρ(X, Y) ≈ 0.74.

Methodology: Extend CUPED to multiple covariates: Ŷ_cv = Y − Θ'(X − X̄), where Θ is fit via OLS of Y on X in the control group. The multivariate version captures both the "did they finish last year" signal and the "how complex is their return" signal.

Results: Raw: ATE = 1.8pp, SE = 1.1pp, p = 0.10 (would need 2 more weeks). CUPED-adjusted: ATE = 1.7pp, SE = 0.64pp, p = 0.008. The team ships in week 1 with high confidence, catching the peak of tax season (critical timing β€” each week of delay costs ~$2M in potential revenue).

Takeaway: In tax, timing is everything. Filing season is ~12 weeks long. Running a 3-week experiment means shipping a winning feature for only 9 weeks. CUPED compressed the test to 1 week, giving 2 extra weeks of the improved experience during peak filing volume.
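
A minimal sketch of the multi-covariate adjustment described above, with Θ fit by OLS on the control group (simulated data; coefficients are illustrative, not Intuit's):

```python
# Multi-covariate CUPED: fit theta by OLS of Y on X among controls,
# then subtract the centered prediction from everyone's outcome
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 20000
X_pre = np.column_stack([
    rng.binomial(1, 0.6, n),   # completed last year (binary)
    rng.poisson(3, n),         # number of forms filed last year
])
t = rng.binomial(1, 0.5, n)
y = 10 * X_pre[:, 0] + 2 * X_pre[:, 1] + 1.8 * t + rng.normal(0, 8, n)

ols = LinearRegression().fit(X_pre[t == 0], y[t == 0])  # control group only
pred = ols.predict(X_pre)
y_adj = y - (pred - pred.mean())   # i.e. y - Theta'(X - X_bar)

ate = y_adj[t == 1].mean() - y_adj[t == 0].mean()
print(ate, 1 - y_adj.var() / y.var())  # ~1.8pp effect, sizable variance drop
```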

Interactive: See How CUPED Reduces Variance

[Interactive widget: with ρ = 0.70 and baseline variance 50, variance reduction = ρ² = 49%, adjusted variance 25.5, and the required sample size shrinks by ~49%.]

Python Example

# CUPED Variance Reduction
import numpy as np
import pandas as pd

def cuped_adjust(y_post, x_pre):
    """Adjust post-experiment metric using pre-experiment covariate."""
    theta = np.cov(y_post, x_pre)[0, 1] / np.var(x_pre, ddof=1)  # ddof=1 matches np.cov
    y_adjusted = y_post - theta * (x_pre - np.mean(x_pre))
    return y_adjusted

# Example: TurboTax filing completion experiment
np.random.seed(42)
n = 10000

# Pre-experiment: prior year filing progress (0-100%)
x_pre = np.random.normal(65, 20, n)

# Post-experiment: current year completion (correlated with pre)
treatment = np.random.binomial(1, 0.5, n)
true_effect = 2.0  # True ATE = 2 percentage points
y_post = 0.7 * x_pre + treatment * true_effect + np.random.normal(0, 15, n)

# Standard estimate (high variance)
ate_raw = y_post[treatment==1].mean() - y_post[treatment==0].mean()

# CUPED-adjusted estimate (lower variance)
y_adj = cuped_adjust(y_post, x_pre)
ate_cuped = y_adj[treatment==1].mean() - y_adj[treatment==0].mean()

se_raw = 2 * y_post.std() / np.sqrt(n)  # SE of a difference in means, n/2 per arm
se_adj = 2 * y_adj.std() / np.sqrt(n)
print(f"Raw ATE:   {ate_raw:.3f}  (SE: {se_raw:.3f})")
print(f"CUPED ATE: {ate_cuped:.3f}  (SE: {se_adj:.3f})")
print(f"Variance reduction: {1 - y_adj.var()/y_post.var():.1%}")

Key Assumptions

  • Pre-experiment covariate X is measured before randomization (not affected by treatment)
  • Linear relationship between X and Y (for standard CUPED; extensions exist for non-linear)
  • Higher correlation between X and Y → more variance reduction (ρ² is the fraction reduced)
  • Randomization is valid: CUPED reduces variance, it does not fix broken randomization

Observational

Propensity Score Matching (PSM)

Match treated and control units based on their probability of receiving treatment, to approximate a randomized experiment from observational data.

Propensity Score
e(X) = P(T = 1 | X)
Estimated via logistic regression. Match units with similar e(X) across treatment groups.

Statistical Perspective

The fundamental problem: in observational data, treatment assignment depends on covariates X (confounders). The propensity score theorem (Rosenbaum & Rubin, 1983) states that if treatment assignment is strongly ignorable given X, i.e., (Y(0), Y(1)) ⊥ T | X, then it's also ignorable given the scalar e(X) = P(T=1|X).

This is a dimensionality reduction result: instead of matching on 20 covariates (curse of dimensionality), you match on a single number. The propensity score is a balancing score: within strata of e(X), the distribution of X is the same for treated and control units. Matching on e(X) thus creates approximate covariate balance.

After matching, you estimate the ATT (Average Treatment Effect on the Treated) by comparing matched outcomes. The quality of the estimate depends on: (1) correct specification of the propensity model, (2) sufficient overlap (common support), and (3) no unmeasured confounders. Sensitivity analysis (Rosenbaum bounds) quantifies how strong an unmeasured confounder would need to be to invalidate the results.

Intuitive Perspective

Think of it as finding your twin. You want to know if a drug works, but you can't run a trial. Some people took the drug (treatment) and some didn't (control). The problem: the people who took the drug are different; maybe they're sicker, older, or more health-conscious.

Step 1: For each person, compute a single number: "How likely was this person to take the drug, given their characteristics?" A 60-year-old diabetic might have a 70% chance; a healthy 25-year-old might have a 5% chance. That's the propensity score.

Step 2: Now pair up each drug-taker with a non-drug-taker who had the same likelihood of taking the drug. A 70%-propensity person who DID take it gets matched with a 70%-propensity person who DIDN'T. These "twins" are similar in all the ways that matter for the treatment decision.

Step 3: Compare outcomes within each pair. The average difference is your causal estimate. The intuition: if two people were equally likely to take the drug but only one did, the difference in their outcomes is more plausibly caused by the drug (not by their background characteristics).

Uber Uber Eats Delivery Delays → Customer Retention

Business Context: Uber Eats product leadership wants to quantify exactly how much a late delivery costs in customer lifetime value. They can't run an experiment (you can't randomly delay people's food). But natural delays happen constantly due to restaurant prep time variance, driver availability, and traffic. The question: does a 15-minute delay cause lower retention, or do low-retention customers just tend to order during high-delay periods (Friday dinner rush)?

Data Setup: 200K orders from Q3. Treatment: order delivered ≥15 min late vs. on-time. Covariates for propensity model: (1) user's historical order frequency, (2) cuisine category, (3) order time (hour + day-of-week), (4) restaurant-to-customer distance, (5) current marketplace utilization rate, (6) user tenure on platform, (7) average prior order value. Outcome: binary 30-day reorder.

Methodology: Fit logistic regression: P(delayed | covariates). Verify overlap: check that propensity score distributions for delayed and non-delayed orders have substantial common support (they do, between 0.08 and 0.65). Use 1:1 nearest-neighbor matching without replacement on the logit of the propensity score (caliper = 0.2 SD). Post-matching, verify covariate balance: all standardized mean differences < 0.05.

Results: Naive comparison: delayed orders have 11pp lower 30-day reorder (confounded: Friday rush orders are both more delayed AND from less loyal "try it once" users). After PSM: ATT = −4.1pp (95% CI: [−5.3, −2.9]). The confounding explained 7pp of the naive gap. At Uber Eats' scale, this 4pp translates to ~$180M/year in lost LTV.

Takeaway: The PSM result gave the operations team a concrete dollar figure to justify investing in delivery time reduction. It also revealed that the effect is non-linear: delays of 5–10 min have negligible impact, but the retention cliff is steep beyond 15 min, informing their SLA threshold design.
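
The balance diagnostic used above (standardized mean differences) is simple to compute; a sketch with hypothetical delivery-distance data:

```python
# Standardized mean difference: |SMD| < 0.1 is a common balance threshold
import numpy as np

def smd(x_treated, x_control):
    """(mean_t - mean_c) / pooled standard deviation."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(2)
# Before matching: delayed orders come from longer trips (imbalanced)
dist_delayed = rng.exponential(4.0, 3000)
dist_ontime = rng.exponential(3.0, 9000)
smd_before = smd(dist_delayed, dist_ontime)

# After matching: matched controls resemble the treated distribution
dist_matched = rng.exponential(4.0, 3000)
smd_after = smd(dist_delayed, dist_matched)
print(smd_before, smd_after)  # large before, near zero after
```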

Intuit TurboTax Live Expert Impact on Filing Completion

Business Context: TurboTax Live connects filers with CPAs/EAs for real-time help. Users who engage with Live complete at 89% vs. 71% for DIY-only. But there's massive self-selection: users who seek expert help tend to be more motivated, have more complex returns (meaning more at stake), and have higher income. The product team needs the causal effect to justify the cost of expert staffing (~$35/session).

Data Setup: 150K users in the TurboTax Live eligible population from TY24. Treatment: engaged with a live expert (at least one session). Covariates: (1) prior-year product tier, (2) number of tax forms, (3) AGI bracket, (4) filing status, (5) entry point (organic vs. paid), (6) time spent before first expert interaction, (7) state complexity score, (8) mobile vs. desktop.

Methodology: Propensity model: gradient-boosted classifier (logistic regression was insufficient due to non-linear interactions between AGI and form count). Match on propensity score with caliper = 0.1 SD. 1:3 matching (each Live user matched to 3 DIY users) to preserve power. Post-matching balance check: all standardized differences < 0.03. Sensitivity analysis via Rosenbaum bounds: results hold up to Γ = 1.8 (an unmeasured confounder would need to increase odds of treatment by 80% to explain away the effect).

Results: Naive gap: 18pp. After PSM: ATT = 8.2pp (95% CI: [6.1, 10.3]). Self-selection explains ~10pp of the raw gap. At an 8pp completion lift and $120 average revenue per completed return, each $35 expert session generates $9.60 in incremental revenue, a 27% ROI before accounting for downstream retention and upsell.

Takeaway: Without PSM, leadership would have valued Live sessions at $21.60/session (based on naive 18pp), over-investing in expert staffing. The corrected $9.60 figure is still positive ROI but changes the staffing model: prioritize expert availability for high-complexity filers (where the CATE is highest) rather than blanket availability.

Python Example

# Propensity Score Matching
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import numpy as np, pandas as pd

def propensity_score_match(df, covariates, treatment_col, outcome_col, n_neighbors=1):
    """Match treated to control units on propensity score."""
    # Step 1: Estimate propensity scores
    lr = LogisticRegression(max_iter=1000)
    lr.fit(df[covariates], df[treatment_col])
    df['ps'] = lr.predict_proba(df[covariates])[:, 1]

    # Step 2: Match treated to nearest control
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric='euclidean')
    nn.fit(control[['ps']])
    distances, indices = nn.kneighbors(treated[['ps']])
    matched_control = control.iloc[indices.flatten()]

    # Step 3: Estimate ATT
    att = treated[outcome_col].mean() - matched_control[outcome_col].mean()
    return att, df['ps']

# Example: Uber Eats delivery delay -> reorder rate
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    'order_freq': np.random.poisson(8, n),
    'avg_distance': np.random.exponential(3, n),
    'peak_hour': np.random.binomial(1, 0.3, n),
})
# Delay is more likely for distant, peak-hour orders (confounders)
logit = -1 + 0.15*df['avg_distance'] + 0.8*df['peak_hour']
df['delayed'] = np.random.binomial(1, 1/(1+np.exp(-logit)))
# Outcome: 30-day reorder; the confounders also depress reorder,
# and delay has a true -4pp effect
df['reorder'] = np.random.binomial(1, np.clip(
    0.6 + 0.02*df['order_freq'] - 0.02*df['avg_distance']
    - 0.05*df['peak_hour'] - 0.04*df['delayed'], 0, 1))

att, ps = propensity_score_match(
    df, ['order_freq', 'avg_distance', 'peak_hour'], 'delayed', 'reorder')
print(f"Estimated ATT: {att:.3f}")  # Should be close to -0.04

Key Assumptions

  • Unconfoundedness / Conditional Independence: No unmeasured confounders; all variables affecting both treatment and outcome are observed
  • Common Support (Overlap): For every treated unit, there exists a comparable control unit with similar propensity score
  • SUTVA: One unit's treatment doesn't affect another unit's outcome
  • Model specification matters: if the propensity model is wrong, matches are biased

Observational / Experimental

Inverse Probability Weighting (IPW)

Re-weight observations to create a pseudo-population where treatment is independent of confounders. No need to discard unmatched units.

IPW Estimator (ATE)
τ̂_IPW = (1/n) Σ [ T·Y / e(X) − (1−T)·Y / (1−e(X)) ]
Weight each observation by inverse of its probability of receiving the treatment it actually got.

Statistical Perspective

IPW creates a pseudo-population where treatment is independent of confounders. The Horvitz-Thompson estimator re-weights each observation: treated units get weight 1/e(X), control units get 1/(1−e(X)). This is equivalent to creating a survey-style weighted sample where the "sampling probability" is the probability of the treatment actually received.

Mathematically: E[TY/e(X)] = E[E[TY/e(X)|X]] = E[Y(1)·P(T=1|X)/e(X)] = E[Y(1)]. The propensity score in the denominator cancels the selection bias in the numerator. Stabilized weights (weight by P(T)/e(X) instead of 1/e(X)) reduce variance without introducing bias. Weight trimming at extreme propensity scores (e.g., <0.05 or >0.95) trades small bias for large variance reduction.

Key advantage over matching: IPW uses ALL observations (no discarding), and the estimand is the ATE (not just ATT). It's also more naturally combined with regression (the "doubly robust" estimator: correct if EITHER the propensity model OR the outcome model is correct).
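
The doubly robust idea can be sketched as an AIPW estimator: start from an outcome-model prediction of the effect, then correct it with IPW-weighted residuals (simulated data; not code from the articles):

```python
# AIPW (doubly robust): consistent if either the propensity model
# or the outcome model is correctly specified
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n = 10000
x = rng.normal(0, 1, (n, 1))                     # confounder
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))  # confounded treatment
y = 3 * x[:, 0] + 2.0 * t + rng.normal(0, 1, n)  # true ATE = 2

ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
ps = np.clip(ps, 0.05, 0.95)

# Outcome models fit separately by arm, predicted for everyone
mu1 = LinearRegression().fit(x[t == 1], y[t == 1]).predict(x)
mu0 = LinearRegression().fit(x[t == 0], y[t == 0]).predict(x)

aipw = np.mean(mu1 - mu0
               + t * (y - mu1) / ps
               - (1 - t) * (y - mu0) / (1 - ps))
print(aipw)  # close to the true ATE of 2
```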

Intuitive Perspective

Imagine a university studying whether study groups help exam scores. Students self-select: 90% of pre-med students join groups, but only 20% of art students do. If you naively compare group vs. no-group, you're mostly comparing pre-med (group) to art (no-group) students.

IPW's fix: Give rare observations more weight. That art student who DID join a group? They're unusual and informative; weight them more heavily (1/0.20 = 5x). That pre-med student who joined? Expected and redundant; weight them less (1/0.90 = 1.1x). For the control side: the pre-med who DIDN'T join gets heavy weight (1/0.10 = 10x; they're rare and informative) while the art student who didn't join gets low weight (1/0.80 = 1.25x).

The effect: After re-weighting, the "group" and "no-group" populations look identical in composition: same mix of pre-med and art students. Any remaining difference in outcomes can be attributed to the group itself, not to who chose to join.

vs. Matching: Matching throws away unmatched people. IPW keeps everyone but turns the volume up on informative observations and down on redundant ones. It's like equalizing a survey where some groups are oversampled.

Uber Surge Pricing → Ride Request Elasticity

Business Context: Uber's pricing team needs to understand how surge multipliers affect ride request probability. They can't randomize surge (it's set by real-time supply/demand), and naive comparisons are hopelessly confounded: surge happens precisely when demand is already high, making it look like surge increases demand.

Data Setup: 500K ride request opportunities (app opens where a price was shown) across 30 cities over 8 weeks. Treatment: surge ≥ 1.5x. Covariates: (1) metro area, (2) hour-of-day × day-of-week, (3) weather conditions, (4) active drivers within 5 min, (5) local event flags (concerts, sports), (6) historical demand for that zone-hour. Outcome: binary ride request within 5 minutes.

Why IPW over PSM: PSM would discard ~40% of observations (poor overlap in extreme demand periods). IPW retains all observations, weighted to represent the population that could have plausibly experienced either surge or no-surge. Stabilized weights are used: w = P(T) / P(T|X) to reduce variance from extreme weights.

Results: Naive: surge zones show +5% ride requests (confounded: high demand drives both surge and requests). IPW-adjusted: surge ≥ 1.5x causes an 18.3% reduction in ride requests (95% CI: [−21.1%, −15.5%]). The pricing team uses this elasticity curve to optimize the surge multiplier schedule, finding that the revenue-maximizing surge is 1.3x, not the 1.8x the algorithm was frequently setting.

Intuit Recovering a Broken Experiment via IPW

Business Context: TurboTax ran an experiment on a new "Tax Timeline" dashboard. After 2 weeks, the experimentation platform flagged a sample ratio mismatch: treatment had 8% fewer users than expected. Investigation revealed that the new dashboard loaded 200ms slower on older Android devices, causing those users to bounce before the assignment event fired. Re-running would waste 2 weeks of peak tax season.

Data Setup: 300K users assigned (50/50). Treatment retained 138K; control retained 150K. The 12K "missing" treatment users are disproportionately: older Android, lower bandwidth, first-time filers. Covariates for dropout model: device age, OS version, connection speed, prior-year filing status, session start time.

Methodology: Model P(retained in sample | assigned to treatment, covariates) using logistic regression. Weight each remaining treatment user by 1/P(retained). This up-weights users similar to the ones who dropped (old Android, slow connection), reconstructing what the treatment group would have looked like without the telemetry gap. Verify: weighted covariate distributions match control group.
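
A toy version of this correction, with a single hypothetical covariate (`old_device`) standing in for the dropout drivers (all numbers illustrative):

```python
# Weight retained treatment users by 1 / P(retained | covariates)
# to reconstruct the full assigned population
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 50000
old_device = rng.binomial(1, 0.3, n)          # harder-to-convert segment
y = rng.binomial(1, 0.62 - 0.3 * old_device)  # filing completion outcome
# Old-device users bounce before the assignment event fires more often
retained = rng.binomial(1, 0.98 - 0.4 * old_device).astype(bool)

naive = y[retained].mean()                    # biased upward

# Dropout model fit on the full assigned population
lr = LogisticRegression().fit(old_device.reshape(-1, 1), retained)
p_ret = lr.predict_proba(old_device[retained].reshape(-1, 1))[:, 1]
weighted = np.average(y[retained], weights=1 / p_ret)

print(naive, weighted, y.mean())  # weighted mean recovers the full-sample mean
```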

Results: Unweighted (biased): ATE = +3.1pp completion lift (inflated: the dropped users were harder to convert). IPW-adjusted: ATE = +1.9pp (95% CI: [0.8, 3.0]). Still significant and shippable, but the correct effect size changes the projected revenue impact from $8M to $5M, affecting the prioritization of the fix for the performance regression.

Python Example

# Inverse Probability Weighting
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(y, treatment, X, clip_bounds=(0.05, 0.95)):
    """Estimate ATE using stabilized IPW."""
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X, treatment)
    ps = lr.predict_proba(X)[:, 1]
    ps = np.clip(ps, *clip_bounds)  # Trim extreme weights

    # Horvitz-Thompson estimator
    w1 = treatment / ps
    w0 = (1 - treatment) / (1 - ps)
    ate = np.mean(w1 * y) - np.mean(w0 * y)

    # Stabilized weights (less variance)
    p_treat = treatment.mean()
    sw1 = treatment * p_treat / ps
    sw0 = (1 - treatment) * (1 - p_treat) / (1 - ps)
    ate_stab = (np.sum(sw1 * y) / np.sum(sw1)) - (np.sum(sw0 * y) / np.sum(sw0))

    return {'ate': ate, 'ate_stabilized': ate_stab, 'ps': ps}

# Example: Surge pricing effect on ride requests
np.random.seed(42)
n = 8000
demand_level = np.random.normal(0, 1, n)  # Confounder
surge = np.random.binomial(1, 1/(1+np.exp(-demand_level)))
rides = 50 + 10*demand_level - 9*surge + np.random.normal(0, 5, n)

result = ipw_ate(rides, surge, demand_level.reshape(-1,1))
print(f"IPW ATE: {result['ate']:.2f}")  # True: -9
print(f"Naive diff: {rides[surge==1].mean()-rides[surge==0].mean():.2f}")  # Biased
Quasi-Experimental

Regression Discontinuity Design (RDD)

Exploit a cutoff rule: units just above vs. just below a threshold are near-identical, creating a natural experiment at the boundary.

Local Average Treatment Effect at Cutoff
τ_RDD = lim(x→c⁺) E[Y|X=x] − lim(x→c⁻) E[Y|X=x]
The jump in the outcome at the cutoff c identifies the causal effect.

Statistical Perspective

RDD exploits a known assignment rule: treatment is determined by whether a "running variable" X crosses a cutoff c. The identifying assumption is that all potential confounders are continuous at the cutoff: there's no reason someone at X = c−ε should be different from someone at X = c+ε in any way except their treatment status.

Formally: E[Y(0)|X=c] and E[Y(1)|X=c] are identified by the left and right limits of E[Y|X] at c. Any discontinuity in the outcome at c is attributed to the treatment. Estimation uses local linear regression within a bandwidth h of the cutoff, fitting separate slopes on each side. Bandwidth selection (Imbens-Kalyanaraman or Calonico-Cattaneo-Titiunik) balances bias (narrower = less bias) vs. variance (wider = more data).

Validity checks: (1) McCrary test: the density of X should be continuous at c (no bunching = no manipulation). (2) Covariate smoothness: other observable characteristics should not jump at c. (3) Robustness to bandwidth choice. The estimand is the LATE at the cutoff; it's only valid for units near the boundary, not the full population.
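
The real McCrary test fits local polynomials to the density of the running variable; as a crude illustration (not the actual test), one can compare counts in narrow bins flanking the cutoff:

```python
# Rough density-continuity check at the cutoff: with no manipulation,
# counts just left and just right of c should be statistically equal
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 100, 50000)  # running variable, no bunching by design
c, h = 35.0, 1.0                # cutoff and bin width (hypothetical)

n_left = int(np.sum((x >= c - h) & (x < c)))
n_right = int(np.sum((x >= c) & (x < c + h)))
# Under continuity, n_right ~ Binomial(n_left + n_right, 0.5)
z = (n_right - n_left) / np.sqrt(n_left + n_right)
print(n_left, n_right, z)       # |z| > 2 would suggest bunching
```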

Intuitive Perspective

Think of a drinking age law. On your 20th birthday, you can't legally drink. On your 21st birthday, you can. A person who is 20 years and 364 days old is essentially identical to a person who is 21 years and 1 day old: same maturity, same life circumstances. The only difference is the legal cutoff.

If you see a sudden jump in, say, car accidents right at age 21, you can attribute it to legal drinking access β€” because nothing else changed discontinuously at that exact birthday.

In business terms: Any time there's a rule that says "if your score/value/metric crosses X, you get different treatment", and people can't precisely control their score, you have a natural experiment at the boundary. People just barely qualifying vs. just barely not qualifying are like a randomized experiment, but nature (or the algorithm) did the randomizing for you.

The limitation: You only learn the effect at the boundary. The effect of surge pricing at the 2.0x threshold tells you about riders experiencing ~2.0x surge; it may not generalize to 3.0x surge. You're looking through a narrow window, but what you see through it is very credible.

Uber Surge Display Threshold: Does "2.0x" Scare Riders More Than "1.9x"?

Business Context: Uber's pricing algorithm sets surge multipliers as a continuous function of local supply/demand. But the displayed price jumps at round numbers (1.5x, 2.0x, 2.5x). The behavioral economics team suspects round-number surge multipliers create a psychological "sticker shock" beyond the actual price difference. Testing this with an A/B test would require changing the pricing algorithm, which is risky in a live marketplace.

Data Setup: Running variable: the continuous demand-supply ratio that determines the surge multiplier. Cutoff: the ratio threshold at which surge crosses 2.0x. 180K ride request opportunities where the underlying ratio was within ±0.15 of the cutoff (bandwidth selection via Imbens-Kalyanaraman optimal bandwidth). Outcome: P(rider requests ride within 3 minutes of seeing the price).

Key Insight: Riders at 1.95x and 2.05x surge face nearly identical supply/demand conditions; they just happen to fall on different sides of a rounding boundary. This is as-good-as-random assignment near the cutoff.

Results: Local linear regression shows a discontinuous drop of 6.2pp (95% CI: [4.1, 8.3]) in ride request probability at exactly the 2.0x boundary. By contrast, the smooth price-demand slope is only −1.8pp per 0.1x surge increment. Conclusion: 4.4pp of the drop is pure "round number" framing effect. This led Uber to test "1.9x" labeling for surges in the 1.9–2.1 range, recovering ~$12M/quarter in rides.

Intuit Free Edition Eligibility Cutoff → Product Upgrade Behavior

Business Context: TurboTax Free Edition is available to filers below a complexity score threshold (based on number of forms, income sources, and deduction types). Filers scoring above the threshold are routed to a paid SKU recommendation. Product wants to know: does the forced upgrade cause abandonment, or would complex filers have struggled anyway?

Data Setup: Running variable: TurboTax complexity score (0–100, continuous). Cutoff: score = 35 (Free Edition eligibility boundary). 400K filers within ±10 points of the cutoff from TY24. Outcomes: (1) filing completion rate, (2) paid conversion rate, (3) customer satisfaction score (post-filing NPS).

Methodology: Local linear regression within the bandwidth, allowing different slopes on each side. McCrary density test confirms no bunching at the cutoff (users can't precisely manipulate their complexity score). Robustness: results hold across bandwidths of 5, 8, 10, and 15.

Results: At the cutoff, being nudged to a paid SKU causes: (1) −7.3pp filing completion (users who see a price wall after starting for free abandon at higher rates), (2) +22pp paid conversion among those who continue, (3) −12 NPS points. Net revenue impact: positive ($4.80 incremental revenue per filer at the cutoff), but the NPS hit is a long-term retention risk. This informed the design of a "soft upgrade" flow that previews paid features before the paywall.

Interactive: Visualize the Discontinuity

[Interactive chart showing the outcome jump at the cutoff.]

Python Example

# Regression Discontinuity Design
import numpy as np
from sklearn.linear_model import LinearRegression

def rdd_estimate(running_var, outcome, cutoff, bandwidth):
    """Local linear RDD estimator."""
    mask = np.abs(running_var - cutoff) <= bandwidth
    X_local = running_var[mask]
    Y_local = outcome[mask]
    T_local = (X_local >= cutoff).astype(float)
    X_centered = X_local - cutoff

    # Y = α + τ·T + β₁·X + β₂·T·X + ε
    design = np.column_stack([T_local, X_centered, T_local * X_centered])
    reg = LinearRegression().fit(design, Y_local)
    tau = reg.coef_[0]  # coefficient on T is the RDD effect
    return tau

# Example: Uber surge threshold at multiplier = 2.0
np.random.seed(42)
n = 2000
surge_mult = np.random.uniform(1.5, 2.5, n)  # Running variable
cutoff = 2.0
treated = (surge_mult >= cutoff).astype(float)
# Ride request probability drops by 12pp at the 2x threshold
ride_request = 0.7 - 0.1*(surge_mult - 1.5) - 0.12*treated + np.random.normal(0, 0.08, n)

tau = rdd_estimate(surge_mult, ride_request, cutoff, bandwidth=0.3)
print(f"RDD effect at cutoff: {tau:.3f}")  # True: -0.12
Observational (Panel)

Difference-in-Differences (DiD)

Compare the change over time in a treated group vs. a control group. The "difference of differences" removes time-invariant confounders.

DiD Estimator
τ_DiD = (Ȳ_T,post − Ȳ_T,pre) − (Ȳ_C,post − Ȳ_C,pre)
The treatment effect is the excess change in the treated group beyond what the control experienced.

Statistical Perspective

DiD is a panel data method that controls for time-invariant unobserved confounders by differencing. The first difference (within-group, over time) removes unit-level fixed effects. The second difference (between groups) removes common time shocks. What remains is the treatment effect.

The regression form: Y_it = α + β₁·Treat_i + β₂·Post_t + τ·Treat_i×Post_t + ε_it. The coefficient τ on the interaction term is the DiD estimator. With panel data, include unit and time fixed effects: Y_it = α_i + γ_t + τ·D_it + ε_it, where D_it = 1 if unit i is treated at time t. Cluster standard errors at the unit level (the level of treatment assignment).

Parallel trends: The identifying assumption. In potential outcomes: E[Y(0)_T,post − Y(0)_T,pre] = E[Y(0)_C,post − Y(0)_C,pre]. It's untestable for the post-period (you can't observe the treated group's counterfactual), but you can check pre-treatment trends for parallelism. Event-study specifications test for pre-trends and visualize dynamic treatment effects.
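The event-study check can be sketched in a few lines of statsmodels. This is a toy simulation (a city panel with treatment switching on at period 6, and the last pre-period t = 5 as the omitted baseline), not the article's data; near-zero pre-period interaction coefficients are the parallel-trends evidence:

```python
# Event-study sketch: one treated-vs-control gap per period, relative to
# the last pre-treatment period (t = 5). Toy simulation, not article data.
import re
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for city in range(50):
    treat = int(city < 10)                      # 10 treated, 40 control cities
    base = rng.normal(50, 5)                    # city fixed effect
    for t in range(12):
        effect = 15 * treat * (t >= 6)          # treatment switches on at t = 6
        rows.append({'city': city, 't': t, 'treat': treat,
                     'y': base + 2 * t + effect + rng.normal(0, 2)})
df = pd.DataFrame(rows)

# Interact treatment with period dummies, omitting t = 5 as the baseline
m = smf.ols('y ~ C(t, Treatment(5)) * treat', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['city']})

pre, post = {}, {}
for name, coef in m.params.items():
    if ':treat' in name:
        period = int(re.search(r'\[T\.(\d+)\]', name).group(1))
        (pre if period < 6 else post)[period] = coef
        print(f"t={period:2d}: {coef:+6.2f}")
# Pre-period coefficients sit near 0 (parallel trends); post-period near +15.
```

Plotting the pre-period coefficients with their confidence intervals is the usual "event-study plot"; here we just print them.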

Intuitive Perspective

Imagine a restaurant chain testing a new menu in 5 locations. Sales go up 20% in those 5 stores. Is that the new menu? Maybe — but it's also December, and ALL stores see holiday sales bumps.

DiD's trick: Look at the 15 stores that DIDN'T get the new menu. They went up 12% (the holiday effect). So the new menu's actual effect is 20% − 12% = 8%. You subtracted out the common trend that would have happened regardless.

The key assumption (parallel trends): Without the new menu, the 5 test stores would have followed the same 12% trajectory as the others. This is plausible if the stores were trending similarly before the change. If the test stores were already growing faster (maybe they're in booming neighborhoods), DiD over-attributes that growth to the menu.

When it's powerful: DiD shines when you have "before and after" data for both groups. It handles the biggest worry in observational studies — that the treatment and control groups are fundamentally different — by saying "I don't care if they're different in levels, only that they would have changed at the same rate."

Uber New Driver Onboarding Program — Staggered City Rollout

Business Context: Uber redesigned its driver onboarding flow (streamlined document upload, in-app vehicle inspection scheduling, guaranteed first-week earnings). The rollout was staggered: Phase 1 cities (Austin, Nashville, Portland, Raleigh, Salt Lake City) launched in March; Phase 2 cities would launch in June. The 15 Phase-2 cities serve as the control group. The key question: does the new onboarding causally increase the 30-day driver activation rate (first trip completed within 30 days of signup)?

Data Setup: Monthly driver activation rates for 20 cities, spanning 12 months before launch (Jan–Dec) and 4 months after (Jan–Apr). Outcome: 30-day activation rate. Pre-treatment: 12 months of data to verify parallel trends.

Parallel Trends Verification: Plot pre-treatment trends for treated vs. control cities. They track closely (both rising ~0.8pp/month due to seasonal labor market tightening). Formal test: interaction of treatment-group × time in the pre-period has no significant coefficients (p = 0.62). Also run a placebo test: pretend the treatment happened 6 months earlier — the "effect" is 0.1pp and insignificant.

Results: DiD regression with city and time fixed effects, clustered SEs at the city level: τ = +14.8pp (95% CI: [10.2, 19.4]). The new onboarding flow nearly doubled the activation rate from a baseline of 18% to 33%. Staggered DiD (Callaway–Sant'Anna) confirms no dynamic treatment effect heterogeneity — the effect is immediate and stable.

Takeaway: This analysis was presented to the board as justification for accelerating the Phase 2 rollout. The guaranteed earnings component alone was estimated to account for ~60% of the effect (via a follow-up mediation analysis). Each incremental activated driver generates ~$1,200/month in gross bookings.

Intuit State Tax Credit Launch → TurboTax Adoption

Business Context: California introduced a new refundable child tax credit ("CalKids") in TY24. Intuit's government affairs team wants to quantify whether this policy change causally increased TurboTax adoption in California, since TurboTax prominently supported the new credit in its interview flow. The business case: if state-specific tax law support drives adoption, it justifies investing engineering resources in rapid state conformity updates.

Data Setup: Treatment: California (new credit). Control: 10 comparable states without significant TY24 tax law changes (selected by: population size, prior TurboTax market share, income distribution, and filing season cadence). Outcome: weekly new TurboTax starts per capita. Pre-period: TY22–TY23 (2 years of weekly data). Post-period: TY24 filing season (Jan–Apr).

Methodology: Two-way fixed effects DiD with state and week fixed effects. Cluster SEs at the state level (only 11 clusters, so use wild cluster bootstrap for valid inference). Event-study specification: estimate week-by-week treatment effects to visualize the timing of the effect — expect it to appear when TurboTax's CalKids marketing campaign launched (late January).
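With only ~11 clusters, conventional clustered SEs are unreliable; the wild cluster bootstrap resamples by flipping each cluster's residual block with a random ±1 (Rademacher) weight. Below is a bare-bones percentile sketch on simulated state-week data — it uses 4 treated states to keep the toy well-behaved (inference with a single treated cluster is harder), and a real analysis would use a dedicated implementation such as Stata's boottest:

```python
# Wild cluster bootstrap sketch for a DiD coefficient with few clusters.
# Simulated state-week panel; NOT the article's data.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_weeks = 11, 30
state = np.repeat(np.arange(n_states), n_weeks)
week = np.tile(np.arange(n_weeks), n_states)
treat = (state < 4).astype(float)              # 4 treated states (toy choice)
post = (week >= 20).astype(float)
d = treat * post
y = 10 + 0.1 * week + 2.0 * d + rng.normal(0, 1, d.size)   # true tau = 2

X = np.column_stack([np.ones_like(d), treat, post, week, d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
tau_hat = beta[-1]

# Resample: multiply each state's residual block by a random +/-1 weight
taus = []
for _ in range(999):
    w = rng.choice([-1.0, 1.0], size=n_states)
    y_star = X @ beta + resid * w[state]
    taus.append(np.linalg.lstsq(X, y_star, rcond=None)[0][-1])

lo, hi = np.quantile(taus, [0.025, 0.975])
print(f"tau_hat = {tau_hat:.2f}, 95% wild-cluster CI = [{lo:.2f}, {hi:.2f}]")
```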

Results: Event study shows: zero effect in pre-period (parallel trends validated), a sharp +6.2% lift in TurboTax starts starting the week the CalKids campaign launched, growing to +9.1% by late March. Overall DiD: +7.4% (95% CI: [4.8, 10.0]) more new TurboTax starts per capita in California vs. control states. This translates to ~85K incremental new customers and ~$6.8M in revenue attributable to the rapid CalKids support.

Takeaway: The ROI on the CalKids engineering sprint (3 engineers × 4 weeks = ~$120K cost) was 57:1. This created an internal playbook: for every major state tax law change, auto-prioritize a "first-mover" TurboTax integration to capture adoption lift.


Python Example

# Difference-in-Differences
import numpy as np
import statsmodels.api as sm

# Example: Uber new onboarding in treated cities
np.random.seed(42)
n_cities, n_periods = 20, 12
treated_cities = np.array([1]*5 + [0]*15)  # 5 treated, 15 control
treatment_period = 6

data = []
for city in range(n_cities):
    baseline = np.random.normal(50, 5)
    for t in range(n_periods):
        post = int(t >= treatment_period)
        treat = treated_cities[city]
        # Common trend + treatment effect of +15pp for treated post-period
        y = baseline + 2*t + 15*treat*post + np.random.normal(0, 3)
        data.append({'city': city, 't': t, 'treat': treat,
                    'post': post, 'y': y})

import pandas as pd
df = pd.DataFrame(data)

# DiD regression: Y = β₀ + β₁·Treat + β₂·Post + β₃·Treat×Post + ε
df['treat_post'] = df['treat'] * df['post']
X = sm.add_constant(df[['treat', 'post', 'treat_post']])
model = sm.OLS(df['y'], X).fit(cov_type='cluster', cov_kwds={'groups': df['city']})
print(model.summary().tables[1])
# treat_post coefficient ≈ 15 (the causal DiD effect)

Key Assumptions

  • Parallel Trends: Absent treatment, both groups would have followed the same trajectory — THE critical assumption
  • No Anticipation: Treatment group doesn't change behavior before treatment begins
  • No Spillovers: Control group is unaffected by treatment group's intervention
  • Stable Composition: The mix of units in each group doesn't change over time
Observational (Aggregate)

Synthetic Control Method

When you treat one unit (city, market), construct a weighted combination of untreated units that best matches the treated unit's pre-treatment trajectory. The gap post-treatment is the causal effect.

Synthetic Control
Ŷ_synth(t) = Σ wⱼ · Yⱼ(t), where weights wⱼ ≥ 0 and Σwⱼ = 1
Weights chosen to minimize pre-treatment prediction error for the treated unit.

Statistical Perspective

Synthetic control solves a constrained optimization: find non-negative weights w₁…w_J (summing to 1) over J donor units that minimize the pre-treatment MSPE: Σ_t (Y₁t − Σⱼ wⱼYⱼt)² for t in the pre-period. This is a convex program with a unique solution. The resulting weighted combination is the "synthetic" unit — a data-driven counterfactual.

Inference is non-standard (n=1 treated unit). The standard approach is permutation inference (placebo tests): apply the method to every donor unit as if IT were the treated one, compute each placebo's post/pre RMSPE ratio, and see where the treated unit ranks. If it ranks 1st out of 26 units, the p-value is 1/26 ≈ 0.038. This is an exact test that makes no distributional assumptions.
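The permutation logic can be sketched directly. For brevity this toy uses the average of the other units as each unit's "synthetic" (a real analysis would re-run the weight optimization for every placebo); unit 0 gets a +8 effect in the post-period:

```python
# Placebo/permutation test sketch for synthetic control inference.
import numpy as np

rng = np.random.default_rng(7)
n_units, T_pre, T_post = 10, 30, 10
T_total = T_pre + T_post
Y = 50 + 0.5 * np.arange(T_total) + rng.normal(0, 1, (n_units, T_total))
Y[0, T_pre:] += 8.0                      # unit 0 is treated: +8 post-period

def rmspe_ratio(i):
    """Post/pre RMSPE of unit i's gap vs. the average of all other units."""
    donors = np.delete(np.arange(n_units), i)
    gap = Y[i] - Y[donors].mean(axis=0)
    pre = np.sqrt(np.mean(gap[:T_pre] ** 2))
    post = np.sqrt(np.mean(gap[T_pre:] ** 2))
    return post / pre

ratios = np.array([rmspe_ratio(i) for i in range(n_units)])
rank = 1 + int(np.sum(ratios > ratios[0]))
print(f"treated ratio = {ratios[0]:.1f}, rank {rank}/{n_units}, "
      f"permutation p = {rank / n_units:.2f}")
```

The treated unit's post/pre ratio dwarfs every placebo's, so it ranks 1st and the permutation p-value is 1/10.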

Extensions: Augmented SC (adds an outcome model for bias correction), Penalized SC (regularization prevents overfitting to pre-period noise), and SC + DiD hybrids (re-weight AND de-mean, relaxing the "perfect pre-fit" requirement).

Intuitive Perspective

You renovated YOUR house and want to know if it increased your property value. There's no control version of your house. But your neighbors didn't renovate. The problem: no single neighbor's house is a perfect comparison — one is bigger, another is on a busier street, another was recently painted.

Synthetic control's idea: Build a "Frankenstein neighbor" — a weighted mix of actual neighbors that, BEFORE your renovation, had the same price trajectory as your house. Maybe 40% of Neighbor A + 35% of Neighbor B + 25% of Neighbor C tracks your house's price history almost perfectly.

After the renovation, your house's value diverges from this synthetic neighbor's. That gap is the renovation effect. If the synthetic neighbor would have predicted $500K and your house is now $540K, the renovation added ~$40K.

Why it's credible: You're not assuming any single comparison is good — you're letting the data find the best combination. And you can check: does the synthetic version actually track you well in the pre-period? If it doesn't, you know the method won't work here.

Uber In-App Tipping Launch — Austin Market Pilot

Business Context: Uber launched in-app tipping in Austin as a single-market pilot. The goal: measure the causal effect on driver supply (online hours per week). A/B testing is impossible — tipping is a market-level feature that affects all participants. There is no single city that perfectly matches Austin's marketplace dynamics (tech-heavy, college town, unique regulatory history from the 2016 Uber ban).

Data Setup: Treated unit: Austin. Donor pool: 25 mid-size US cities without tipping. Pre-treatment period: 40 weeks. Post-treatment: 12 weeks. Outcome: weekly driver online hours (per 1K population). Matching predictors: average fare, driver-to-rider ratio, weekday/weekend split, temperature, competitor presence (Lyft share), population growth rate.

Methodology: Optimize convex weights over the donor pool to minimize pre-treatment MSPE (mean squared prediction error) for Austin's driver hours series. Result: Synthetic Austin = 0.38 × Houston + 0.31 × Denver + 0.22 × Nashville + 0.09 × San Antonio. Pre-treatment fit: RMSPE = 1.2 hours (vs. Austin's mean of 85 hours/week/1K pop). Inference: permutation test — apply the same method to every donor city (pretending each was treated), compute the ratio of post/pre RMSPE. Austin's ratio ranks 1st of 26 (p = 0.038).

Results: Post-tipping, Austin's actual driver hours diverge from synthetic Austin by +5.8 hours/week/1K pop (a 6.8% increase). The effect appears in week 2 and stabilizes by week 5. Placebo tests: no other city shows a comparable post-period gap. Back-of-envelope: 6.8% more driver supply → 2.1% lower average wait times → estimated $3.4M/year in retained rides for the Austin market alone.

Takeaway: This analysis justified the nationwide tipping rollout. Critically, the synthetic control also revealed that the driver supply increase came from existing drivers working longer hours (intensive margin), not new drivers joining (extensive margin) — informing the driver engagement team's messaging strategy.

Intuit TV Campaign Impact on Dallas DMA

Business Context: Intuit's marketing team ran a 6-week TV campaign in the Dallas-Fort Worth DMA during tax season, spending $4.2M on spots during local news and prime time. They need to isolate the causal effect of TV advertising on new TurboTax customer acquisitions, but TV campaigns are DMA-level treatments with no within-DMA control group.

Data Setup: Treated: Dallas DMA. Donor pool: 30 DMAs where Intuit ran no TV ads during this period. Pre-treatment: 20 weeks (prior tax season + early current season). Post-treatment: 6 weeks of campaign + 4 weeks after. Outcome: weekly new TurboTax starts per 100K households. Predictors: median household income, broadband penetration, competitor (H&R Block) store density, prior-year TT market share, Hispanic population % (Dallas is a majority-minority DMA).

Results: Synthetic Dallas = 0.41 × Phoenix + 0.33 × Atlanta + 0.18 × Charlotte + 0.08 × Tampa. Pre-treatment RMSPE = 2.3 starts/100K (excellent fit). During the 6-week campaign, Dallas outperforms its synthetic by an average of 14.2 starts/100K/week (+11.8%). In the 4 weeks post-campaign, the gap shrinks to +4.1 starts/100K/week (a "decay" effect suggesting ad recall fading). Total incremental starts: ~38K. At $110 average revenue per new customer: $4.18M revenue vs. $4.2M spend — essentially breakeven on first-year revenue, but with positive LTV accounting for multi-year retention.

Takeaway: The synthetic control analysis changed the marketing team's strategy: TV campaigns are breakeven in year 1 but profitable over 3-year LTV. More importantly, the post-campaign decay curve informed optimal flight scheduling — the team shifted from one 6-week burst to two 3-week flights with a 2-week gap, increasing sustained awareness while maintaining the same budget.

Python Example

# Synthetic Control Method
import numpy as np
from scipy.optimize import minimize

def synthetic_control(treated_pre, control_pre, control_post):
    """Find weights that best match treated unit's pre-treatment trajectory."""
    n_controls = control_pre.shape[0]

    def objective(w):
        synth = w @ control_pre
        return np.sum((treated_pre - synth)**2)

    # Constraints: weights sum to 1, each ≥ 0
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}
    bounds = [(0, 1)] * n_controls
    w0 = np.ones(n_controls) / n_controls

    result = minimize(objective, w0, bounds=bounds, constraints=constraints)
    synth_post = result.x @ control_post
    return result.x, synth_post

# Example: Uber tipping feature in Austin
np.random.seed(42)
T_pre, T_post = 12, 8
# Austin (treated) β€” rides per week
austin_pre = 100 + np.cumsum(np.random.normal(1, 2, T_pre))
austin_post = austin_pre[-1] + np.cumsum(np.random.normal(1.5, 2, T_post)) + 7  # +7 treatment

# Donor pool
controls_pre = np.array([
    95 + np.cumsum(np.random.normal(1.1, 2, T_pre)),   # Houston
    80 + np.cumsum(np.random.normal(0.9, 2.5, T_pre)),  # Denver
    70 + np.cumsum(np.random.normal(1.0, 1.8, T_pre)),  # Nashville
])
controls_post = np.array([
    controls_pre[0,-1] + np.cumsum(np.random.normal(1.1, 2, T_post)),
    controls_pre[1,-1] + np.cumsum(np.random.normal(0.9, 2.5, T_post)),
    controls_pre[2,-1] + np.cumsum(np.random.normal(1.0, 1.8, T_post)),
])

weights, synth_post = synthetic_control(austin_pre, controls_pre, controls_post)
effect = austin_post.mean() - synth_post.mean()
print(f"Weights: {dict(zip(['Houston','Denver','Nashville'], weights.round(2)))}")
print(f"Estimated effect: {effect:.1f} rides/week")  # True: ~7
Observational (Endogeneity)

Instrumental Variables (IV)

When treatment is endogenous (confounded in unobservable ways), find an "instrument" — a variable that affects the outcome ONLY through its effect on treatment.

Two-Stage Least Squares (2SLS)
Stage 1: T̂ = α + γZ + ε   |   Stage 2: Y = β₀ + τT̂ + ν
Z is the instrument. τ is the causal effect, identified by variation in T driven only by Z.

Statistical Perspective

IV addresses endogeneity: when Cov(T, ε) ≠ 0 in the outcome equation Y = τT + ε (unmeasured confounders in ε correlate with treatment T). OLS is biased. The instrument Z provides exogenous variation in T — variation uncorrelated with ε.

2SLS: Stage 1 regresses T on Z (and covariates) to get T̂ — the "clean" part of treatment driven only by Z. Stage 2 regresses Y on T̂. The coefficient on T̂ is consistent for the causal effect because T̂ inherits Z's exogeneity. The Wald estimator (for binary Z) simplifies to: τ = Cov(Y,Z)/Cov(T,Z) = [E(Y|Z=1) − E(Y|Z=0)] / [E(T|Z=1) − E(T|Z=0)].

Estimand: Under heterogeneous effects, IV estimates the LATE (Local Average Treatment Effect) — the effect for "compliers" (those whose treatment status is shifted by the instrument). This is NOT the ATE. Weak instruments (low first-stage F) cause 2SLS to be biased toward OLS. Rule of thumb: F > 10 (Stock & Yogo). Modern: use Anderson–Rubin confidence sets that are valid even with weak instruments.
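For a binary instrument, the Wald ratio makes the mechanics transparent: divide the reduced-form difference in Y by the first-stage difference in T. A toy simulation (true effect τ = 2, with a confounder U that biases OLS upward; all numbers illustrative):

```python
# Wald estimator sketch: IV with a binary instrument as a ratio of
# two differences in means. Simulated data, not from the article.
import numpy as np

rng = np.random.default_rng(3)
n = 20000
Z = rng.binomial(1, 0.5, n)                         # instrument
U = rng.normal(0, 1, n)                             # unobserved confounder
T = 0.2 + 0.4 * Z + 0.3 * U + rng.normal(0, 0.5, n)  # endogenous treatment
Y = 1.0 + 2.0 * T + 1.5 * U + rng.normal(0, 1, n)    # true effect tau = 2

# Wald = reduced-form difference / first-stage difference
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())
# Naive OLS slope, biased upward by U
ols = np.cov(T, Y)[0, 1] / np.var(T)
print(f"OLS (biased): {ols:.2f}, Wald/IV: {wald:.2f}")
```

OLS overstates the effect because U pushes T and Y in the same direction; the Wald ratio recovers a value close to the true τ = 2.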

Intuitive Perspective

You want to know if exercise causes weight loss. But people who exercise also eat better, sleep more, and are generally more health-conscious. You can't separate the effect of exercise from these other factors.

The instrument: Find something that makes people exercise MORE (or LESS) but doesn't directly affect weight through any other channel. Example: living near a gym. People near gyms exercise more (relevance), but living near a gym doesn't directly change your metabolism or diet (exclusion restriction). It only affects your weight by getting you to the gym more often.

The logic: Compare weight outcomes for people near vs. far from gyms. Any weight difference must flow through the exercise channel (since distance doesn't affect weight otherwise). Divide the weight difference by the exercise difference → causal effect of exercise on weight.

The catch: You're only measuring the effect for people whose behavior actually changes based on gym proximity ("compliers"). Gym rats will exercise regardless; couch potatoes won't go even if the gym is next door. IV tells you about the "marginal" people in between — which is often exactly the policy-relevant group.

Uber — Does Wait Time Cause Rider Churn? Using Rain as an Instrument

Business Context: Uber's retention team observes that riders who experience longer wait times have lower 30-day retention. But this is classic endogeneity: impatient riders (who are also more likely to churn for other reasons) tend to request rides during peak hours when wait times are longest. Patience, mood, and lifestyle are unmeasured confounders affecting both the "treatment" (wait time) and the outcome (retention). An experiment that deliberately delays pickups is unethical and would cause immediate rider loss.

The Instrument — Rain: Sudden rainfall reduces driver supply (fewer drivers want to drive in bad weather), which increases wait times. Critically, rain is plausibly exogenous to a rider's underlying loyalty — it doesn't directly make you like or dislike Uber (exclusion restriction). Rain → ↓ Driver Supply → ↑ Wait Time → ? Retention. The only channel through which rain affects retention is via wait time.

Data Setup: 1.2M ride requests across 15 cities over 6 months. Instrument Z: binary indicator for unexpected rainfall (actual precipitation exceeds the forecast by ≥ 0.5 inches, within ±2 hours of the ride request). Treatment T: actual wait time (continuous, minutes). Outcome Y: binary 30-day retention (≥1 ride in next 30 days). Covariates: city, day-of-week, hour, rider tenure.

First Stage: Rain → Wait Time: coefficient = +3.8 minutes (F-stat = 847, far above the weak instrument threshold of 10). Unexpected rain adds ~4 minutes to average wait times.

Results: Naive OLS: each +1 minute wait → −0.4pp retention (biased toward zero because impatient people both wait less AND churn more). IV/2SLS: each +1 minute wait → −1.2pp retention (3x larger than OLS). The true causal harm of wait time was being masked by confounding. At Uber's scale, reducing average wait times by 1 minute would retain ~180K additional monthly active riders, worth ~$43M/year in gross bookings.

Robustness: Over-identification test using a second instrument (local sports events that pull drivers to stadium areas). Both instruments give consistent estimates (Hansen J p = 0.34). Exclusion restriction argument: rain might affect rider mood → different retention. Sensitivity analysis shows results hold unless rain has a direct effect on retention ≥ 0.8pp (implausible for a single rainy ride).

Intuit Expert Queue Time as an Instrument for Live Consultation Impact

Business Context: TurboTax Live lets users connect with a CPA. Users who engage complete filing at 89% vs. 71% for non-engagers. But self-selection is severe: motivated users with high-stakes returns both seek help AND try harder to complete. OLS with observed covariates can't fully control for motivation, tax anxiety, or "amount at stake" (partially unobservable).

The Instrument — Queue Wait Time: Expert availability varies quasi-randomly throughout the day based on expert shift schedules, concurrent session load, and timezone coverage gaps. A user who clicks "Talk to Expert" at 2:47 PM PST might face a 2-minute wait; the same user clicking at 2:52 PM might face a 25-minute wait. This variation in queue time affects whether users actually connect (relevance) but is plausibly unrelated to a user's underlying filing ability or motivation (exclusion).

Data Setup: 90K users who clicked the "Talk to Expert" button (intent-to-treat population). Instrument Z: expert queue wait time at the moment of click (minutes). Treatment T: binary, actually completed a consultation (users who face long queues abandon the queue). Outcome Y: filing completion. Covariates: time-of-day, day-of-week, product tier, form complexity.

First Stage: Each +5 min queue wait → −8.3pp probability of completing a consultation (F-stat = 312). Users are very sensitive to queue times — 40% abandon after 10 minutes.

Results: OLS: expert consultation → +18pp completion (severely upward biased). IV/2SLS: +11.2pp (95% CI: [7.8, 14.6]). The true causal effect is 11pp — still large, but 40% of the naive estimate was self-selection bias. This recalibrated the ROI model for expert staffing: at $35/session and 11pp incremental completion ($13.20 incremental revenue), the margin is thinner than the OLS-based model suggested. It shifted investment toward reducing queue times (which the first stage showed have a huge effect on uptake) rather than simply hiring more experts.

Python Example

# Instrumental Variables — 2SLS
import numpy as np
from sklearn.linear_model import LinearRegression

def two_stage_ls(y, treatment, instrument, covariates=None):
    """Two-Stage Least Squares IV estimator."""
    Z = instrument.reshape(-1, 1) if instrument.ndim == 1 else instrument
    if covariates is not None:
        Z = np.column_stack([Z, covariates])

    # Stage 1: Regress treatment on instrument (+ covariates)
    stage1 = LinearRegression().fit(Z, treatment)
    t_hat = stage1.predict(Z)
    # First-stage F-statistic: (R^2 / k) / ((1 - R^2) / (n - k - 1))
    r2, n_obs, k = stage1.score(Z, treatment), len(y), Z.shape[1]
    first_stage_f = (r2 / k) / ((1 - r2) / (n_obs - k - 1))

    # Stage 2: Regress outcome on predicted treatment
    X2 = t_hat.reshape(-1, 1)
    if covariates is not None:
        X2 = np.column_stack([X2, covariates])
    stage2 = LinearRegression().fit(X2, y)

    return {'iv_estimate': stage2.coef_[0], 'first_stage_f': first_stage_f}

# Example: Rain → driver supply → wait time → retention
np.random.seed(42)
n = 5000
rain = np.random.binomial(1, 0.3, n)           # Instrument
unobs_loyalty = np.random.normal(0, 1, n)     # Unobserved confounder
wait_time = 5 + 4*rain + 1*unobs_loyalty + np.random.normal(0, 2, n)  # Treatment
retention = 80 - 1.2*wait_time + 5*unobs_loyalty + np.random.normal(0, 3, n)

# Naive OLS (biased — omits loyalty)
naive = LinearRegression().fit(wait_time.reshape(-1,1), retention)
print(f"Naive OLS: {naive.coef_[0]:.2f}")  # Biased toward 0 (confounder)

# IV estimate (unbiased)
iv = two_stage_ls(retention, wait_time, rain)
print(f"IV estimate: {iv['iv_estimate']:.2f}")  # Close to -1.2

Key Assumptions

  • Relevance: Instrument must be correlated with treatment (first-stage F > 10 rule of thumb)
  • Exclusion Restriction: Instrument affects outcome ONLY through treatment — untestable and must be argued
  • Independence: Instrument is as-good-as-random (not correlated with unobserved confounders)
  • Monotonicity: Instrument shifts everyone in the same direction (no "defiers")
Mechanism

Mediation Analysis (ACME / ADE)

Decompose a total effect into the indirect effect (through a mediator) and the direct effect (everything else). Answers "WHY did this treatment work?"

Path diagram: Treatment →(a) Mediator →(b) Outcome, plus a direct path Treatment →(c') Outcome.
ACME = a × b (indirect)   |   ADE = c' (direct)   |   Total = ACME + ADE

Statistical Perspective

Mediation analysis decomposes the total effect into Natural Direct Effect (NDE/ADE) and Natural Indirect Effect (NIE/ACME) using the potential outcomes framework. Define M(t) as the mediator value under treatment t, and Y(t, m) as the outcome under treatment t and mediator m. Then:

NIE = E[Y(1, M(1)) − Y(1, M(0))] — what happens if you "flip" the mediator to its treatment value while holding treatment constant. NDE = E[Y(1, M(0)) − Y(0, M(0))] — the treatment effect if the mediator were held at its control value.

Sequential ignorability (Imai et al.): requires (1) treatment is randomly assigned (or ignorable conditional on covariates), AND (2) the mediator is ignorable conditional on treatment and pre-treatment covariates. Condition (2) is extremely strong — it rules out ANY post-treatment confounders of the mediator-outcome relationship. Sensitivity analysis (varying ρ, the correlation between mediator and outcome error terms) is essential.

In the linear case with no interaction: ACME = â × b̂ (product of coefficients). With interaction: ACME depends on treatment status. Bootstrap CIs are preferred over Sobel's test (which assumes normality of the product).
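The bootstrap CI for the product â × b̂ is a few lines of NumPy. A toy simulation (true a = 0.9, b = 0.5, so true ACME = 0.45; all numbers illustrative):

```python
# Bootstrap CI sketch for ACME = a*b, preferred over Sobel's normal
# approximation. Simulated T -> M -> Y data.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
T = rng.binomial(1, 0.5, n).astype(float)
M = 1.0 + 0.9 * T + rng.normal(0, 1, n)             # path a = 0.9
Y = 0.5 + 0.3 * T + 0.5 * M + rng.normal(0, 1, n)   # path b = 0.5, c' = 0.3

def acme(idx):
    t, m, y = T[idx], M[idx], Y[idx]
    a = np.polyfit(t, m, 1)[0]                      # slope of M ~ T
    # b: coefficient on M in Y ~ T + M, via least squares
    X = np.column_stack([np.ones(len(t)), t, m])
    b = np.linalg.lstsq(X, y, rcond=None)[0][2]
    return a * b

point = acme(np.arange(n))
boot = [acme(rng.integers(0, n, n)) for _ in range(999)]
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"ACME = {point:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

Because the CI excludes zero, mediation is significant without invoking Sobel's normality assumption on the product a·b.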

Intuitive Perspective

A new drug reduces blood pressure. Great — but why? Does it relax blood vessels directly? Or does it reduce stress hormones, which then lowers blood pressure? Knowing the "why" matters: if it works entirely through stress hormones, you might find a better (cheaper, fewer side effects) way to target those hormones directly.

Mediation's question: How much of Drug → Blood Pressure goes through Drug → Stress Hormones → Blood Pressure (indirect path) vs. Drug → Blood Pressure directly?

How it works: (1) Measure how much the drug changes stress hormones (path a). (2) Measure how much stress hormones affect blood pressure, controlling for the drug (path b). (3) Multiply: a × b = the indirect effect (ACME). (4) Whatever's left over from the total effect is the direct effect (ADE).

In business: Your A/B test won. Mediation tells you which feature of the change drove the win. This is the difference between "the redesign worked, do more redesigns" (vague) and "65% of the win came from the ETA display, invest there specifically" (actionable). It turns experiment results into a product roadmap.

Uber Rider App Redesign — Which Mechanism Drives the Lift?

Business Context: Uber's rider app team shipped a major home screen redesign that increased ride request rate by 4.2pp in an A/B test. Leadership is deciding where to allocate the next quarter's engineering resources. The redesign had two major changes: (1) a new real-time ETA display with animated driver icons, and (2) a simplified, one-tap ride request flow. Which change deserves credit? Building more ETA features is a different roadmap than simplifying UX.

Data Setup: 500K users in the A/B test. Treatment T: new home screen (binary). Mediator M: user's perceived wait time, measured via a post-request survey ("How long do you think your pickup will take?" — collected for 30% of rides). Outcome Y: binary ride request within the session. Key: the mediator is measured AFTER treatment assignment but BEFORE the outcome (the survey fires when a user opens the app but before they decide to request).

Methodology: Causal mediation analysis using the potential outcomes framework. Step 1: Regress M on T → the new design reduces perceived wait by 1.8 minutes (path a). Step 2: Regress Y on T and M → each minute of perceived wait reduces request probability by 1.4pp (path b), and the direct effect of the redesign holding perceived wait constant is +1.5pp (path c'). ACME = a × b = 1.8 × 1.4 = 2.52pp. ADE = 1.5pp. Total = 4.02pp (close to the observed 4.2pp).

Results: 63% of the total effect is mediated through perceived wait time (ACME = 2.5pp), while 37% is the direct effect of the simpler UX (ADE = 1.5pp). Bootstrap 95% CI for proportion mediated: [52%, 74%]. Sensitivity analysis (Imai et al. ρ parameter): results hold unless the correlation between unmeasured confounders of M→Y exceeds ρ = 0.35.

Takeaway: The mediation analysis changed the Q2 roadmap. Instead of further UX simplification (which would have addressed only the 37% ADE), the team invested in: (1) more granular ETA predictions (driver-specific models), (2) animated progress indicators, and (3) under-promise/over-deliver ETA calibration. The follow-up experiment on these ETA-focused improvements yielded an additional 3.1pp lift — validating the mediation finding.

Intuit Simplified Tax Interview → User Confidence → Filing Completion

Business Context: TurboTax redesigned the tax interview to use plain-language questions instead of tax jargon (e.g., "Did you have a job?" instead of "Do you have W-2 income?"). An A/B test showed a 5.1pp lift in filing completion. But the product team has competing theories for WHY it works: (A) users feel more confident because they understand the questions, leading them to continue; (B) the simpler interface reduces cognitive load, making it faster to finish; (C) fewer jargon terms means fewer users Googling mid-flow and getting lost. Understanding the mechanism determines the next investment.

Data Setup: Treatment T: new plain-language interview (binary). Mediator M: user confidence score — measured via a 3-question in-app micro-survey at the 40% completion mark ("How confident are you that you're entering information correctly?" 1–7 scale). This measurement point is after enough treatment exposure to affect confidence but early enough that many users haven't yet decided to abandon. Outcome Y: filing completion. n = 120K users (60K per arm, 35% survey response rate → 42K with mediator data).

Methodology: Path a: New interview → +0.9 points on the confidence scale (p < 0.001). Path b: Each +1 confidence point → +3.1pp completion probability (controlling for treatment). Path c' (ADE): +2.3pp direct effect on completion (holding confidence constant). ACME = 0.9 × 3.1 = 2.79pp. ADE = 2.3pp. Total = 5.09pp. Proportion mediated = 55%.

Sensitivity & Robustness: (1) Selection into survey: compare survey responders vs. non-responders on observables — no significant differences. (2) Sequential ignorability: the critical untestable assumption. A post-treatment confounder like "tax situation complexity" could affect both confidence and completion. Include form count and AGI as covariates in the mediator model — ACME drops slightly to 2.4pp (still 47% mediated). (3) Alternative mediator: time-to-40%-completion (cognitive load proxy) mediates an additional 18% of the total effect, partially overlapping with confidence.

Takeaway: Confidence is the dominant mechanism (55%), cognitive load adds another 18%, and the remaining ~27% is through other pathways (possibly the "less Googling" hypothesis, which they couldn't directly measure). The product team's Q2 investment: (1) expand plain-language to the review and filing sections (confidence pathway), (2) add "you're doing great" progress affirmations at key milestones, (3) contextual tooltips that explain why a question is asked (addressing remaining uncertainty). The follow-up experiment on confidence-focused features yielded +2.8pp additional completion lift.

The Baron & Kenny Linear Regression Approach (Classical Method)

This is the simplest and most widely used approach: mediation analysis reduces to 3 ordinary linear regressions plus one multiplication. Here's exactly how it works, step by step.

The 3 Regressions

REGRESSION 1 — Total Effect
Y = i₁ + c·T + ε₁
Regress outcome on treatment only. The coefficient c is the total effect.
REGRESSION 2 — Treatment → Mediator
M = i₂ + a·T + ε₂
Regress mediator on treatment. The coefficient a is how much treatment moves the mediator.
REGRESSION 3 — Both → Outcome
Y = i₃ + c'·T + b·M + ε₃
Regress outcome on treatment AND mediator. c' = direct effect (ADE). b = mediator's effect on outcome. (Each regression gets its own intercept, written i₁, i₂, i₃.)

Then Just Multiply

INDIRECT EFFECT (ACME)
ACME = a × b
DIRECT EFFECT (ADE)
ADE = c'
VERIFICATION
c = a×b + c'
Total = Indirect + Direct (holds exactly in linear OLS)
% MEDIATED
(a × b) / c

Baron & Kenny's 4 Conditions

  1. c is significant — treatment affects outcome (total effect exists)
  2. a is significant — treatment affects mediator
  3. b is significant — mediator affects outcome (controlling for treatment)
  4. c' < c — direct effect shrinks when the mediator is included. If c' = 0 → full mediation. If 0 < c' < c → partial mediation.

Python — Baron & Kenny with statsmodels (Full Output with p-values)

# Baron & Kenny Mediation — The Linear Regression Approach
# This is the classical method: just 3 regressions + a multiplication
import numpy as np
import statsmodels.api as sm

np.random.seed(42)
n = 3000

# ── Simulate: TurboTax simplified interview → confidence → completion ──
T = np.random.binomial(1, 0.5, n).astype(float)  # Treatment: new interview
M = 4.0 + 0.9 * T + np.random.normal(0, 1.2, n)    # Mediator: confidence (1-7)
Y = 0.3 + 0.023 * T + 0.031 * M + np.random.normal(0, 0.15, n)  # Outcome: completion

# ── REGRESSION 1: Total Effect (Y ~ T) ──
reg1 = sm.OLS(Y, sm.add_constant(T)).fit()
c_total = reg1.params[1]
print("═══ Regression 1: Y ~ T (Total Effect) ═══")
print(f"  c (total effect) = {c_total:.4f},  p = {reg1.pvalues[1]:.4f}")

# ── REGRESSION 2: Treatment → Mediator (M ~ T) ──
reg2 = sm.OLS(M, sm.add_constant(T)).fit()
a = reg2.params[1]
print(f"\n═══ Regression 2: M ~ T (Treatment → Mediator) ═══")
print(f"  a = {a:.4f},  p = {reg2.pvalues[1]:.4f}")

# ── REGRESSION 3: Both → Outcome (Y ~ T + M) ──
X3 = sm.add_constant(np.column_stack([T, M]))
reg3 = sm.OLS(Y, X3).fit()
c_prime = reg3.params[1]  # Direct effect (ADE)
b = reg3.params[2]        # Mediator → Outcome
print(f"\n═══ Regression 3: Y ~ T + M (Direct + Mediator) ═══")
print(f"  c' (direct/ADE) = {c_prime:.4f},  p = {reg3.pvalues[1]:.4f}")
print(f"  b  (M → Y)      = {b:.4f},  p = {reg3.pvalues[2]:.4f}")

# ── DECOMPOSITION ──
acme = a * b
total = acme + c_prime
pct = acme / total * 100
print(f"\n═══ Mediation Decomposition ═══")
print(f"  Indirect (ACME) = a × b = {a:.4f} × {b:.4f} = {acme:.4f}")
print(f"  Direct   (ADE)  = c'    = {c_prime:.4f}")
print(f"  Total            = {total:.4f}  (should ≈ c = {c_total:.4f})")
print(f"  % Mediated       = {pct:.1f}%")

# ── BARON & KENNY 4 CONDITIONS ──
print(f"\n═══ Baron & Kenny Checklist ═══")
print(f"  1. c  significant?  {'✓' if reg1.pvalues[1] < 0.05 else '✗'}  (p={reg1.pvalues[1]:.4f})")
print(f"  2. a  significant?  {'✓' if reg2.pvalues[1] < 0.05 else '✗'}  (p={reg2.pvalues[1]:.4f})")
print(f"  3. b  significant?  {'✓' if reg3.pvalues[2] < 0.05 else '✗'}  (p={reg3.pvalues[2]:.4f})")
print(f"  4. c' < c?          {'✓' if abs(c_prime) < abs(c_total) else '✗'}  ({c_prime:.4f} < {c_total:.4f})")
print(f"  → {'Full' if reg3.pvalues[1] > 0.05 else 'Partial'} mediation")

When Linear Regression Is Enough vs. When You Need More

Linear OLS is fine when:

  • Treatment is randomized (A/B test)
  • Mediator and outcome are continuous
  • Relationships are approximately linear
  • No treatment Γ— mediator interaction (the effect of M on Y doesn't change by treatment group)
  • You're okay with Sobel's test or bootstrap for indirect effect CI

You need something fancier when:

  • Binary outcome (e.g., completed filing yes/no) β†’ logistic regression; aΓ—b doesn't decompose cleanly on the probability scale
  • Treatment Γ— mediator interaction β†’ ACME becomes treatment-dependent: ACME(t=1) β‰  ACME(t=0)
  • Multiple mediators β†’ SEM or path analysis to handle correlated mediators (confidence AND cognitive load)
  • Post-treatment confounders β†’ sensitivity analysis (Imai's ρ) or IV-based mediation
  • Non-linear mediator relationships β†’ nonparametric / semiparametric estimators

Python — Bootstrap + sklearn (Original Approach)

# Causal Mediation Analysis — sklearn version with bootstrap CIs
import numpy as np
from sklearn.linear_model import LinearRegression

def mediation_analysis(treatment, mediator, outcome, n_bootstrap=1000):
    """Estimate ACME, ADE, and Total Effect with bootstrap CIs."""
    n = len(treatment)
    T, M, Y = treatment.reshape(-1,1), mediator, outcome

    # Path a: Treatment → Mediator
    model_m = LinearRegression().fit(T, M)
    a = model_m.coef_[0]

    # Paths b and c': [Treatment, Mediator] → Outcome
    X_full = np.column_stack([treatment, mediator])
    model_y = LinearRegression().fit(X_full, Y)
    c_prime = model_y.coef_[0]  # ADE (direct)
    b = model_y.coef_[1]

    acme = a * b     # Average Causal Mediation Effect (indirect)
    ade = c_prime     # Average Direct Effect
    total = acme + ade

    # Bootstrap confidence intervals
    acme_boot = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        m_b = LinearRegression().fit(T[idx], M[idx])
        X_b = np.column_stack([treatment[idx], mediator[idx]])
        y_b = LinearRegression().fit(X_b, Y[idx])
        acme_boot.append(m_b.coef_[0] * y_b.coef_[1])

    ci = np.percentile(acme_boot, [2.5, 97.5])
    return {'ACME': acme, 'ADE': ade, 'Total': total,
            'Prop_Mediated': acme/total, 'ACME_CI': ci}

# Example: Uber app redesign → ETA perception → ride requests
np.random.seed(42)
n = 3000
treatment = np.random.binomial(1, 0.5, n).astype(float)
# Mediator: perceived wait time (lower = better)
perceived_wait = 8 - 2*treatment + np.random.normal(0, 2, n)
# Outcome: ride request probability
ride_request = 0.5 + 0.05*treatment - 0.03*perceived_wait + np.random.normal(0, 0.1, n)

result = mediation_analysis(treatment, perceived_wait, ride_request)
print(f"ACME (indirect via ETA): {result['ACME']:.4f}")
print(f"ADE  (direct effect):    {result['ADE']:.4f}")
print(f"Total Effect:            {result['Total']:.4f}")
print(f"% Mediated:              {result['Prop_Mediated']:.1%}")
print(f"ACME 95% CI:             [{result['ACME_CI'][0]:.4f}, {result['ACME_CI'][1]:.4f}]")

Key Assumptions

  • Sequential Ignorability: (1) Treatment is randomly assigned, AND (2) mediator is as-good-as-random conditional on treatment and observed covariates β€” this is VERY strong
  • No treatment-mediator interaction: In the simple linear case; extensions handle interactions
  • Correct model specification: Misspecified mediator or outcome model β†’ biased decomposition
  • No post-treatment confounders: No variable affected by treatment that also confounds the mediator-outcome relationship
Heterogeneous Effects

Uplift Modeling / Heterogeneous Treatment Effects (HTE)

Not everyone responds the same way. Uplift modeling estimates the Conditional Average Treatment Effect (CATE) — the treatment effect for each individual or segment.

Conditional Average Treatment Effect
τ(x) = E[Y(1) − Y(0) | X = x]
Meta-learners (S-, T-, X-learner) and causal forests estimate this function.

Statistical Perspective

Standard A/B tests estimate the ATE = E[Y(1) − Y(0)]. Uplift modeling estimates the CATE (Conditional Average Treatment Effect) τ(x) = E[Y(1) − Y(0) | X = x] — the treatment effect as a function of individual characteristics. The fundamental challenge: you observe Y(1) or Y(0) for each unit, never both.

Meta-learners: The T-Learner trains separate outcome models for treatment and control: τ̂(x) = μ̂₁(x) − μ̂₀(x). The S-Learner trains one model with treatment as a feature, μ̂(x,t), and computes τ̂(x) = μ̂(x,1) − μ̂(x,0). The X-Learner is a three-stage procedure that imputes individual treatment effects and uses propensity scores to weight between treatment- and control-group estimates — more efficient when the groups are imbalanced.

The DR-Learner (doubly robust) combines propensity scoring with outcome modeling for robustness. Causal forests (Wager & Athey) adapt random forests to split on treatment-effect heterogeneity rather than outcome prediction. Evaluation: the uplift curve (Qini curve) — sort users by predicted CATE, plot cumulative incremental outcome vs. fraction targeted.
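The X-Learner's three stages can be sketched directly. The GBM base learners and the fixed propensity e(x) = 0.5 (a 50/50 randomized experiment) are illustrative choices, not a prescription:

```python
# X-Learner sketch: per-arm outcome models, imputed individual effects,
# then a propensity-weighted blend. Base learners are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def x_learner(X, t, y, e=0.5):
    treated, control = t == 1, t == 0
    # Stage 1: per-arm outcome models
    mu1 = GradientBoostingRegressor(n_estimators=100).fit(X[treated], y[treated])
    mu0 = GradientBoostingRegressor(n_estimators=100).fit(X[control], y[control])
    # Stage 2: imputed individual effects, modeled per arm
    d1 = y[treated] - mu0.predict(X[treated])   # treated: observed - predicted control
    d0 = mu1.predict(X[control]) - y[control]   # control: predicted treated - observed
    tau1 = GradientBoostingRegressor(n_estimators=100).fit(X[treated], d1)
    tau0 = GradientBoostingRegressor(n_estimators=100).fit(X[control], d0)
    # Stage 3: propensity-weighted blend of the two CATE models
    return e * tau0.predict(X) + (1 - e) * tau1.predict(X)

# Tiny synthetic check: the true effect depends on a single feature
np.random.seed(0)
n = 4000
X = np.random.normal(0, 1, (n, 2))
t = np.random.binomial(1, 0.5, n)
tau_true = 0.5 + 0.5 * (X[:, 0] > 0)
y = X[:, 1] + t * tau_true + np.random.normal(0, 0.5, n)

cate = x_learner(X, t, y)
print(f"mean CATE estimate: {cate.mean():.3f} (true mean: {tau_true.mean():.3f})")
```

With imbalanced arms you would replace the fixed e with a fitted propensity model; here the 50/50 split makes the constant exact.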

Intuitive Perspective

A store sends everyone a 20%-off coupon and sees a 5% sales lift overall. But not everyone is the same:

β€’ "Sure Things" β€” would have bought anyway. The coupon just costs you margin.
β€’ "Persuadables" β€” buy BECAUSE of the coupon. This is where your ROI lives.
β€’ "Lost Causes" β€” won't buy regardless. Coupon is wasted.
β€’ "Sleeping Dogs" β€” the coupon actually annoys them (reminds them to unsubscribe). Negative effect.

Uplift modeling finds these groups. For each customer, it asks: "What's the DIFFERENCE in purchase probability with vs. without the coupon?" β€” not just "Will they buy?" A high-probability buyer isn't necessarily someone you should target (they might be a "Sure Thing"). You want the people with the biggest gap between coupon and no-coupon worlds.

The magic: By targeting only "Persuadables" (say top 30% by predicted uplift), you get 80% of the incremental sales at 30% of the cost. The remaining 70% of coupons were either wasted (Sure Things, Lost Causes) or actively harmful (Sleeping Dogs).

Uber $5 Promo Optimization — Finding the Persuadables

Business Context: Uber spends $200M+/year on rider promos (discounts, free rides, credits). The blunt approach: blast $5 credits to all riders who haven't ridden in 14+ days. Marketing suspects they're wasting money on two groups: (1) "sure things" who would have come back anyway, and (2) "lost causes" who won't return even with $5. They need to find the "persuadables" — users where the promo actually changes behavior.

Data Setup: Randomized promo experiment: 1M lapsed riders (14+ days since last ride), 50/50 split between $5 credit (treatment) and no credit (control). Features for the CATE model: (1) days since last ride, (2) lifetime ride count, (3) average trip value, (4) urban/suburban/rural, (5) last ride rating, (6) device type, (7) signup channel, (8) local competitor presence, (9) time since signup. Outcome: ride within 7 days (binary).

Methodology: Train a T-Learner (separate GBM models for treatment and control groups) to estimate τ(x) = E[Y|T=1,X=x] − E[Y|T=0,X=x] for each user. Validate using the "uplift curve" on a held-out test set: rank users by predicted CATE, then plot cumulative incremental rides vs. fraction of population targeted. Compare to a random-targeting baseline.

Results: Average treatment effect (ATE) across all users: +3.2pp ride probability. But the distribution is highly skewed:
• Top decile (persuadables): CATE = +12.4pp. Profile: suburban, 30–90 days lapsed, 10–30 lifetime rides, signed up via referral.
• Middle 60%: CATE = +2–4pp. Marginal ROI — promo barely covers its cost.
• Bottom 20%: CATE = −0.5pp to +0.5pp. "Sure things" (daily commuters, CATE ≈ 0 because they'd ride anyway) and "lost causes" (1-ride-and-done users).
• Surprise finding: 8% of users have negative CATE — the promo email actually reduces their ride probability (possibly because it reminds them they stopped using Uber for a reason).

Takeaway: By targeting only the top 30% by CATE, Uber achieves 78% of the total incremental rides at 30% of the promo cost. Annualized savings: $47M in promo spend. The negative-CATE finding led to removing those users from all promo campaigns (including email), reducing unsubscribe rates by 15%.

Intuit SKU Upsell Nudge — Who Benefits vs. Who Gets Annoyed?

Business Context: TurboTax shows an upgrade nudge ("Unlock Deluxe for $59 — maximize your deductions") to Free Edition users at the deductions section. The average take-rate is 5.2%. But product managers suspect the nudge might be hurting filing completion for users who don't need Deluxe — the interruption causes confusion or frustration. They want to personalize: show the nudge only to users where it helps, suppress it where it hurts.

Data Setup: Randomized experiment: 800K Free Edition users, 50% see the nudge, 50% see nothing. Features: (1) has Schedule C income, (2) number of deductions started, (3) estimated refund amount so far, (4) prior-year product tier, (5) time spent on return so far, (6) state, (7) filing status, (8) device, (9) entry channel. Two outcomes: (a) upgrade to Deluxe, (b) filing completion.

Methodology: X-Learner (more efficient than the T-Learner when treatment and control groups are imbalanced, as with the small upgraded segment here). Estimate CATE on both outcomes separately. Cross-validate with a DR-Learner (doubly robust) for robustness. Segment the CATE distribution into 4 groups using the "persuasion-annoyance" framework: High-Upgrade-CATE + Neutral-Completion-CATE = "Sweet Spot"; High-Upgrade + Negative-Completion = "Costly Conversion"; Low-Upgrade + Negative-Completion = "Just Annoying"; Low-Upgrade + Positive-Completion = "Ignore Nudge."

Results:
β€’ "Sweet Spot" (22% of users): Schedule C filers, multiple deduction types, CATE_upgrade = +14.8pp, CATE_completion = +0.3pp. They genuinely need Deluxe and the nudge helps them find the right product.
β€’ "Costly Conversion" (11%): Users with moderate complexity who upgrade but then face Deluxe's more detailed interview and abandon. CATE_upgrade = +8.2pp, CATE_completion = βˆ’3.1pp. Net revenue negative after accounting for lost completions.
β€’ "Just Annoying" (31%): Simple W-2 filers. CATE_upgrade = +1.1pp, CATE_completion = βˆ’2.4pp. The nudge interrupts their flow for minimal conversion.
β€’ "Ignore Nudge" (36%): CATE_upgrade = +0.4pp, CATE_completion = +0.1pp. No effect either way.

Takeaway: Showing the nudge only to "Sweet Spot" users: upgrade revenue stays at 83% of the blanket-nudge approach, but filing completion improves by 1.4pp overall (recovering the completions lost from the "Annoying" and "Costly" segments). Net impact: +$3.2M revenue per tax season. The "Costly Conversion" finding was especially important — it led to a new "Deluxe Preview" flow that lets users see what additional Deluxe questions they'd face before committing.

Python Example — T-Learner

# Uplift Modeling with T-Learner
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np, pandas as pd

def t_learner(X_train, treatment, y_train, X_pred):
    """T-Learner: separate models for treated and control."""
    # Model for treated
    m1 = GradientBoostingRegressor(n_estimators=100)
    m1.fit(X_train[treatment == 1], y_train[treatment == 1])

    # Model for control
    m0 = GradientBoostingRegressor(n_estimators=100)
    m0.fit(X_train[treatment == 0], y_train[treatment == 0])

    # CATE = E[Y|T=1,X] - E[Y|T=0,X]
    cate = m1.predict(X_pred) - m0.predict(X_pred)
    return cate

# Example: Uber promo targeting
np.random.seed(42)
n = 10000
X = pd.DataFrame({
    'days_since_last_ride': np.random.exponential(14, n),
    'lifetime_rides': np.random.poisson(20, n),
    'is_suburban': np.random.binomial(1, 0.4, n),
    'avg_trip_value': np.random.normal(15, 5, n),
})
treatment = np.random.binomial(1, 0.5, n)

# True CATE: high for lapsed suburban riders, low for frequent urban
true_cate = (0.02 + 0.08 * (X['days_since_last_ride'] > 14).astype(float)
             + 0.05 * X['is_suburban']
             - 0.001 * X['lifetime_rides'])
y = 0.3 + treatment * true_cate + np.random.normal(0, 0.1, n)

cate_hat = t_learner(X.values, treatment, y, X.values)
# Target top 20% CATE for promo
threshold = np.percentile(cate_hat, 80)
targeted = cate_hat >= threshold
print(f"Avg CATE (all):      {cate_hat.mean():.4f}")
print(f"Avg CATE (targeted): {cate_hat[targeted].mean():.4f}")
print(f"Avg CATE (others):   {cate_hat[~targeted].mean():.4f}")
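The uplift-curve validation described above — rank by predicted CATE, then measure incremental outcome as you target deeper into the population — can be sketched as follows (synthetic data and a simple threshold effect, purely illustrative):

```python
# Uplift-curve sketch: sort by predicted CATE and compute the cumulative
# incremental outcome at each targeting depth. A good model front-loads lift.
import numpy as np

def uplift_curve(cate_hat, treatment, y, n_bins=10):
    """Cumulative incremental outcome when targeting the top-k fraction."""
    order = np.argsort(-cate_hat)                 # highest predicted uplift first
    t_sorted, y_sorted = treatment[order], y[order]
    points = []
    for k in range(1, n_bins + 1):
        top = slice(0, int(len(y) * k / n_bins))
        t_top, y_top = t_sorted[top], y_sorted[top]
        lift = y_top[t_top == 1].mean() - y_top[t_top == 0].mean()
        points.append((k / n_bins, lift * top.stop))  # scale lift to group size
    return points

np.random.seed(0)
n = 20000
x = np.random.normal(0, 1, n)
treatment = np.random.binomial(1, 0.5, n)
true_cate = 0.1 * (x > 0)                         # only half the population responds
y = 0.3 + treatment * true_cate + np.random.normal(0, 0.2, n)
cate_hat = true_cate + np.random.normal(0, 0.02, n)  # noisy but informative scores

curve = uplift_curve(cate_hat, treatment, y, n_bins=5)
for frac, inc in curve:
    print(f"target top {frac:.0%}: cumulative incremental outcome = {inc:.0f}")
```

Because all responders sit in the top half, the curve flattens once targeting passes ~50% — the visual signature of wasted spend beyond the persuadables.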
Time Series

Bayesian Structural Time Series (BSTS) / CausalImpact

Build a Bayesian time series model of what WOULD have happened without intervention. The gap between predicted counterfactual and observed data is the causal effect.

Causal Impact
Effect(t) = Y_observed(t) − Y_counterfactual(t)
Y_counterfactual is predicted by a state-space model fit on pre-intervention data + control time series.

Statistical Perspective

BSTS (Brodersen et al., 2015 — Google's CausalImpact) models the treated unit's time series as a state-space model with three components: (1) local linear trend (level + slope, both evolving as random walks), (2) seasonal component (fixed or evolving), (3) regression on control time series (untreated units whose trajectories help predict the treated unit).

The model is fit on the pre-intervention period via Bayesian MCMC. Spike-and-slab priors on regression coefficients perform automatic variable selection among control series. Post-intervention, the model generates posterior predictive draws of the counterfactual. The causal effect at each time point is: actual − counterfactual. Because it's Bayesian, you get a full posterior distribution over the effect (not just a point estimate and CI), including P(effect > 0).

Advantages over synthetic control: (1) Handles trends and seasonality explicitly. (2) Provides time-varying effects (you see the effect emerge and decay). (3) Bayesian uncertainty quantification. Disadvantage: More model assumptions (distributional, structural); synthetic control is more "nonparametric."

Intuitive Perspective

Imagine you started running every morning and want to know if it improved your sleep. You have 6 months of sleep data before running and 2 months after. You also track your friend's sleep (who didn't start running) and the local temperature.

BSTS builds a prediction model: Using your pre-running sleep patterns, seasonal habits, and your friend's sleep (as a "control"), it learns to predict your sleep quality. Then it asks: "Based on everything I know, what WOULD your sleep have looked like these past 2 months if you hadn't started running?"

The counterfactual: The model might say "you would have slept 6.5 hours/night." You actually slept 7.2 hours. The gap (0.7 hours) is the estimated effect of running. Your friend's data is key — if everyone's sleep improved in spring (seasonal), the model accounts for that and doesn't credit it to your running.

Why Bayesian matters: Instead of saying "the effect is 0.7 hours ± 0.3," it says "there's a 96% probability the effect is positive, and the most likely range is 0.4–1.0 hours." This is much more useful for decision-making: "I'm 96% sure running helps my sleep" is a clearer statement than "p = 0.04."
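That "probability the effect is positive" number can be approximated even with a simple Gaussian predictive model. This sketch uses scikit-learn's `BayesianRidge` predictive standard deviation as a stand-in for full BSTS posterior sampling; the series and the +10 lift are simulated:

```python
# Approximate P(effect > 0): draw counterfactual trajectories from the model's
# Gaussian predictive distribution and compare with the observed post-period.
# BayesianRidge is a stand-in for a full BSTS model here.
import numpy as np
from sklearn.linear_model import BayesianRidge

np.random.seed(0)
weeks, cutoff = 30, 18
trend = np.linspace(100, 130, weeks)
control = 0.7 * trend + np.random.normal(0, 1, weeks)   # untreated control series
treated = trend + np.random.normal(0, 2, weeks)
treated[cutoff:] += 10                                   # true post-cutoff lift

model = BayesianRidge().fit(control[:cutoff, None], treated[:cutoff])
mu, sd = model.predict(control[cutoff:, None], return_std=True)

# Counterfactual draws, then the induced distribution over the average effect
draws = np.random.normal(mu, sd, size=(4000, weeks - cutoff))
effect_draws = treated[cutoff:] - draws
p_positive = (effect_draws.mean(axis=1) > 0).mean()
print(f"posterior mean effect = {effect_draws.mean():.1f}, "
      f"P(effect > 0) = {p_positive:.3f}")
```

A real BSTS fit would add the trend and seasonal states and sample regression coefficients too; the mechanic of turning posterior draws into P(effect > 0) is the same.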

Uber Chicago Billboard Campaign — Time Series Counterfactual

Business Context: Uber's brand marketing team ran a 4-week outdoor billboard campaign in Chicago ("Your Uber is 3 minutes away" with real-time ETAs on digital billboards). Total spend: $1.8M. The challenge: no pre-registered hold-out (marketing decided late to measure impact), so there's no clean control group. Uber needs to estimate what Chicago's ride volume would have been without the campaign.

Data Setup: 52 weeks of pre-campaign weekly ride data for Chicago + 20 control cities with no billboard campaign. Post-treatment: 4 campaign weeks + 4 post-campaign weeks. Control series used as regression predictors: Detroit, Milwaukee, Indianapolis, St. Louis, Minneapolis (selected for similar Midwest seasonality, population density, and ride patterns). Additional regressors: week-of-year seasonal dummies, Chicago temperature, Chicago events calendar.

Methodology: BSTS decomposes Chicago's ride series into: (1) local linear trend, (2) seasonal component (52-week cycle), (3) regression component (weighted combination of control cities). The model is fit on the 52 pre-campaign weeks via Bayesian MCMC (1000 posterior samples). For each posterior draw, predict the counterfactual for the 8 post-campaign weeks. The distribution of (actual − counterfactual) gives a full posterior over the causal effect, including uncertainty.

Results: During the 4-week campaign: actual = 138K rides/week, counterfactual posterior mean = 124.5K. Average weekly effect: +13.5K rides (posterior 95% CI: [8.8K, 18.2K]). Posterior probability of positive effect: 99.7%. During the 4 post-campaign weeks: the effect decays from +11K in week 1 to +3K in week 4 (half-life ≈ 2.3 weeks). Cumulative incremental rides: 82K. At $12 average Uber take per ride: $984K incremental revenue vs. $1.8M campaign cost. Short-term ROAS = 0.55 (negative).

Takeaway: The BSTS analysis showed the billboard campaign was unprofitable on direct ride revenue. But the decay curve was informative: awareness effects lasted only ~5 weeks. Marketing shifted budget from billboards to retargeting digital ads (where BSTS on a separate test showed ROAS = 2.3). The Bayesian framework was critical — unlike frequentist methods, it gave a full probability distribution over ROI, allowing the finance team to make risk-adjusted budget decisions.

Intuit Mid-Season "Snap & File" Feature Launch — Isolating the Impact

Business Context: TurboTax launched "Snap & File" (photograph your W-2 → auto-populated return → file in under 10 minutes) in week 4 of tax season. There's no control group — the feature is available to all users. The product team needs to separate the Snap & File lift from: (1) natural seasonal ramp-up, (2) marketing spend increases in January, (3) IRS processing timeline (refund delays affect filing volume). Simply comparing week 4 vs. week 3 conflates all these factors.

Data Setup: Outcome: daily new TurboTax filing starts (nationwide). Pre-treatment: 3 prior tax seasons (TY21–TY23) of daily data + the first 3 weeks of TY24. Control series: (1) H&R Block online starts (public investor data, weekly), (2) IRS total e-file receipts (weekly), (3) Google Trends index for "file taxes" (daily). These control series capture the "background" forces affecting all tax prep, not just TurboTax.

Methodology: BSTS with control series as regressors. The model learns that TurboTax's daily starts are a predictable function of: seasonal pattern (day-of-season), IRS e-file volume (macro readiness), and Google Trends (intent). Spike-and-slab priors on the regression coefficients perform automatic variable selection. Post-Snap-&-File launch, the model predicts the counterfactual (what TurboTax starts would have been with the same seasonal/macro conditions but no Snap & File).

Results: Actual cumulative starts through week 8: 2.35M. Counterfactual: 2.09M. Estimated causal impact of Snap & File: +260K incremental starts (95% CI: [195K, 325K]). The effect is concentrated in the first 2 weeks post-launch (early adopters who were waiting for an easier way to file) and then settles to a steady +15K/week above counterfactual. Decomposition: 70% of Snap & File users are new to TurboTax (acquisition), 30% are returning users who filed earlier than they otherwise would have (pull-forward, which the team verified doesn't cannibalize later-season volume).

Takeaway: 260K incremental starts × 72% completion rate × $85 average revenue = $15.9M incremental revenue from Snap & File in its first season. This justified the $4M engineering investment (4:1 ROI) and made the case for expanding the concept to 1099s and state returns. The BSTS control series were critical — without them, the team would have attributed seasonal ramp-up to the feature, overstating the impact by ~40%.

Python Example

# BSTS / CausalImpact-style Analysis
import numpy as np
from sklearn.linear_model import BayesianRidge

def causal_impact_simple(y_treated, y_controls, intervention_idx):
    """Simple CausalImpact: fit on pre-period, predict post counterfactual."""
    # Pre-period: fit model using control series to predict treated
    y_pre = y_treated[:intervention_idx]
    X_pre = y_controls[:, :intervention_idx].T
    X_post = y_controls[:, intervention_idx:].T

    model = BayesianRidge()
    model.fit(X_pre, y_pre)

    # Predict counterfactual for post-period
    y_cf, y_std = model.predict(X_post, return_std=True)
    y_actual = y_treated[intervention_idx:]

    point_effect = y_actual - y_cf
    cumulative_effect = np.cumsum(point_effect)

    return {
        'avg_effect': point_effect.mean(),
        'cumulative': cumulative_effect[-1],
        'ci_lower': (point_effect - 1.96*y_std).mean(),
        'ci_upper': (point_effect + 1.96*y_std).mean(),
        'counterfactual': y_cf,
        'actual': y_actual
    }

# Example: Uber billboard campaign in Chicago
np.random.seed(42)
weeks = 30
intervention = 18  # Campaign starts week 18
trend = np.linspace(100, 130, weeks)

# Control cities (no campaign)
detroit  = trend * 0.7 + np.random.normal(0, 3, weeks)
milwaukee = trend * 0.5 + np.random.normal(0, 2, weeks)
controls = np.array([detroit, milwaukee])

# Chicago (treated) — extra 13 rides/week post-campaign
chicago = trend + np.random.normal(0, 3, weeks)
chicago[intervention:] += 13  # True causal effect

result = causal_impact_simple(chicago, controls, intervention)
print(f"Avg weekly effect: {result['avg_effect']:.1f}K rides")
print(f"95% CI: [{result['ci_lower']:.1f}, {result['ci_upper']:.1f}]")
print(f"Cumulative: {result['cumulative']:.0f}K rides over post-period")
Experimental (Non-Compliance)

CACE — Complier Average Causal Effect

In experiments where not everyone complies (e.g., assigned to treatment but doesn't take it), CACE estimates the effect among those who actually would comply with their assignment.

CACE (Wald Estimator)
CACE = ITT / Compliance Rate = [E(Y|Z=1) − E(Y|Z=0)] / [E(T|Z=1) − E(T|Z=0)]
Z = assignment, T = actual treatment taken. CACE = ITT / first-stage effect.

Statistical Perspective

In experiments with non-compliance, there are four latent subpopulations (under the principal strata framework): Compliers (take treatment when assigned, don't when not), Always-takers (take treatment regardless), Never-takers (don't take regardless), and Defiers (do the opposite of assignment). Under the monotonicity assumption (no defiers), the ITT is a mixture: ITT = π_c · CACE + π_a · 0 + π_n · 0, where π_c is the complier fraction.

The Wald estimator CACE = ITT / π_c = ITT / (E[T|Z=1] − E[T|Z=0]) is algebraically equivalent to the IV estimator using Z as an instrument for T. This connects to the LATE (Local Average Treatment Effect) framework — CACE and LATE are the same concept.

Key point: CACE ≠ ATE. It's the effect specifically for compliers — which may differ from the effect on always-takers or never-takers. In one-sided non-compliance (control can't access treatment), always-takers don't exist, and the compliance rate simplifies to E[T|Z=1].
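The Wald/IV equivalence can be checked numerically. This sketch simulates one-sided non-compliance with an unobserved "motivation" confounder (all coefficients are made up); both estimators recover the true effect of 3 while sharing the same algebra:

```python
# Wald = 2SLS with one binary instrument: stage 1 regresses T on Z, stage 2
# regresses Y on the fitted T-hat. The slope equals ITT / compliance rate.
import numpy as np

np.random.seed(0)
n = 50000
Z = np.random.binomial(1, 0.5, n)                 # random assignment (instrument)
motivation = np.random.normal(0, 1, n)            # unobserved confounder
take_prob = 0.6 + 0.1 * (motivation > 0)          # compliance depends on motivation
T = Z * (np.random.uniform(0, 1, n) < take_prob)  # one-sided non-compliance
Y = 1.0 + 3.0 * T + 0.8 * motivation + np.random.normal(0, 1, n)  # true effect = 3

# Wald estimator: ITT on Y divided by ITT on T
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())

# Manual 2SLS
stage1 = np.polyfit(Z, T, 1)                      # stage 1: T ~ Z
t_hat = np.polyval(stage1, Z)
stage2 = np.polyfit(t_hat, Y, 1)                  # stage 2: Y ~ T-hat

print(f"Wald = {wald:.3f}, 2SLS slope = {stage2[0]:.3f}")
```

A naive regression of Y on actual T would be biased here, because motivation drives both uptake and the outcome; the instrument Z is what removes that bias.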

Intuitive Perspective

You mail a gym membership coupon to 1,000 people. Only 300 of them actually go to the gym (the other 700 never use the coupon). After 3 months, the 1,000 coupon-recipients lost an average of 1.5 lbs more than people who didn't get a coupon.

But that 1.5 lbs is diluted — it averages in the 700 people who never went! The 300 who actually used the gym probably lost much more. CACE says: 1.5 lbs / 30% compliance = 5 lbs for those who actually went.

Why not just compare gym-goers to non-goers? Because that's biased — gym-goers are the motivated ones who would have lost weight anyway (selection bias). CACE avoids this by only using the RANDOM variation in gym access (the coupon), scaled up to account for the people who ignored it.

Two useful numbers from one experiment: The ITT (1.5 lbs) answers "What happens if we mail coupons to everyone?" — useful for budgeting the mailing. The CACE (5 lbs) answers "How much does actually going to the gym help?" — useful for understanding the intervention's true potency and deciding whether to invest in activation.

Uber Driver Earnings Guarantee — Non-Compliance in a Marketplace Experiment

Business Context: Uber tested a new "Earnings Guarantee" program: drivers were promised a minimum of $25/hour during peak hours if they stayed online. The experiment randomly assigned 10K drivers to receive the offer (Z = 1). But only 62% of assigned drivers actually activated the guarantee (T = 1) — the rest didn't open the email, didn't understand the terms, or drove during off-peak hours. Uber can't force compliance. The ITT understates the program's potential; the per-protocol analysis (comparing only activators) is biased by self-selection (motivated drivers both activate and drive more hours regardless).

Data Setup: Z: random assignment to receive offer (10K treatment, 10K control). T: actually activated the guarantee (6,200 in treatment group, 0 in control — one-sided non-compliance). Y: weekly online hours in the 4 weeks post-assignment. Pre-period: 4 weeks of baseline hours for all drivers.

Methodology: CACE = ITT_Y / ITT_T. ITT_Y: assigned drivers drove 1.85 more hours/week than control (p = 0.003). ITT_T: assignment increased activation probability by 0.62 (62% in treatment vs. 0% in control). CACE = 1.85 / 0.62 = 2.98 hours/week. Interpretation: among "compliers" (drivers who would activate if offered but wouldn't otherwise), the guarantee increases driving by ~3 hours/week.

Results: ITT = +1.85 hrs/wk (the "policy effect" if you offer the program to everyone). CACE = +2.98 hrs/wk (the actual behavioral effect on drivers who engage). Per-protocol (naive): +4.1 hrs/wk (upward biased — activators are inherently more motivated). The CACE is the right number for: (1) cost-benefit analysis ($25/hr guarantee cost × 3 extra hours = $75/week per complier, generating ~$90/week in gross bookings → profitable), (2) deciding whether to invest in activation UX vs. the guarantee itself (since CACE is high, the bottleneck is activation, not the incentive's effectiveness).

Takeaway: The CACE analysis separated two problems: "Does the guarantee work for those who use it?" (yes, +3 hrs/wk) and "Can we get more people to use it?" (only 62% activated). The team invested in: (1) in-app push notifications instead of email (activation rose to 78% in the next test), (2) simplified terms ("Drive peak, earn at least $25/hr" instead of the legalistic original). The combined effect in the follow-up: ITT rose from 1.85 to 2.67 hrs/wk, purely from better compliance.

Intuit "Talk to an Expert" Prompt β€” Separating the Nudge from the Service

Business Context: TurboTax experimented with showing a "Talk to an Expert — Free 5-Minute Consultation" prompt at the deductions section (the point of highest abandonment). The experiment randomly assigned 200K users to see the prompt (Z = 1) or not. Of those who saw it, only 24% clicked through and completed a consultation (T = 1). Leadership wants to know two things: (1) What's the value of the prompt itself (ITT)? (2) What's the value of the actual expert consultation (CACE)? These answer different business questions — the prompt cost is nearly zero, but expert staffing costs $35/session.

Data Setup: Z: randomly assigned to see prompt (100K treatment, 100K control). T: actually completed a consultation (24K in treatment, ~200 in control via organic discovery — near-zero, treated as 0). Y: filing completion (binary). Compliance rate: ITT_T = 0.24.

Results: ITT on completion: +2.1pp (p = 0.001). Showing the prompt to everyone increases completion by 2.1pp, regardless of whether users click through. This includes a "reassurance effect" — just knowing help is available may reduce anxiety. CACE = 2.1 / 0.24 = 8.75pp. Among users who would actually consult an expert if prompted (compliers), the consultation increases completion by ~9pp.

Business Implications: The ITT (2.1pp) justifies showing the prompt to all users — it's free and lifts completion. The CACE (8.75pp) feeds the expert staffing ROI model: 8.75pp × $120 average revenue = $10.50 incremental revenue per consultation vs. $35 cost. That's negative ROI on completion alone! But including downstream retention (expert-consulted users return at 15pp higher rates) and word-of-mouth makes it +ROI over 2-year LTV. The CACE analysis prevented the team from either: (a) killing the program based on session-level ROI, or (b) over-scaling it based on the inflated per-protocol estimate of +18pp.

Python Example

# CACE — Complier Average Causal Effect
import numpy as np

def cace_estimate(assignment, treatment_taken, outcome):
    """Wald estimator for CACE under one-sided non-compliance."""
    # ITT: effect of assignment on outcome
    itt = outcome[assignment == 1].mean() - outcome[assignment == 0].mean()

    # First stage: effect of assignment on actual treatment
    compliance = (treatment_taken[assignment == 1].mean()
                  - treatment_taken[assignment == 0].mean())

    cace = itt / compliance
    return {'ITT': itt, 'compliance_rate': compliance, 'CACE': cace}

# Example: Intuit live expert prompt experiment
np.random.seed(42)
n = 6000
Z = np.random.binomial(1, 0.5, n)  # Random assignment to see prompt
# Only 25% of assigned users click through
T = Z * np.random.binomial(1, 0.25, n)  # One-sided non-compliance
# True effect of actually talking to expert: +8pp on completion
Y = np.random.binomial(1, np.clip(0.65 + 0.08 * T, 0, 1))

result = cace_estimate(Z, T, Y)
print(f"ITT (intent-to-treat): {result['ITT']:.4f}")
print(f"Compliance rate:       {result['compliance_rate']:.4f}")
print(f"CACE (compliers):      {result['CACE']:.4f}")  # ~0.08