All techniques from Uber's causal inference and mediation modeling blogs, with interactive examples applied at Uber and Intuit.
| Technique | Context | Article |
|---|---|---|
| CUPED (Variance Reduction) | Randomized Experiments | Causal Inference at Uber |
| Propensity Score Matching | Observational | Both articles |
| Inverse Probability Weighting (IPW) | Observational / Experiments | Causal Inference at Uber |
| Regression Discontinuity Design (RDD) | Observational (quasi-experiment) | Causal Inference at Uber |
| Difference-in-Differences (DiD) | Observational (panel) | Causal Inference at Uber |
| Synthetic Control | Observational (aggregate) | Causal Inference at Uber |
| Instrumental Variables (IV) | Observational (endogeneity) | Causal Inference at Uber |
| Mediation Analysis (ACME / ADE) | Mechanism decomposition | Mediation Modeling at Uber |
| Uplift Modeling / HTE | Heterogeneous effects | Causal Inference at Uber |
| Bayesian Structural Time Series (BSTS) | Time series counterfactual | Causal Inference at Uber |
| CACE (Complier Average Causal Effect) | Non-compliance in experiments | Causal Inference at Uber |
When you can randomize: CUPED for variance reduction, CACE for non-compliance, Uplift for heterogeneous treatment effects.
When you cannot randomize: PSM, IPW, RDD, DiD, Synthetic Control, IV; each handles a different source of confounding.
When you need to know why: Mediation analysis decomposes total effects into direct and indirect pathways.
When treatment is aggregate: BSTS with synthetic controls builds counterfactual time series to estimate impact.
Reduce variance in A/B tests by leveraging pre-experiment data, so you can detect smaller effects faster.
CUPED is a control variate method borrowed from Monte Carlo simulation. The key insight: if you have a covariate X correlated with your outcome Y, you can construct an adjusted outcome Ŷ = Y − θ(X − E[X]) that has lower variance than Y while preserving the same expectation (since E[X − E[X]] = 0, the adjustment is mean-zero).
The optimal θ = Cov(Y,X)/Var(X) minimizes Var(Ŷ). The resulting variance reduction is exactly ρ², where ρ is the Pearson correlation between X and Y. This is equivalent to regressing Y on X and using the residuals: the treatment effect estimate from comparing CUPED-adjusted means is identical to the coefficient on the treatment indicator in a regression of Y on [Treatment, X].
Why it works for causal inference: X is measured pre-randomization, so it's independent of treatment assignment. The adjustment doesn't introduce bias; it only removes predictable variation in Y, leaving the treatment signal cleaner.
Imagine you're comparing test scores between two classrooms after a new teaching method. Some students are naturally A students, others are C students. This natural variation makes it hard to see a small teaching improvement.
CUPED's trick: Before the experiment, you already know each student's GPA (the pre-experiment covariate). Instead of looking at raw test scores, you look at how much better or worse each student did compared to what you'd predict from their GPA. An A student scoring 92 is "normal"; an A student scoring 88 is "unusually low." A C student scoring 78 is "surprisingly good."
By stripping out the predictable part ("of course the A student scored high"), you're left with only the surprise, and the treatment effect shows up much more clearly in those surprises. The better you can predict baseline performance (higher correlation), the more noise you strip away.
The punchline: CUPED doesn't change what you measure or who you test. It just says "don't be impressed that good students scored high β tell me if they scored higher than expected."
Business Context: Uber's marketplace team is testing a new dispatch algorithm ("Smart Match v2") that considers driver-rider personality compatibility scores alongside ETA. The primary metric is ride completion rate. The challenge: ride completion varies enormously across rider segments: a power commuter completes 98% of rides, while a casual weekend user might cancel 30%. This heterogeneity inflates variance and makes it hard to detect the expected ~0.3pp improvement.
Data Setup: Pre-experiment covariate X = each rider's rolling 60-day ride completion rate (measured before randomization). Post-experiment outcome Y = completion rate during the 2-week test. The correlation between X and Y is ρ ≈ 0.82.
Methodology: For each rider, compute Ŷ_cv = Y − θ(X − X̄), where θ = Cov(Y,X)/Var(X). This strips out the "predictable" part of each rider's behavior, leaving only the experiment-driven variation. With ρ = 0.82, variance reduction = ρ² ≈ 67%.
Results: Raw analysis: ATE = 0.31pp, p = 0.18 (not significant at n = 400K after 2 weeks). CUPED-adjusted: ATE = 0.29pp, SE drops from 0.23 to 0.13, p = 0.026. The experiment reaches significance without extending the test or increasing traffic allocation.
Takeaway: CUPED barely moved the point estimate; it shrank the confidence interval. Without it, the team would have needed to run the test 3x longer or ship based on a directional-but-insignificant result. This is critical in Uber's marketplace, where longer experiments risk marketplace contamination.
Business Context: TurboTax is testing a redesigned W-2 import flow that uses OCR + auto-fill instead of manual entry. The primary metric is filing completion rate. Challenge: filing completion is heavily driven by user tax complexity: a simple single-W-2 filer completes at 85%, while a multi-income household with investments completes at 45%. This creates massive variance in the outcome metric.
Data Setup: Pre-experiment covariate X = each user's prior-year filing completion status (binary: completed vs. abandoned) plus number of forms filed last year. These are measured before the experiment, so they're immune to treatment contamination. ρ(X, Y) ≈ 0.74.
Methodology: Extend CUPED to multiple covariates: Ŷ_cv = Y − θ'(X − X̄), where θ is fit via OLS of Y on X in the control group. The multivariate version captures both the "did they finish last year" signal and the "how complex is their return" signal.
Results: Raw: ATE = 1.8pp, SE = 1.1pp, p = 0.10 (would need 2 more weeks). CUPED-adjusted: ATE = 1.7pp, SE = 0.64pp, p = 0.008. The team ships in week 1 with high confidence, catching the peak of tax season (critical timing β each week of delay costs ~$2M in potential revenue).
Takeaway: In tax, timing is everything. Filing season is ~12 weeks long. Running a 3-week experiment means shipping a winning feature for only 9 weeks. CUPED compressed the test to 1 week, giving 2 extra weeks of the improved experience during peak filing volume.
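The multi-covariate extension described in the methodology can be sketched in a few lines. This is an illustrative reconstruction, not Intuit's code; the simulated covariates and the 1.7pp effect only mimic the example above.

```python
import numpy as np

def cuped_adjust_multi(y, X_pre, control_mask):
    """CUPED with several pre-experiment covariates: fit the adjustment
    coefficients by OLS of Y on X in the control group, then residualize
    everyone against that prediction."""
    Xc = np.column_stack([np.ones(control_mask.sum()), X_pre[control_mask]])
    beta, *_ = np.linalg.lstsq(Xc, y[control_mask], rcond=None)
    X_all = np.column_stack([np.ones(len(y)), X_pre])
    # Add the control mean back so the adjusted metric keeps its scale
    return y - X_all @ beta + y[control_mask].mean()

# Simulated filers: completion driven by prior-year completion and form count
rng = np.random.default_rng(0)
n = 50_000
prior_completed = rng.binomial(1, 0.6, n).astype(float)
n_forms = rng.poisson(3, n).astype(float)
treat = rng.binomial(1, 0.5, n)
y = 20 + 40*prior_completed - 2*n_forms + 1.7*treat + rng.normal(0, 10, n)

X_pre = np.column_stack([prior_completed, n_forms])
y_adj = cuped_adjust_multi(y, X_pre, treat == 0)

ate_raw = y[treat == 1].mean() - y[treat == 0].mean()
ate_cuped = y_adj[treat == 1].mean() - y_adj[treat == 0].mean()
print(f"Raw ATE: {ate_raw:.2f}, CUPED ATE: {ate_cuped:.2f}")
print(f"Variance reduction: {1 - y_adj.var()/y.var():.1%}")
```

Fitting the coefficients on the control group only, as the methodology describes, avoids folding any of the treatment effect into the adjustment.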
```python
# CUPED Variance Reduction
import numpy as np

def cuped_adjust(y_post, x_pre):
    """Adjust post-experiment metric using pre-experiment covariate."""
    # Matching ddof so that theta = Cov(Y, X) / Var(X) exactly
    theta = np.cov(y_post, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    y_adjusted = y_post - theta * (x_pre - np.mean(x_pre))
    return y_adjusted

# Example: TurboTax filing completion experiment
np.random.seed(42)
n = 10000

# Pre-experiment: prior year filing progress (0-100%)
x_pre = np.random.normal(65, 20, n)

# Post-experiment: current year completion (correlated with pre)
treatment = np.random.binomial(1, 0.5, n)
true_effect = 2.0  # True ATE = 2 percentage points
y_post = 0.7 * x_pre + treatment * true_effect + np.random.normal(0, 15, n)

# Standard estimate (high variance)
ate_raw = y_post[treatment == 1].mean() - y_post[treatment == 0].mean()

# CUPED-adjusted estimate (lower variance)
y_adj = cuped_adjust(y_post, x_pre)
ate_cuped = y_adj[treatment == 1].mean() - y_adj[treatment == 0].mean()

print(f"Raw ATE: {ate_raw:.3f} (SE: {y_post.std()/np.sqrt(n/2):.3f})")
print(f"CUPED ATE: {ate_cuped:.3f} (SE: {y_adj.std()/np.sqrt(n/2):.3f})")
print(f"Variance reduction: {1 - y_adj.var()/y_post.var():.1%}")
```
Match treated and control units based on their probability of receiving treatment, to approximate a randomized experiment from observational data.
The fundamental problem: in observational data, treatment assignment depends on covariates X (confounders). The propensity score theorem (Rosenbaum & Rubin, 1983) states that if treatment assignment is strongly ignorable given X, i.e., (Y(0), Y(1)) ⊥ T | X, then it's also ignorable given the scalar e(X) = P(T=1|X).
This is a dimensionality reduction result: instead of matching on 20 covariates (curse of dimensionality), you match on a single number. The propensity score is a balancing score: within strata of e(X), the distribution of X is the same for treated and control units. Matching on e(X) thus creates approximate covariate balance.
After matching, you estimate the ATT (Average Treatment Effect on the Treated) by comparing matched outcomes. The quality of the estimate depends on: (1) correct specification of the propensity model, (2) sufficient overlap (common support), and (3) no unmeasured confounders. Sensitivity analysis (Rosenbaum bounds) quantifies how strong an unmeasured confounder would need to be to invalidate the results.
Think of it as finding your twin. You want to know if a drug works, but you can't run a trial. Some people took the drug (treatment) and some didn't (control). The problem: the people who took the drug are different β maybe they're sicker, older, or more health-conscious.
Step 1: For each person, compute a single number: "How likely was this person to take the drug, given their characteristics?" A 60-year-old diabetic might have a 70% chance; a healthy 25-year-old might have a 5% chance. That's the propensity score.
Step 2: Now pair up each drug-taker with a non-drug-taker who had the same likelihood of taking the drug. A 70%-propensity person who DID take it gets matched with a 70%-propensity person who DIDN'T. These "twins" are similar in all the ways that matter for the treatment decision.
Step 3: Compare outcomes within each pair. The average difference is your causal estimate. The intuition: if two people were equally likely to take the drug but only one did, the difference in their outcomes is more plausibly caused by the drug (not by their background characteristics).
Business Context: Uber Eats product leadership wants to quantify exactly how much a late delivery costs in customer lifetime value. They can't run an experiment (you can't randomly delay people's food). But natural delays happen constantly due to restaurant prep time variance, driver availability, and traffic. The question: does a 15-minute delay cause lower retention, or do low-retention customers just tend to order during high-delay periods (Friday dinner rush)?
Data Setup: 200K orders from Q3. Treatment: order delivered ≥15 min late vs. on-time. Covariates for propensity model: (1) user's historical order frequency, (2) cuisine category, (3) order time (hour + day-of-week), (4) restaurant-to-customer distance, (5) current marketplace utilization rate, (6) user tenure on platform, (7) average prior order value. Outcome: binary 30-day reorder.
Methodology: Fit logistic regression: P(delayed | covariates). Verify overlap: check that propensity score distributions for delayed and non-delayed orders have substantial common support (they do, between 0.08–0.65). Use 1:1 nearest-neighbor matching without replacement on the logit of the propensity score (caliper = 0.2 SD). Post-matching, verify covariate balance: all standardized mean differences < 0.05.
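A post-matching balance check of the kind described here can be sketched with standardized mean differences. The helper and the toy data are illustrative; a common rule of thumb treats |SMD| < 0.1 as balanced.

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(df, covariates, treatment_col):
    """Standardized mean difference per covariate:
    (mean_treated - mean_control) / pooled SD."""
    t = df[df[treatment_col] == 1]
    c = df[df[treatment_col] == 0]
    smd = {}
    for col in covariates:
        pooled_sd = np.sqrt((t[col].var() + c[col].var()) / 2)
        smd[col] = (t[col].mean() - c[col].mean()) / pooled_sd
    return pd.Series(smd)

# Toy data: distance is confounded with delay, order frequency is not
rng = np.random.default_rng(1)
n = 4000
delayed = rng.binomial(1, 0.4, n)
df = pd.DataFrame({
    "delayed": delayed,
    "distance_km": rng.normal(3 + 1.5*delayed, 1.0, n),
    "order_freq": rng.poisson(8, n),
})
smd = standardized_mean_diff(df, ["distance_km", "order_freq"], "delayed")
print(smd)  # distance_km badly imbalanced, order_freq roughly balanced
```

Running the same check on the matched sample (rather than the raw data, as here) is how the "< 0.05" claims in these examples would be verified.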
Results: Naive comparison: delayed orders have 11pp lower 30-day reorder (confounded: Friday rush orders are both more delayed AND from less loyal "try it once" users). After PSM: ATT = −4.1pp (95% CI: [−5.3, −2.9]). The confounding explained 7pp of the naive gap. At Uber Eats' scale, this 4pp translates to ~$180M/year in lost LTV.
Takeaway: The PSM result gave the operations team a concrete dollar figure to justify investing in delivery time reduction. It also revealed that the effect is non-linear: delays of 5–10 min have negligible impact, but the retention cliff is steep beyond 15 min, informing their SLA threshold design.
Business Context: TurboTax Live connects filers with CPAs/EAs for real-time help. Users who engage with Live complete at 89% vs. 71% for DIY-only. But there's massive self-selection: users who seek expert help tend to be more motivated, have more complex returns (meaning more at stake), and have higher income. The product team needs the causal effect to justify the cost of expert staffing (~$35/session).
Data Setup: 150K users in the TurboTax Live eligible population from TY24. Treatment: engaged with a live expert (at least one session). Covariates: (1) prior-year product tier, (2) number of tax forms, (3) AGI bracket, (4) filing status, (5) entry point (organic vs. paid), (6) time spent before first expert interaction, (7) state complexity score, (8) mobile vs. desktop.
Methodology: Propensity model: gradient-boosted classifier (logistic regression was insufficient due to non-linear interactions between AGI and form count). Match on propensity score with caliper = 0.1 SD. 1:3 matching (each Live user matched to 3 DIY users) to preserve power. Post-matching balance check: all standardized differences < 0.03. Sensitivity analysis via Rosenbaum bounds: results hold up to Γ = 1.8 (an unmeasured confounder would need to increase odds of treatment by 80% to explain away the effect).
Results: Naive gap: 18pp. After PSM: ATT = 8.2pp (95% CI: [6.1, 10.3]). Self-selection explains ~10pp of the raw gap. At an 8pp completion lift and $120 average revenue per completed return, each $35 expert session generates $9.60 in incremental filing revenue, recovering 27% of its cost before accounting for downstream retention and upsell.
Takeaway: Without PSM, leadership would have valued Live sessions at $21.60/session (based on the naive 18pp), over-investing in expert staffing. The corrected $9.60 figure changes the staffing model: prioritize expert availability for high-complexity filers (where the CATE is highest) rather than blanket availability.
```python
# Propensity Score Matching
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_score_match(df, covariates, treatment_col, outcome_col,
                           n_neighbors=1):
    """Match treated to control units on propensity score."""
    # Step 1: Estimate propensity scores
    lr = LogisticRegression(max_iter=1000)
    lr.fit(df[covariates], df[treatment_col])
    df['ps'] = lr.predict_proba(df[covariates])[:, 1]

    # Step 2: Match each treated unit to its nearest control
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric='euclidean')
    nn.fit(control[['ps']])
    distances, indices = nn.kneighbors(treated[['ps']])
    matched_control = control.iloc[indices.flatten()]

    # Step 3: Estimate ATT
    att = treated[outcome_col].mean() - matched_control[outcome_col].mean()
    return att, df['ps']

# Example: Uber Eats delivery delay -> reorder rate
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    'order_freq': np.random.poisson(8, n),
    'avg_distance': np.random.exponential(3, n),
    'peak_hour': np.random.binomial(1, 0.3, n),
})

# Delay is more likely for distant, peak-hour orders (confounded)
logit = -1 + 0.15*df['avg_distance'] + 0.8*df['peak_hour']
df['delayed'] = np.random.binomial(1, 1/(1+np.exp(-logit)))

# Outcome: 30-day reorder (delay has true -4pp effect)
df['reorder'] = np.random.binomial(1, np.clip(
    0.6 + 0.02*df['order_freq'] - 0.04*df['delayed'], 0, 1))

att, ps = propensity_score_match(
    df, ['order_freq', 'avg_distance', 'peak_hour'], 'delayed', 'reorder')
print(f"Estimated ATT: {att:.3f}")  # Should be close to -0.04
```
Re-weight observations to create a pseudo-population where treatment is independent of confounders. No need to discard unmatched units.
IPW creates a pseudo-population where treatment is independent of confounders. The Horvitz-Thompson estimator re-weights each observation: treated units get weight 1/e(X), control units get 1/(1−e(X)). This is equivalent to creating a survey-style weighted sample where the "sampling probability" is the probability of the treatment actually received.
Mathematically: E[TY/e(X)] = E[E[TY/e(X)|X]] = E[Y(1)·P(T=1|X)/e(X)] = E[Y(1)]. The propensity score in the denominator cancels the selection bias in the numerator. Stabilized weights (multiply by P(T)/e(X) instead of 1/e(X)) reduce variance without introducing bias. Weight trimming at extreme propensity scores (e.g., <0.05 or >0.95) trades small bias for large variance reduction.
Key advantage over matching: IPW uses ALL observations (no discarding), and the estimand is the ATE (not just ATT). It's also more naturally combined with regression (the "doubly robust" estimator: correct if EITHER the propensity model OR the outcome model is correct).
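The doubly robust combination mentioned above can be sketched as AIPW: predict each potential outcome with a regression model, then correct the predictions with inverse-probability-weighted residuals. A minimal illustration on simulated data, not code from the articles:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(y, t, X):
    """Augmented IPW (doubly robust): outcome-model predictions for both
    arms, corrected by inverse-probability-weighted residuals. Consistent
    if either the propensity model or the outcome model is correct."""
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.05, 0.95)
    mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
    mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / ps
                   - (1 - t) * (y - mu0) / (1 - ps))

# Simulated confounded data with a true treatment effect of -9
rng = np.random.default_rng(42)
n = 8000
x = rng.normal(0, 1, (n, 1))
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y = 50 + 10*x[:, 0] - 9*t + rng.normal(0, 5, n)
print(f"AIPW ATE: {aipw_ate(y, t, x):.2f}")  # close to the true -9
```

Here both nuisance models happen to be correctly specified; the point of AIPW is that getting either one right is enough.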
Imagine a university studying whether study groups help exam scores. Students self-select: 90% of pre-med students join groups, but only 20% of art students do. If you naively compare group vs. no-group, you're mostly comparing pre-med (group) to art (no-group) students.
IPW's fix: Give rare observations more weight. That art student who DID join a group? They're unusual and informative; weight them more heavily (1/0.20 = 5x). That pre-med student who joined? Expected and redundant; weight them less (1/0.90 = 1.1x). For the control side: the pre-med who DIDN'T join gets heavy weight (1/0.10 = 10x; they're rare and informative) while the art student who didn't join gets low weight (1/0.80 = 1.25x).
The effect: After re-weighting, the "group" and "no-group" populations look identical in composition β same mix of pre-med and art students. Any remaining difference in outcomes can be attributed to the group itself, not to who chose to join.
vs. Matching: Matching throws away unmatched people. IPW keeps everyone but turns the volume up on informative observations and down on redundant ones. It's like equalizing a survey where some groups are oversampled.
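The study-group arithmetic above can be checked with a quick simulation (the 90%/20% join rates come from the example; the population split and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
premed = rng.binomial(1, 0.5, n)              # half pre-med, half art
p_join = np.where(premed == 1, 0.9, 0.2)      # self-selection into groups
joined = rng.binomial(1, p_join)

# Unweighted, the study-group population is mostly pre-med
share_raw = premed[joined == 1].mean()

# IPW: joiners weighted 1/p_join, non-joiners 1/(1 - p_join)
w = np.where(joined == 1, 1/p_join, 1/(1 - p_join))
share_t = np.average(premed[joined == 1], weights=w[joined == 1])
share_c = np.average(premed[joined == 0], weights=w[joined == 0])
print(f"Pre-med share among joiners, raw: {share_raw:.2f}")
print(f"Weighted: joiners {share_t:.2f}, non-joiners {share_c:.2f}")
```

After weighting, both groups recover the population's 50/50 composition, which is exactly the "identical in composition" claim above.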
Business Context: Uber's pricing team needs to understand how surge multipliers affect ride request probability. They can't randomize surge (it's set by real-time supply/demand), and naive comparisons are hopelessly confounded: surge happens precisely when demand is already high, making it look like surge increases demand.
Data Setup: 500K ride request opportunities (app opens where a price was shown) across 30 cities over 8 weeks. Treatment: surge ≥ 1.5x. Covariates: (1) metro area, (2) hour-of-day × day-of-week, (3) weather conditions, (4) active drivers within 5 min, (5) local event flags (concerts, sports), (6) historical demand for that zone-hour. Outcome: binary ride request within 5 minutes.
Why IPW over PSM: PSM would discard ~40% of observations (poor overlap in extreme demand periods). IPW retains all observations, weighted to represent the population that could have plausibly experienced either surge or no-surge. Stabilized weights are used: w = P(T) / P(T|X) to reduce variance from extreme weights.
Results: Naive: surge zones show +5% ride requests (confounded: high demand drives both surge and requests). IPW-adjusted: surge ≥ 1.5x reduces ride requests by 18.3% (95% CI: [−21.1, −15.5]). The pricing team uses this elasticity curve to optimize the surge multiplier schedule, finding that the revenue-maximizing surge is 1.3x, not the 1.8x the algorithm was frequently setting.
Business Context: TurboTax ran an experiment on a new "Tax Timeline" dashboard. After 2 weeks, the experimentation platform flagged a sample ratio mismatch: treatment had 8% fewer users than expected. Investigation revealed that the new dashboard loaded 200ms slower on older Android devices, causing those users to bounce before the assignment event fired. Re-running would waste 2 weeks of peak tax season.
Data Setup: 300K users assigned (50/50). Treatment retained 138K; control retained 150K. The 12K "missing" treatment users are disproportionately: older Android, lower bandwidth, first-time filers. Covariates for dropout model: device age, OS version, connection speed, prior-year filing status, session start time.
Methodology: Model P(retained in sample | assigned to treatment, covariates) using logistic regression. Weight each remaining treatment user by 1/P(retained). This up-weights users similar to the ones who dropped (old Android, slow connection), reconstructing what the treatment group would have looked like without the telemetry gap. Verify: weighted covariate distributions match control group.
Results: Unweighted (biased): ATE = +3.1pp completion lift (inflated, because the dropped users were harder to convert). IPW-adjusted: ATE = +1.9pp (95% CI: [0.8, 3.0]). Still significant and shippable, but the correct effect size changes the projected revenue impact from $8M to $5M, affecting the prioritization of the fix for the performance regression.
```python
# Inverse Probability Weighting
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(y, treatment, X, clip_bounds=(0.05, 0.95)):
    """Estimate ATE using Horvitz-Thompson and stabilized IPW."""
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X, treatment)
    ps = lr.predict_proba(X)[:, 1]
    ps = np.clip(ps, *clip_bounds)  # Trim extreme weights

    # Horvitz-Thompson estimator
    w1 = treatment / ps
    w0 = (1 - treatment) / (1 - ps)
    ate = np.mean(w1 * y) - np.mean(w0 * y)

    # Stabilized weights (less variance)
    p_treat = treatment.mean()
    sw1 = treatment * p_treat / ps
    sw0 = (1 - treatment) * (1 - p_treat) / (1 - ps)
    ate_stab = (np.sum(sw1 * y) / np.sum(sw1)) - (np.sum(sw0 * y) / np.sum(sw0))
    return {'ate': ate, 'ate_stabilized': ate_stab, 'ps': ps}

# Example: Surge pricing effect on ride requests
np.random.seed(42)
n = 8000
demand_level = np.random.normal(0, 1, n)  # Confounder
surge = np.random.binomial(1, 1/(1+np.exp(-demand_level)))
rides = 50 + 10*demand_level - 9*surge + np.random.normal(0, 5, n)

result = ipw_ate(rides, surge, demand_level.reshape(-1, 1))
print(f"IPW ATE: {result['ate']:.2f}")  # True: -9
print(f"Naive diff: {rides[surge==1].mean()-rides[surge==0].mean():.2f}")  # Biased
```
Exploit a cutoff rule: units just above vs. just below a threshold are near-identical, creating a natural experiment at the boundary.
RDD exploits a known assignment rule: treatment is determined by whether a "running variable" X crosses a cutoff c. The identifying assumption is that all potential confounders are continuous at the cutoff: there's no reason someone at X = c−ε should be different from someone at X = c+ε in any way except their treatment status.
Formally: E[Y(0)|X=c] and E[Y(1)|X=c] are identified by the left and right limits of E[Y|X] at c. Any discontinuity in the outcome at c is attributed to the treatment. Estimation uses local linear regression within a bandwidth h of the cutoff, fitting separate slopes on each side. Bandwidth selection (Imbens-Kalyanaraman or Calonico-Cattaneo-Titiunik) balances bias (narrower = less bias) vs. variance (wider = more data).
Validity checks: (1) McCrary test: the density of X should be continuous at c (no bunching = no manipulation). (2) Covariate smoothness: other observable characteristics should not jump at c. (3) Robustness to bandwidth choice. The estimand is the LATE at the cutoff β it's only valid for units near the boundary, not the full population.
Think of a drinking age law. On your 20th birthday, you can't legally drink. On your 21st birthday, you can. A person who is 20 years and 364 days old is essentially identical to a person who is 21 years and 1 day old β same maturity, same life circumstances. The only difference is the legal cutoff.
If you see a sudden jump in, say, car accidents right at age 21, you can attribute it to legal drinking access β because nothing else changed discontinuously at that exact birthday.
In business terms: Any time there's a rule that says "if your score/value/metric crosses X, you get different treatment," and people can't precisely control their score, you have a natural experiment at the boundary. People just barely qualifying vs. just barely not qualifying are like a randomized experiment, but nature (or the algorithm) did the randomizing for you.
The limitation: You only learn the effect at the boundary. The effect of surge pricing at the 2.0x threshold tells you about riders experiencing ~2.0x surge; it may not generalize to 3.0x surge. You're looking through a narrow window, but what you see through it is very credible.
Business Context: Uber's pricing algorithm sets surge multipliers as a continuous function of local supply/demand. But the displayed price jumps at round numbers (1.5x, 2.0x, 2.5x). The behavioral economics team suspects round-number surge multipliers create a psychological "sticker shock" beyond the actual price difference. Testing this with an A/B test would require changing the pricing algorithm β risky in a live marketplace.
Data Setup: Running variable: the continuous demand-supply ratio that determines the surge multiplier. Cutoff: the ratio threshold at which surge crosses 2.0x. 180K ride request opportunities where the underlying ratio was within ±0.15 of the cutoff (bandwidth selection via Imbens-Kalyanaraman optimal bandwidth). Outcome: P(rider requests ride within 3 minutes of seeing the price).
Key Insight: Riders at 1.95x and 2.05x surge face nearly identical supply/demand conditions β they just happen to fall on different sides of a rounding boundary. This is as-good-as-random assignment near the cutoff.
Results: Local linear regression shows a discontinuous drop of 6.2pp (95% CI: [4.1, 8.3]) in ride request probability at exactly the 2.0x boundary. By contrast, the smooth price-demand slope is only −1.8pp per 0.1x surge increment. Conclusion: 4.4pp of the drop is a pure "round number" framing effect. This led Uber to test "1.9x" labeling for surges in the 1.9–2.1x range, recovering ~$12M/quarter in rides.
Business Context: TurboTax Free Edition is available to filers below a complexity score threshold (based on number of forms, income sources, and deduction types). Filers scoring above the threshold are routed to a paid SKU recommendation. Product wants to know: does the forced upgrade cause abandonment, or would complex filers have struggled anyway?
Data Setup: Running variable: TurboTax complexity score (0–100, continuous). Cutoff: score = 35 (Free Edition eligibility boundary). 400K filers within ±10 points of the cutoff from TY24. Outcomes: (1) filing completion rate, (2) paid conversion rate, (3) customer satisfaction score (post-filing NPS).
Methodology: Local linear regression within the bandwidth, allowing different slopes on each side. McCrary density test confirms no bunching at the cutoff (users can't precisely manipulate their complexity score). Robustness: results hold across bandwidths of 5, 8, 10, and 15.
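A crude stand-in for the McCrary test can be sketched by comparing observation counts in the bins on either side of the cutoff; the real test fits local polynomials to the density, but the bin-count version catches gross bunching. Simulated scores and an illustrative bin width:

```python
import numpy as np

def density_jump_check(running_var, cutoff, binwidth=1.0):
    """Crude bunching check: count observations in the bin just below vs.
    just above the cutoff. A ratio far from 1 suggests manipulation."""
    below = np.sum((running_var >= cutoff - binwidth) & (running_var < cutoff))
    above = np.sum((running_var >= cutoff) & (running_var < cutoff + binwidth))
    return below, above, below / above

# Simulated complexity scores: smooth density, no bunching at 35
rng = np.random.default_rng(7)
scores = rng.normal(35, 15, 200_000)
below, above, ratio = density_jump_check(scores, cutoff=35.0)
print(f"Just below: {below}, just above: {above}, ratio: {ratio:.2f}")
```

If users could game the score, you'd expect a pile-up just below 35 and a ratio well above 1.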
Results: At the cutoff, being nudged to a paid SKU causes: (1) −7.3pp filing completion (users who see a price wall after starting for free abandon at higher rates), (2) +22pp paid conversion among those who continue, (3) −12 NPS points. Net revenue impact: positive ($4.80 incremental revenue per filer at the cutoff), but the NPS hit is a long-term retention risk. This informed the design of a "soft upgrade" flow that previews paid features before the paywall.
```python
# Regression Discontinuity Design
import numpy as np
from sklearn.linear_model import LinearRegression

def rdd_estimate(running_var, outcome, cutoff, bandwidth):
    """Local linear RDD estimator with separate slopes on each side."""
    mask = np.abs(running_var - cutoff) <= bandwidth
    X_local = running_var[mask]
    Y_local = outcome[mask]
    T_local = (X_local >= cutoff).astype(float)
    X_centered = X_local - cutoff

    # Y = α + τ·T + β1·X + β2·T·X + ε
    design = np.column_stack([T_local, X_centered, T_local * X_centered])
    reg = LinearRegression().fit(design, Y_local)
    tau = reg.coef_[0]  # coefficient on T is the RDD effect
    return tau

# Example: Uber surge threshold at multiplier = 2.0
np.random.seed(42)
n = 2000
surge_mult = np.random.uniform(1.5, 2.5, n)  # Running variable
cutoff = 2.0
treated = (surge_mult >= cutoff).astype(float)

# Ride request probability drops by 12pp at the 2x threshold
ride_request = (0.7 - 0.1*(surge_mult - 1.5) - 0.12*treated
                + np.random.normal(0, 0.08, n))

tau = rdd_estimate(surge_mult, ride_request, cutoff, bandwidth=0.3)
print(f"RDD effect at cutoff: {tau:.3f}")  # True: -0.12
```
Compare the change over time in a treated group vs. a control group. The "difference of differences" removes time-invariant confounders.
DiD is a panel data method that controls for time-invariant unobserved confounders by differencing. The first difference (within-group, over time) removes unit-level fixed effects. The second difference (between groups) removes common time shocks. What remains is the treatment effect.
The regression form: Y_it = α + β₁·Treat_i + β₂·Post_t + τ·Treat_i×Post_t + ε_it. The coefficient τ on the interaction term is the DiD estimator. With panel data, include unit and time fixed effects: Y_it = α_i + γ_t + τ·D_it + ε_it, where D_it = 1 if unit i is treated at time t. Cluster standard errors at the unit level (the level of treatment assignment).
Parallel trends: The identifying assumption. In potential outcomes: E[Y(0)_T,post − Y(0)_T,pre] = E[Y(0)_C,post − Y(0)_C,pre]. It's untestable for the post-period (you can't observe the treated group's counterfactual), but you can check pre-treatment trends for parallelism. Event-study specifications test for pre-trends and visualize dynamic treatment effects.
Imagine a restaurant chain testing a new menu in 5 locations. Sales go up 20% in those 5 stores. Is that the new menu? Maybe, but it's also December, and ALL stores see holiday sales bumps.
DiD's trick: Look at the 15 stores that DIDN'T get the new menu. They went up 12% (the holiday effect). So the new menu's actual effect is 20% − 12% = 8%. You subtracted out the common trend that would have happened regardless.
The key assumption (parallel trends): Without the new menu, the 5 test stores would have followed the same 12% trajectory as the others. This is plausible if the stores were trending similarly before the change. If the test stores were already growing faster (maybe they're in booming neighborhoods), DiD over-attributes that growth to the menu.
When it's powerful: DiD shines when you have "before and after" data for both groups. It handles the biggest worry in observational studies (that the treatment and control groups are fundamentally different) by saying "I don't care if they're different in levels, only that they would have changed at the same rate."
Business Context: Uber redesigned its driver onboarding flow (streamlined document upload, in-app vehicle inspection scheduling, guaranteed first-week earnings). The rollout was staggered: Phase 1 cities (Austin, Nashville, Portland, Raleigh, Salt Lake City) launched in March; Phase 2 cities would launch in June. The 15 Phase-2 cities serve as the control group. The key question: does the new onboarding causally increase the 30-day driver activation rate (first trip completed within 30 days of signup)?
Data Setup: Monthly driver activation rates for 20 cities, spanning 12 months before launch (Jan–Dec) and 4 months after (Jan–Apr). Outcome: 30-day activation rate. Pre-treatment: 12 months of data to verify parallel trends.
Parallel Trends Verification: Plot pre-treatment trends for treated vs. control cities. They track closely (both rising ~0.8pp/month due to seasonal labor market tightening). Formal test: the interaction of treatment-group × time in the pre-period has no significant coefficients (p = 0.62). Also run a placebo test: pretend the treatment happened 6 months earlier; the "effect" is 0.1pp and insignificant.
Results: DiD regression with city and time fixed effects, clustered SEs at the city level: τ = +14.8pp (95% CI: [10.2, 19.4]). The new onboarding flow nearly doubled the activation rate, from a baseline of 18% to 33%. Staggered DiD (Callaway-Sant'Anna) confirms no dynamic treatment effect heterogeneity: the effect is immediate and stable.
Takeaway: This analysis was presented to the board as justification for accelerating the Phase 2 rollout. The guaranteed earnings component alone was estimated to account for ~60% of the effect (via a follow-up mediation analysis). Each incremental activated driver generates ~$1,200/month in gross bookings.
Business Context: California introduced a new refundable child tax credit ("CalKids") in TY24. Intuit's government affairs team wants to quantify whether this policy change causally increased TurboTax adoption in California, since TurboTax prominently supported the new credit in its interview flow. The business case: if state-specific tax law support drives adoption, it justifies investing engineering resources in rapid state conformity updates.
Data Setup: Treatment: California (new credit). Control: 10 comparable states without significant TY24 tax law changes (selected by: population size, prior TurboTax market share, income distribution, and filing season cadence). Outcome: weekly new TurboTax starts per capita. Pre-period: TY22-TY23 (2 years of weekly data). Post-period: TY24 filing season (Jan-Apr).
Methodology: Two-way fixed effects DiD with state and week fixed effects. Cluster SEs at the state level (only 11 clusters, so use a wild cluster bootstrap for valid inference). Event-study specification: estimate week-by-week treatment effects to visualize the timing of the effect - expect it to appear when TurboTax's CalKids marketing campaign launched (late January).
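A stripped-down version of the event-study logic: compute the treated-vs-control gap week by week, then normalize by the average pre-launch gap (the DiD step). This is a simulated sketch (one treated state, a hypothetical launch in week 10, a made-up +6 effect), not Intuit's data:

```python
# Event-study sketch: week-by-week DiD on simulated weekly state data
import numpy as np
import pandas as pd

np.random.seed(1)
launch_week = 10                           # hypothetical campaign launch
rows = []
for state in range(11):                    # 1 treated + 10 control states
    treat = int(state == 0)
    for week in range(20):
        effect = 6.0 if (treat and week >= launch_week) else 0.0
        rows.append({'state': state, 'week': week, 'treat': treat,
                     'y': 50 + 0.3 * week + effect + np.random.normal(0, 1)})
df = pd.DataFrame(rows)

# Per-week gap between treated and control states
gap = (df[df.treat == 1].groupby('week').y.mean()
       - df[df.treat == 0].groupby('week').y.mean())
# Normalize by the average pre-launch gap (the differencing step)
effects = gap - gap.iloc[:launch_week].mean()
for w in [7, 8, 9, 10, 12, 15]:
    print(f"week {w:2d}: {effects.loc[w]:+.2f}")   # ≈0 pre-launch, ≈+6 after
```

With only one treated state this point estimate is fine, but inference still needs the wild cluster bootstrap (or synthetic-control-style permutation) described above.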
Results: Event study shows: zero effect in pre-period (parallel trends validated), a sharp +6.2% lift in TurboTax starts starting the week the CalKids campaign launched, growing to +9.1% by late March. Overall DiD: +7.4% (95% CI: [4.8, 10.0]) more new TurboTax starts per capita in California vs. control states. This translates to ~85K incremental new customers and ~$6.8M in revenue attributable to the rapid CalKids support.
Takeaway: The ROI on the CalKids engineering sprint (3 engineers × 4 weeks = ~$120K cost) was 57:1. This created an internal playbook: for every major state tax law change, auto-prioritize a "first-mover" TurboTax integration to capture the adoption lift.
```python
# Difference-in-Differences
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Example: Uber new onboarding in treated cities
np.random.seed(42)
n_cities, n_periods = 20, 12
treated_cities = np.array([1]*5 + [0]*15)   # 5 treated, 15 control
treatment_period = 6

data = []
for city in range(n_cities):
    baseline = np.random.normal(50, 5)
    for t in range(n_periods):
        post = int(t >= treatment_period)
        treat = treated_cities[city]
        # Common trend + treatment effect of +15pp for treated post-period
        y = baseline + 2*t + 15*treat*post + np.random.normal(0, 3)
        data.append({'city': city, 't': t, 'treat': treat, 'post': post, 'y': y})

df = pd.DataFrame(data)

# DiD regression: Y = β0 + β1·Treat + β2·Post + β3·Treat×Post + ε
df['treat_post'] = df['treat'] * df['post']
X = sm.add_constant(df[['treat', 'post', 'treat_post']])
model = sm.OLS(df['y'], X).fit(cov_type='cluster', cov_kwds={'groups': df['city']})
print(model.summary().tables[1])  # treat_post coefficient ≈ 15 (the causal DiD effect)
```
When you treat one unit (city, market), construct a weighted combination of untreated units that best matches the treated unit's pre-treatment trajectory. The gap post-treatment is the causal effect.
Synthetic control solves a constrained optimization: find non-negative weights w₁, ..., w_J (summing to 1) over J donor units that minimize the pre-treatment MSPE: Σₜ (Y₁ₜ − Σⱼ wⱼ Yⱼₜ)² for t in the pre-period. This is a convex program (with a unique solution when the donor trajectories are not collinear). The resulting weighted combination is the "synthetic" unit - a data-driven counterfactual.
Inference is non-standard (n=1 treated unit). The standard approach is permutation inference (placebo tests): apply the method to every donor unit as if IT were the treated one, compute each placebo's post/pre RMSPE ratio, and see where the treated unit ranks. If it ranks 1st out of 26 units, the p-value is 1/26 ≈ 0.038. This is an exact test that makes no distributional assumptions.
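The placebo procedure can be sketched directly. This simulation assumes a simple random-walk donor pool and a made-up +5 post-period effect on the treated unit; `fit_weights` re-solves the same constrained least-squares problem for each placebo:

```python
# Permutation (placebo) inference for synthetic control - illustrative sketch
import numpy as np
from scipy.optimize import minimize

def fit_weights(target_pre, donors_pre):
    """Convex weights over donors minimizing the pre-period MSPE."""
    J = donors_pre.shape[0]
    obj = lambda w: np.mean((target_pre - w @ donors_pre) ** 2)
    res = minimize(obj, np.ones(J) / J, bounds=[(0, 1)] * J,
                   constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1})
    return res.x

def rmspe_ratio(unit_pre, unit_post, donors_pre, donors_post):
    w = fit_weights(unit_pre, donors_pre)
    pre = np.sqrt(np.mean((unit_pre - w @ donors_pre) ** 2))
    post = np.sqrt(np.mean((unit_post - w @ donors_post) ** 2))
    return post / pre

np.random.seed(7)
T_pre, T_post, J = 30, 10, 12
donors_pre = 80 + np.cumsum(np.random.normal(0.5, 1, (J, T_pre)), axis=1)
donors_post = donors_pre[:, -1:] + np.cumsum(
    np.random.normal(0.5, 1, (J, T_post)), axis=1)
# Treated unit: tracks the donor pool pre-period, then a +5 treatment effect
treated_pre = donors_pre.mean(axis=0) + np.random.normal(0, 0.5, T_pre)
treated_post = donors_post.mean(axis=0) + 5 + np.random.normal(0, 0.5, T_post)

# Treated unit's post/pre RMSPE ratio vs. every donor's placebo ratio
others = lambda i: [j for j in range(J) if j != i]
ratios = [rmspe_ratio(treated_pre, treated_post, donors_pre, donors_post)]
for i in range(J):
    ratios.append(rmspe_ratio(donors_pre[i], donors_post[i],
                              donors_pre[others(i)], donors_post[others(i)]))
rank = 1 + sum(r > ratios[0] for r in ratios[1:])
print(f"treated rank: {rank} of {J + 1}; p = {rank / (J + 1):.3f}")
```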
Extensions: Augmented SC (adds an outcome model for bias correction), Penalized SC (regularization prevents overfitting to pre-period noise), and SC + DiD hybrids (re-weight AND de-mean, relaxing the "perfect pre-fit" requirement).
You renovated YOUR house and want to know if it increased your property value. There's no control version of your house. But your neighbors didn't renovate. The problem: no single neighbor's house is a perfect comparison - one is bigger, another is on a busier street, another was recently painted.
Synthetic control's idea: Build a "Frankenstein neighbor" - a weighted mix of actual neighbors that, BEFORE your renovation, had the same price trajectory as your house. Maybe 40% of Neighbor A + 35% of Neighbor B + 25% of Neighbor C tracks your house's price history almost perfectly.
After the renovation, your house's value diverges from this synthetic neighbor's. That gap is the renovation effect. If the synthetic neighbor would have predicted $500K and your house is now $540K, the renovation added ~$40K.
Why it's credible: You're not assuming any single comparison is good - you're letting the data find the best combination. And you can check: does the synthetic version actually track you well in the pre-period? If it doesn't, you know the method won't work here.
Business Context: Uber launched in-app tipping in Austin as a single-market pilot. The goal: measure the causal effect on driver supply (online hours per week). A/B testing is impossible - tipping is a market-level feature that affects all participants. And no single city perfectly matches Austin's marketplace dynamics (tech-heavy, college town, unique regulatory history from the 2016 Uber ban).
Data Setup: Treated unit: Austin. Donor pool: 25 mid-size US cities without tipping. Pre-treatment period: 40 weeks. Post-treatment: 12 weeks. Outcome: weekly driver online hours (per 1K population). Matching predictors: average fare, driver-to-rider ratio, weekday/weekend split, temperature, competitor presence (Lyft share), population growth rate.
Methodology: Optimize convex weights over the donor pool to minimize pre-treatment MSPE (mean squared prediction error) for Austin's driver hours series. Result: Synthetic Austin = 0.38 × Houston + 0.31 × Denver + 0.22 × Nashville + 0.09 × San Antonio. Pre-treatment fit: RMSPE = 1.2 hours (vs. Austin's mean of 85 hours/week/1K pop). Inference: permutation test - apply the same method to every donor city (pretending each was treated) and compute the ratio of post/pre RMSPE. Austin's ratio ranks 1st of 26 (p = 0.038).
Results: Post-tipping, Austin's actual driver hours diverge from synthetic Austin by +5.8 hours/week/1K pop (a 6.8% increase). The effect appears in week 2 and stabilizes by week 5. Placebo tests: no other city shows a comparable post-period gap. Back-of-envelope: 6.8% more driver supply → 2.1% lower average wait times → an estimated $3.4M/year in retained rides for the Austin market alone.
Takeaway: This analysis justified the nationwide tipping rollout. Critically, the synthetic control also revealed that the driver supply increase came from existing drivers working longer hours (intensive margin), not new drivers joining (extensive margin) - informing the driver engagement team's messaging strategy.
Business Context: Intuit's marketing team ran a 6-week TV campaign in the Dallas-Fort Worth DMA during tax season, spending $4.2M on spots during local news and prime time. They need to isolate the causal effect of TV advertising on new TurboTax customer acquisitions, but TV campaigns are DMA-level treatments with no within-DMA control group.
Data Setup: Treated: Dallas DMA. Donor pool: 30 DMAs where Intuit ran no TV ads during this period. Pre-treatment: 20 weeks (prior tax season + early current season). Post-treatment: 6 weeks of campaign + 4 weeks after. Outcome: weekly new TurboTax starts per 100K households. Predictors: median household income, broadband penetration, competitor (H&R Block) store density, prior-year TT market share, Hispanic population % (Dallas is a majority-minority DMA).
Results: Synthetic Dallas = 0.41 × Phoenix + 0.33 × Atlanta + 0.18 × Charlotte + 0.08 × Tampa. Pre-treatment RMSPE = 2.3 starts/100K (excellent fit). During the 6-week campaign, Dallas outperforms its synthetic by an average of 14.2 starts/100K/week (+11.8%). In the 4 weeks post-campaign, the gap shrinks to +4.1 starts/100K/week (a "decay" effect suggesting ad recall fading). Total incremental starts: ~38K. At $110 average revenue per new customer: $4.18M revenue vs. $4.2M spend - essentially breakeven on first-year revenue, but positive on LTV accounting for multi-year retention.
Takeaway: The synthetic control analysis changed the marketing team's strategy: TV campaigns are breakeven in year 1 but profitable over a 3-year LTV. More importantly, the post-campaign decay curve informed optimal flight scheduling - the team shifted from one 6-week burst to two 3-week flights with a 2-week gap, increasing sustained awareness while maintaining the same budget.
```python
# Synthetic Control Method
import numpy as np
from scipy.optimize import minimize

def synthetic_control(treated_pre, control_pre, control_post):
    """Find weights that best match the treated unit's pre-treatment trajectory."""
    n_controls = control_pre.shape[0]

    def objective(w):
        synth = w @ control_pre
        return np.sum((treated_pre - synth)**2)

    # Constraints: weights sum to 1, each in [0, 1]
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}
    bounds = [(0, 1)] * n_controls
    w0 = np.ones(n_controls) / n_controls
    result = minimize(objective, w0, bounds=bounds, constraints=constraints)
    synth_post = result.x @ control_post
    return result.x, synth_post

# Example: Uber tipping feature in Austin
np.random.seed(42)
T_pre, T_post = 12, 8

# Austin (treated) - rides per week
austin_pre = 100 + np.cumsum(np.random.normal(1, 2, T_pre))
austin_post = austin_pre[-1] + np.cumsum(np.random.normal(1.5, 2, T_post)) + 7  # +7 treatment

# Donor pool
controls_pre = np.array([
    95 + np.cumsum(np.random.normal(1.1, 2, T_pre)),    # Houston
    80 + np.cumsum(np.random.normal(0.9, 2.5, T_pre)),  # Denver
    70 + np.cumsum(np.random.normal(1.0, 1.8, T_pre)),  # Nashville
])
controls_post = np.array([
    controls_pre[0, -1] + np.cumsum(np.random.normal(1.1, 2, T_post)),
    controls_pre[1, -1] + np.cumsum(np.random.normal(0.9, 2.5, T_post)),
    controls_pre[2, -1] + np.cumsum(np.random.normal(1.0, 1.8, T_post)),
])

weights, synth_post = synthetic_control(austin_pre, controls_pre, controls_post)
effect = austin_post.mean() - synth_post.mean()
print(f"Weights: {dict(zip(['Houston', 'Denver', 'Nashville'], weights.round(2)))}")
print(f"Estimated effect: {effect:.1f} rides/week")  # True effect: ~7
```
When treatment is endogenous (confounded in unobservable ways), find an "instrument" - a variable that affects the outcome ONLY through its effect on treatment.
IV addresses endogeneity: when Cov(T, ε) ≠ 0 in the outcome equation Y = τT + ε (unmeasured confounders in ε correlate with treatment T), OLS is biased. The instrument Z provides exogenous variation in T - variation uncorrelated with ε.
2SLS: Stage 1 regresses T on Z (and covariates) to get T̂ - the "clean" part of treatment driven only by Z. Stage 2 regresses Y on T̂. The coefficient on T̂ is consistent for the causal effect because T̂ inherits Z's exogeneity. The Wald estimator (for binary Z) simplifies to: τ = Cov(Y,Z)/Cov(T,Z) = [E(Y|Z=1) − E(Y|Z=0)] / [E(T|Z=1) − E(T|Z=0)].
Estimand: Under heterogeneous effects, IV estimates the LATE (Local Average Treatment Effect) - the effect for "compliers" (those whose treatment status is shifted by the instrument). This is NOT the ATE. Weak instruments (low first-stage F) cause 2SLS to be biased toward OLS. Rule of thumb: F > 10 (Stock & Yogo). Modern practice: use Anderson-Rubin confidence sets, which remain valid even with weak instruments.
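For a binary instrument, the Wald form above is just two differences in means. A minimal simulated check (all coefficients are made up; the true causal effect is set to 2.0):

```python
# Wald estimator for a binary instrument (simulated example)
import numpy as np

np.random.seed(3)
n = 100_000
Z = np.random.binomial(1, 0.5, n)               # instrument
U = np.random.normal(0, 1, n)                   # unobserved confounder
# First stage: Z shifts treatment; U also pushes treatment up
T = 0.2 + 0.3 * Z + 0.2 * U + np.random.normal(0, 0.5, n)
# Outcome: true causal effect of T is 2.0; U confounds naive OLS
Y = 1.0 + 2.0 * T + 1.5 * U + np.random.normal(0, 1, n)

naive = np.cov(T, Y)[0, 1] / np.var(T)          # OLS slope, biased upward by U
wald = ((Y[Z == 1].mean() - Y[Z == 0].mean())
        / (T[Z == 1].mean() - T[Z == 0].mean()))
print(f"naive OLS: {naive:.2f}")                # well above 2 (confounded)
print(f"Wald IV:   {wald:.2f}")                 # close to 2.0
```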
You want to know if exercise causes weight loss. But people who exercise also eat better, sleep more, and are generally more health-conscious. You can't separate the effect of exercise from these other factors.
The instrument: Find something that makes people exercise MORE (or LESS) but doesn't directly affect weight through any other channel. Example: living near a gym. People near gyms exercise more (relevance), but living near a gym doesn't directly change your metabolism or diet (exclusion restriction). It only affects your weight by getting you to the gym more often.
The logic: Compare weight outcomes for people near vs. far from gyms. Any weight difference must flow through the exercise channel (since distance doesn't affect weight otherwise). Divide the weight difference by the exercise difference to get the causal effect of exercise on weight.
The catch: You're only measuring the effect for people whose behavior actually changes based on gym proximity ("compliers"). Gym rats will exercise regardless; couch potatoes won't go even if the gym is next door. IV tells you about the "marginal" people in between - which is often exactly the policy-relevant group.
Business Context: Uber's retention team observes that riders who experience longer wait times have lower 30-day retention. But this is classic endogeneity: impatient riders (who are also more likely to churn for other reasons) tend to request rides during peak hours when wait times are longest. Patience, mood, and lifestyle are unmeasured confounders affecting both the "treatment" (wait time) and the outcome (retention). An experiment that deliberately delays pickups is unethical and would cause immediate rider loss.
The Instrument - Rain: Sudden rainfall reduces driver supply (fewer drivers want to drive in bad weather), which increases wait times. Critically, rain is plausibly exogenous to a rider's underlying loyalty: it doesn't directly make you like or dislike Uber (exclusion restriction). Rain → ↓ Driver Supply → ↑ Wait Time → ? Retention. The only channel through which rain affects retention is via wait time.
Data Setup: 1.2M ride requests across 15 cities over 6 months. Instrument Z: binary indicator for unexpected rainfall (actual precipitation exceeds forecast by ≥0.5 inches, within ±2 hours of the ride request). Treatment T: actual wait time (continuous, minutes). Outcome Y: binary 30-day retention (≥1 ride in the next 30 days). Covariates: city, day-of-week, hour, rider tenure.
First Stage: Rain → Wait Time: coefficient = +3.8 minutes (F-stat = 847, far above the weak-instrument threshold of 10). Unexpected rain adds ~4 minutes to average wait times.
Results: Naive OLS: each +1 minute of wait → −0.4pp retention (biased toward zero because impatient people both wait less AND churn more). IV/2SLS: each +1 minute of wait → −1.2pp retention (3x larger than OLS). The true causal harm of wait time was being masked by confounding. At Uber's scale, reducing average wait times by 1 minute would retain ~180K additional monthly active riders, worth ~$43M/year in gross bookings.
Robustness: Over-identification test using a second instrument (local sports events that pull drivers to stadium areas). Both instruments give consistent estimates (Hansen J p = 0.34). Exclusion-restriction concern: rain might directly affect rider mood and hence retention. Sensitivity analysis shows the results hold unless rain has a direct effect on retention of ≥0.8pp (implausible for a single rainy ride).
Business Context: TurboTax Live lets users connect with a CPA. Users who engage complete filing at 89% vs. 71% for non-engagers. But self-selection is severe: motivated users with high-stakes returns both seek help AND try harder to complete. OLS with observed covariates can't fully control for motivation, tax anxiety, or "amount at stake" (partially unobservable).
The Instrument - Queue Wait Time: Expert availability varies quasi-randomly throughout the day based on expert shift schedules, concurrent session load, and timezone coverage gaps. A user who clicks "Talk to Expert" at 2:47 PM PST might face a 2-minute wait; the same user clicking at 2:52 PM might face a 25-minute wait. This variation in queue time affects whether users actually connect (relevance) but is plausibly unrelated to a user's underlying filing ability or motivation (exclusion).
Data Setup: 90K users who clicked the "Talk to Expert" button (intent-to-treat population). Instrument Z: expert queue wait time at the moment of click (minutes). Treatment T: binary, actually completed a consultation (users who face long queues abandon the queue). Outcome Y: filing completion. Covariates: time-of-day, day-of-week, product tier, form complexity.
First Stage: Each +5 min of queue wait → −8.3pp probability of completing the consultation (F-stat = 312). Users are very sensitive to queue times: 40% abandon after 10 minutes.
Results: OLS: expert consultation → +18pp completion (severely upward biased). IV/2SLS: +11.2pp (95% CI: [7.8, 14.6]). The true causal effect is 11pp - still large, but 40% of the naive estimate was self-selection bias. This recalibrated the ROI model for expert staffing: at $35/session and 11pp incremental completion ($13.20 incremental revenue), the margin is thinner than the OLS-based model suggested. It shifted investment toward reducing queue times (which the first stage showed have a huge effect on uptake) rather than simply hiring more experts.
```python
# Instrumental Variables - 2SLS
import numpy as np
from sklearn.linear_model import LinearRegression

def two_stage_ls(y, treatment, instrument, covariates=None):
    """Two-Stage Least Squares IV estimator."""
    Z = instrument.reshape(-1, 1) if instrument.ndim == 1 else instrument
    if covariates is not None:
        Z = np.column_stack([Z, covariates])

    # Stage 1: regress treatment on instrument
    stage1 = LinearRegression().fit(Z, treatment)
    t_hat = stage1.predict(Z)
    first_stage_f = (stage1.score(Z, treatment) * len(y)) / Z.shape[1]  # rough F approximation

    # Stage 2: regress outcome on predicted treatment
    X2 = t_hat.reshape(-1, 1)
    if covariates is not None:
        X2 = np.column_stack([X2, covariates])
    stage2 = LinearRegression().fit(X2, y)
    return {'iv_estimate': stage2.coef_[0], 'first_stage_f': first_stage_f}

# Example: rain → driver supply → wait time → retention
np.random.seed(42)
n = 5000
rain = np.random.binomial(1, 0.3, n)            # Instrument
unobs_loyalty = np.random.normal(0, 1, n)       # Unobserved confounder
wait_time = 5 + 4*rain + 1*unobs_loyalty + np.random.normal(0, 2, n)        # Treatment
retention = 80 - 1.2*wait_time + 5*unobs_loyalty + np.random.normal(0, 3, n)

# Naive OLS (biased - omits loyalty)
naive = LinearRegression().fit(wait_time.reshape(-1, 1), retention)
print(f"Naive OLS: {naive.coef_[0]:.2f}")       # Biased toward 0 by the confounder

# IV estimate (consistent)
iv = two_stage_ls(retention, wait_time, rain)
print(f"IV estimate: {iv['iv_estimate']:.2f}")  # Close to -1.2
```
Decompose a total effect into the indirect effect (through a mediator) and the direct effect (everything else). Answers "WHY did this treatment work?"
Mediation analysis decomposes the total effect into Natural Direct Effect (NDE/ADE) and Natural Indirect Effect (NIE/ACME) using the potential outcomes framework. Define M(t) as the mediator value under treatment t, and Y(t, m) as the outcome under treatment t and mediator m. Then:
NIE = E[Y(1, M(1)) − Y(1, M(0))]: what happens if you "flip" the mediator to its treatment value while holding treatment constant. NDE = E[Y(1, M(0)) − Y(0, M(0))]: the treatment effect if the mediator were held at its control value.
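These definitions can be verified in a toy simulation where both potential mediator values M(0), M(1) are visible - a hypothetical linear data-generating process with made-up coefficients:

```python
# NDE / NIE computed from fully observed potential outcomes (illustrative)
import numpy as np

np.random.seed(5)
n = 200_000
# Mediator potential values: treatment raises the mediator by 0.9
M0 = np.random.normal(4.0, 1.0, n)
M1 = M0 + 0.9

# Outcome model Y(t, m); the noise is shared across (t, m) settings
# because it belongs to the unit, not to the treatment assignment
noise = np.random.normal(0, 0.1, n)
def Y(t, m):
    return 0.3 + 0.02 * t + 0.03 * m + noise

nie = (Y(1, M1) - Y(1, M0)).mean()   # flip the mediator, hold treatment at 1
nde = (Y(1, M0) - Y(0, M0)).mean()   # flip treatment, hold mediator at M(0)
total = (Y(1, M1) - Y(0, M0)).mean()
print(f"NIE   = {nie:.4f}")          # 0.9 * 0.03 = 0.027
print(f"NDE   = {nde:.4f}")          # 0.02
print(f"Total = {total:.4f}")        # NIE + NDE = 0.047
```

Because the unit-level noise cancels in each contrast, the simulation recovers the structural values exactly; real data never shows both M(0) and M(1), which is why the identification assumptions below matter.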
Sequential ignorability (Imai et al.): requires (1) treatment is randomly assigned (or ignorable conditional on covariates), AND (2) the mediator is ignorable conditional on treatment and pre-treatment covariates. Condition (2) is extremely strong: it rules out ANY post-treatment confounders of the mediator-outcome relationship. Sensitivity analysis (varying ρ, the correlation between the mediator and outcome error terms) is essential.
In the linear case with no interaction: ACME = â × b̂ (the product of coefficients). With interaction: the ACME depends on treatment status. Bootstrap CIs are preferred over Sobel's test (which assumes normality of the product).
A new drug reduces blood pressure. Great - but why? Does it relax blood vessels directly? Or does it reduce stress hormones, which then lower blood pressure? Knowing the "why" matters: if it works entirely through stress hormones, you might find a better (cheaper, fewer side effects) way to target those hormones directly.
Mediation's question: How much of Drug → Blood Pressure goes through Drug → Stress Hormones → Blood Pressure (the indirect path) vs. Drug → Blood Pressure directly?
How it works: (1) Measure how much the drug changes stress hormones (path a). (2) Measure how much stress hormones affect blood pressure, controlling for the drug (path b). (3) Multiply: a × b = the indirect effect (ACME). (4) Whatever's left over from the total effect is the direct effect (ADE).
In business: Your A/B test won. Mediation tells you which feature of the change drove the win. This is the difference between "the redesign worked, do more redesigns" (vague) and "65% of the win came from the ETA display, invest there specifically" (actionable). It turns experiment results into a product roadmap.
Business Context: Uber's rider app team shipped a major home screen redesign that increased ride request rate by 4.2pp in an A/B test. Leadership is deciding where to allocate the next quarter's engineering resources. The redesign had two major changes: (1) a new real-time ETA display with animated driver icons, and (2) a simplified, one-tap ride request flow. Which change deserves credit? Building more ETA features is a different roadmap than simplifying UX.
Data Setup: 500K users in the A/B test. Treatment T: new home screen (binary). Mediator M: user's perceived wait time, measured via a post-request survey ("How long do you think your pickup will take?", collected for 30% of rides). Outcome Y: binary ride request within the session. Key: the mediator is measured AFTER treatment assignment but BEFORE the outcome (the survey fires when a user opens the app but before they decide to request).
Methodology: Causal mediation analysis using the potential outcomes framework. Step 1: Regress M on T: the new design reduces perceived wait by 1.8 minutes (path a). Step 2: Regress Y on T and M: each minute of perceived wait reduces request probability by 1.4pp (path b), and the direct effect of the redesign holding perceived wait constant is +1.5pp (path c'). ACME = a × b = 1.8 × 1.4 = 2.52pp. ADE = 1.5pp. Total = 4.02pp (close to the observed 4.2pp).
Results: 63% of the total effect is mediated through perceived wait time (ACME = 2.5pp), while 37% is the direct effect of the simpler UX (ADE = 1.5pp). Bootstrap 95% CI for the proportion mediated: [52%, 74%]. Sensitivity analysis (Imai et al.'s ρ parameter): results hold unless the correlation between unmeasured confounders of the M→Y relationship exceeds ρ = 0.35.
Takeaway: The mediation analysis changed the Q2 roadmap. Instead of further UX simplification (which would have addressed only the 37% ADE), the team invested in: (1) more granular ETA predictions (driver-specific models), (2) animated progress indicators, and (3) under-promise/over-deliver ETA calibration. The follow-up experiment on these ETA-focused improvements yielded an additional 3.1pp lift, validating the mediation finding.
Business Context: TurboTax redesigned the tax interview to use plain-language questions instead of tax jargon (e.g., "Did you have a job?" instead of "Do you have W-2 income?"). An A/B test showed a 5.1pp lift in filing completion. But the product team has competing theories for WHY it works: (A) users feel more confident because they understand the questions, leading them to continue; (B) the simpler interface reduces cognitive load, making it faster to finish; (C) fewer jargon terms means fewer users Googling mid-flow and getting lost. Understanding the mechanism determines the next investment.
Data Setup: Treatment T: new plain-language interview (binary). Mediator M: user confidence score, measured via a 3-question in-app micro-survey at the 40% completion mark ("How confident are you that you're entering information correctly?", on a 1-7 scale). This measurement point is after enough treatment exposure to affect confidence but early enough that many users haven't yet decided to abandon. Outcome Y: filing completion. n = 120K users (60K per arm; a 35% survey response rate yields ~42K users with mediator data).
Methodology: Path a: New interview → +0.9 points on the confidence scale (p < 0.001). Path b: Each +1 confidence point → +3.1pp completion probability (controlling for treatment). Path c' (ADE): +2.3pp direct effect on completion (holding confidence constant). ACME = 0.9 × 3.1 = 2.79pp. ADE = 2.3pp. Total = 5.09pp. Proportion mediated = 55%.
Sensitivity & Robustness: (1) Selection into the survey: compare survey responders vs. non-responders on observables; no significant differences. (2) Sequential ignorability: the critical untestable assumption. A post-treatment confounder like "tax situation complexity" could affect both confidence and completion. Including form count and AGI as covariates in the mediator model, the ACME drops slightly to 2.4pp (still 47% mediated). (3) Alternative mediator: time-to-40%-completion (a cognitive-load proxy) mediates an additional 18% of the total effect, partially overlapping with confidence.
Takeaway: Confidence is the dominant mechanism (55%), cognitive load adds another 18%, and the remaining ~27% is through other pathways (possibly the "less Googling" hypothesis, which they couldn't directly measure). The product team's Q2 investment: (1) expand plain-language to the review and filing sections (confidence pathway), (2) add "you're doing great" progress affirmations at key milestones, (3) contextual tooltips that explain why a question is asked (addressing remaining uncertainty). The follow-up experiment on confidence-focused features yielded +2.8pp additional completion lift.
This is the simplest and most widely used approach. Mediation analysis reduces to 3 ordinary linear regressions - nothing more. Here's exactly how it works, step by step.
```python
# Baron & Kenny Mediation - The Linear Regression Approach
# This is the classical method: just 3 regressions + a multiplication
import numpy as np
import statsmodels.api as sm

np.random.seed(42)
n = 3000

# --- Simulate: TurboTax simplified interview → confidence → completion ---
T = np.random.binomial(1, 0.5, n).astype(float)    # Treatment: new interview
M = 4.0 + 0.9 * T + np.random.normal(0, 1.2, n)    # Mediator: confidence (1-7)
Y = 0.3 + 0.023 * T + 0.031 * M + np.random.normal(0, 0.15, n)  # Outcome: completion

# --- REGRESSION 1: Total Effect (Y ~ T) ---
reg1 = sm.OLS(Y, sm.add_constant(T)).fit()
c_total = reg1.params[1]
print("=== Regression 1: Y ~ T (Total Effect) ===")
print(f"  c (total effect) = {c_total:.4f}, p = {reg1.pvalues[1]:.4f}")

# --- REGRESSION 2: Treatment → Mediator (M ~ T) ---
reg2 = sm.OLS(M, sm.add_constant(T)).fit()
a = reg2.params[1]
print("\n=== Regression 2: M ~ T (Treatment → Mediator) ===")
print(f"  a = {a:.4f}, p = {reg2.pvalues[1]:.4f}")

# --- REGRESSION 3: Both → Outcome (Y ~ T + M) ---
X3 = sm.add_constant(np.column_stack([T, M]))
reg3 = sm.OLS(Y, X3).fit()
c_prime = reg3.params[1]   # Direct effect (ADE)
b = reg3.params[2]         # Mediator → Outcome
print("\n=== Regression 3: Y ~ T + M (Direct + Mediator) ===")
print(f"  c' (direct/ADE) = {c_prime:.4f}, p = {reg3.pvalues[1]:.4f}")
print(f"  b (M → Y)       = {b:.4f}, p = {reg3.pvalues[2]:.4f}")

# --- DECOMPOSITION ---
acme = a * b
total = acme + c_prime
pct = acme / total * 100
print("\n=== Mediation Decomposition ===")
print(f"  Indirect (ACME) = a × b = {a:.4f} × {b:.4f} = {acme:.4f}")
print(f"  Direct (ADE)    = c' = {c_prime:.4f}")
print(f"  Total           = {total:.4f} (should ≈ c = {c_total:.4f})")
print(f"  % Mediated      = {pct:.1f}%")

# --- BARON & KENNY 4 CONDITIONS ---
print("\n=== Baron & Kenny Checklist ===")
print(f"  1. c significant?  {'✓' if reg1.pvalues[1] < 0.05 else '✗'} (p={reg1.pvalues[1]:.4f})")
print(f"  2. a significant?  {'✓' if reg2.pvalues[1] < 0.05 else '✗'} (p={reg2.pvalues[1]:.4f})")
print(f"  3. b significant?  {'✓' if reg3.pvalues[2] < 0.05 else '✗'} (p={reg3.pvalues[2]:.4f})")
print(f"  4. c' < c?         {'✓' if abs(c_prime) < abs(c_total) else '✗'} ({c_prime:.4f} < {c_total:.4f})")
print(f"  → {'Full' if reg3.pvalues[1] > 0.05 else 'Partial'} mediation")
```
Linear OLS is fine when:
- the mediator and outcome models are approximately linear,
- there is no treatment × mediator interaction (so ACME = a × b holds), and
- the mediator and outcome are continuous (or close enough that linear approximations are acceptable).
You need something fancier when:
- the outcome or mediator is binary (logit/probit mediation formulas or simulation-based estimation),
- treatment and mediator interact (the ACME then differs by treatment arm), or
- you want nonlinear/ML outcome models (the simulation-based approach of Imai et al.).
```python
# Causal Mediation Analysis - sklearn version with bootstrap CIs
import numpy as np
from sklearn.linear_model import LinearRegression

def mediation_analysis(treatment, mediator, outcome, n_bootstrap=1000):
    """Estimate ACME, ADE, and Total Effect with bootstrap CIs."""
    n = len(treatment)
    T, M, Y = treatment.reshape(-1, 1), mediator, outcome

    # Path a: Treatment → Mediator
    model_m = LinearRegression().fit(T, M)
    a = model_m.coef_[0]

    # Paths b and c': [Treatment, Mediator] → Outcome
    X_full = np.column_stack([treatment, mediator])
    model_y = LinearRegression().fit(X_full, Y)
    c_prime = model_y.coef_[0]  # ADE (direct)
    b = model_y.coef_[1]

    acme = a * b       # Average Causal Mediation Effect (indirect)
    ade = c_prime      # Average Direct Effect
    total = acme + ade

    # Bootstrap confidence interval for the ACME
    acme_boot = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, n, replace=True)
        m_b = LinearRegression().fit(T[idx], M[idx])
        X_b = np.column_stack([treatment[idx], mediator[idx]])
        y_b = LinearRegression().fit(X_b, Y[idx])
        acme_boot.append(m_b.coef_[0] * y_b.coef_[1])
    ci = np.percentile(acme_boot, [2.5, 97.5])

    return {'ACME': acme, 'ADE': ade, 'Total': total,
            'Prop_Mediated': acme / total, 'ACME_CI': ci}

# Example: Uber app redesign → ETA perception → ride requests
np.random.seed(42)
n = 3000
treatment = np.random.binomial(1, 0.5, n).astype(float)
# Mediator: perceived wait time (lower = better)
perceived_wait = 8 - 2*treatment + np.random.normal(0, 2, n)
# Outcome: ride request propensity
ride_request = 0.5 + 0.05*treatment - 0.03*perceived_wait + np.random.normal(0, 0.1, n)

result = mediation_analysis(treatment, perceived_wait, ride_request)
print(f"ACME (indirect via ETA): {result['ACME']:.4f}")
print(f"ADE (direct effect):     {result['ADE']:.4f}")
print(f"Total Effect:            {result['Total']:.4f}")
print(f"% Mediated:              {result['Prop_Mediated']:.1%}")
print(f"ACME 95% CI: [{result['ACME_CI'][0]:.4f}, {result['ACME_CI'][1]:.4f}]")
```
Not everyone responds the same way. Uplift modeling estimates the Conditional Average Treatment Effect (CATE): the treatment effect for each individual or segment.
Standard A/B tests estimate the ATE = E[Y(1) − Y(0)]. Uplift modeling estimates the CATE (Conditional Average Treatment Effect) τ(x) = E[Y(1) − Y(0) | X=x]: the treatment effect as a function of individual characteristics. The fundamental challenge: you observe Y(1) or Y(0) for each unit, never both.
Meta-learners: The T-Learner trains separate outcome models for treatment and control: τ̂(x) = μ̂₁(x) − μ̂₀(x). The S-Learner trains one model with treatment as a feature, μ̂(x, t), and computes τ̂(x) = μ̂(x, 1) − μ̂(x, 0). The X-Learner is a three-stage procedure that imputes individual treatment effects and uses propensity scores to weight between treatment- and control-group estimates; it is more efficient when the groups are imbalanced.
The DR-Learner (doubly robust) combines propensity scoring with outcome modeling for robustness. Causal forests (Wager & Athey) adapt random forests to split on treatment effect heterogeneity rather than outcome prediction. Evaluation: the uplift curve (Qini curve) - sort users by predicted CATE, then plot cumulative incremental outcome vs. fraction targeted.
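A compact T-Learner sketch on simulated promo data. The segmenting feature, effect sizes, and model choice (sklearn gradient boosting) are all illustrative, not from the articles:

```python
# T-Learner CATE sketch with a simple segment-level uplift check (simulated)
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(4)
n = 20_000
X = np.random.normal(0, 1, (n, 3))        # e.g. tenure, ride count, trip value
T = np.random.binomial(1, 0.5, n)         # randomized promo assignment
tau = 0.05 + 0.10 * (X[:, 0] > 0)         # heterogeneous true uplift
p = np.clip(0.20 + 0.05 * X[:, 1] + tau * T, 0, 1)
Y = np.random.binomial(1, p)              # e.g. ride within 7 days

# T-Learner: separate outcome models for treatment and control
mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])
mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
cate = mu1.predict(X) - mu0.predict(X)    # τ̂(x) = μ̂1(x) - μ̂0(x)

# Sanity check: average predicted uplift by true-uplift segment
high = cate[X[:, 0] > 0].mean()
low = cate[X[:, 0] <= 0].mean()
print(f"predicted CATE | high-uplift segment: {high:.3f}")   # true value 0.15
print(f"predicted CATE | low-uplift segment:  {low:.3f}")    # true value 0.05
```

Ranking held-out users by `cate` and plotting cumulative incremental outcome against the fraction targeted gives the Qini curve described above.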
A store sends everyone a 20%-off coupon and sees a 5% sales lift overall. But not everyone is the same:
• "Sure Things": would have bought anyway. The coupon just costs you margin.
• "Persuadables": buy BECAUSE of the coupon. This is where your ROI lives.
• "Lost Causes": won't buy regardless. The coupon is wasted.
• "Sleeping Dogs": the coupon actually annoys them (reminds them to unsubscribe). Negative effect.
Uplift modeling finds these groups. For each customer, it asks: "What's the DIFFERENCE in purchase probability with vs. without the coupon?", not just "Will they buy?" A high-probability buyer isn't necessarily someone you should target (they might be a "Sure Thing"). You want the people with the biggest gap between the coupon and no-coupon worlds.
The magic: By targeting only "Persuadables" (say top 30% by predicted uplift), you get 80% of the incremental sales at 30% of the cost. The remaining 70% of coupons were either wasted (Sure Things, Lost Causes) or actively harmful (Sleeping Dogs).
Business Context: Uber spends $200M+/year on rider promos (discounts, free rides, credits). The blunt approach: blast $5 credits to all riders who haven't ridden in 14+ days. Marketing suspects they're wasting money on two groups: (1) "sure things" who would have come back anyway, and (2) "lost causes" who won't return even with $5. They need to find the "persuadables": users where the promo actually changes behavior.
Data Setup: Randomized promo experiment: 1M lapsed riders (14+ days since last ride), 50/50 split between $5 credit (treatment) and no credit (control). Features for the CATE model: (1) days since last ride, (2) lifetime ride count, (3) average trip value, (4) urban/suburban/rural, (5) last ride rating, (6) device type, (7) signup channel, (8) local competitor presence, (9) time since signup. Outcome: ride within 7 days (binary).
Methodology: Train a T-Learner (separate GBM models for the treatment and control groups) to estimate τ(x) = E[Y | T=1, X=x] − E[Y | T=0, X=x] for each user. Validate using the "uplift curve" on a held-out test set: rank users by predicted CATE, then plot cumulative incremental rides vs. fraction of population targeted. Compare to a random-targeting baseline.
Results: Average treatment effect (ATE) across all users: +3.2pp ride probability. But the distribution is highly skewed:
• Top decile (persuadables): CATE = +12.4pp. Profile: suburban, 30–90 days lapsed, 10–30 lifetime rides, signed up via referral.
• Middle 60%: CATE = +2–4pp. Marginal ROI; the promo barely covers its cost.
• Bottom 20%: CATE = −0.5pp to +0.5pp. "Sure things" (daily commuters, CATE ≈ 0 because they'd ride anyway) and "lost causes" (1-ride-and-done users).
• Surprise finding: 8% of users have negative CATE; the promo email actually reduces their ride probability (possibly because it reminds them they stopped using Uber for a reason).
Takeaway: By targeting only the top 30% by CATE, Uber achieves 78% of the total incremental rides at 30% of the promo cost. Annualized savings: $47M in promo spend. The negative-CATE finding led to removing those users from all promo campaigns (including email), reducing unsubscribe rates by 15%.
Business Context: TurboTax shows an upgrade nudge ("Unlock Deluxe for $59 – maximize your deductions") to Free Edition users at the deductions section. The average take-rate is 5.2%. But product managers suspect the nudge might be hurting filing completion for users who don't need Deluxe: the interruption causes confusion or frustration. They want to personalize: show the nudge only to users where it helps, and suppress it where it hurts.
Data Setup: Randomized experiment: 800K Free Edition users, 50% see the nudge, 50% see nothing. Features: (1) has Schedule C income, (2) number of deductions started, (3) estimated refund amount so far, (4) prior-year product tier, (5) time spent on return so far, (6) state, (7) filing status, (8) device, (9) entry channel. Two outcomes: (a) upgrade to Deluxe, (b) filing completion.
Methodology: X-Learner (more efficient than the T-Learner when the groups being modeled are imbalanced). Estimate CATE on both outcomes separately. Cross-validate with a DR-Learner (doubly robust) for robustness. Segment the CATE distribution into 4 groups using the "persuasion-annoyance" framework: High-Upgrade-CATE + Neutral-Completion-CATE = "Sweet Spot"; High-Upgrade + Negative-Completion = "Costly Conversion"; Low-Upgrade + Negative-Completion = "Just Annoying"; Low-Upgrade + Positive-Completion = "Ignore Nudge."
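A compact sketch of the X-Learner's three stages on simulated data (GBM base learners and a logistic propensity model; illustrative only, not Intuit's production pipeline):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 8000
X = rng.normal(size=(n, 4))
T = rng.binomial(1, 0.2, n)                 # imbalanced arms: only 20% treated
tau_true = 0.4 + 0.3 * X[:, 0]              # heterogeneous effect, true ATE = 0.4
Y = X[:, 1] + T * tau_true + rng.normal(scale=0.5, size=n)

# Stage 1: outcome model per arm
mu1 = GradientBoostingRegressor(random_state=0).fit(X[T == 1], Y[T == 1])
mu0 = GradientBoostingRegressor(random_state=0).fit(X[T == 0], Y[T == 0])

# Stage 2: impute individual effects, then regress them on X
d1 = Y[T == 1] - mu0.predict(X[T == 1])     # treated: actual minus predicted control
d0 = mu1.predict(X[T == 0]) - Y[T == 0]     # control: predicted treated minus actual
tau1 = GradientBoostingRegressor(random_state=0).fit(X[T == 1], d1)
tau0 = GradientBoostingRegressor(random_state=0).fit(X[T == 0], d0)

# Stage 3: blend with the propensity score g(x) = P(T=1|X=x)
g = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
tau_hat = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
print(f"Estimated ATE: {tau_hat.mean():.3f} (true 0.4)")
```

The propensity weighting in stage 3 leans on τ̂₀ (fit on the large control group) where treatment is rare, which is exactly why the X-Learner helps under imbalance.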
Results:
• "Sweet Spot" (22% of users): Schedule C filers, multiple deduction types, CATE_upgrade = +14.8pp, CATE_completion = +0.3pp. They genuinely need Deluxe and the nudge helps them find the right product.
• "Costly Conversion" (11%): Users with moderate complexity who upgrade but then face Deluxe's more detailed interview and abandon. CATE_upgrade = +8.2pp, CATE_completion = −3.1pp. Net revenue negative after accounting for lost completions.
• "Just Annoying" (31%): Simple W-2 filers. CATE_upgrade = +1.1pp, CATE_completion = −2.4pp. The nudge interrupts their flow for minimal conversion.
• "Ignore Nudge" (36%): CATE_upgrade = +0.4pp, CATE_completion = +0.1pp. No effect either way.
Takeaway: Showing the nudge only to "Sweet Spot" users keeps upgrade revenue at 83% of the blanket-nudge approach, while filing completion improves by 1.4pp overall (recovering the completions lost from the "Annoying" and "Costly" segments). Net impact: +$3.2M revenue per tax season. The "Costly Conversion" finding was especially important: it led to a new "Deluxe Preview" flow that lets users see what additional Deluxe questions they'd face before committing.
```python
# Uplift Modeling with T-Learner
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def t_learner(X_train, treatment, y_train, X_pred):
    """T-Learner: separate models for treated and control."""
    # Model for treated
    m1 = GradientBoostingRegressor(n_estimators=100)
    m1.fit(X_train[treatment == 1], y_train[treatment == 1])
    # Model for control
    m0 = GradientBoostingRegressor(n_estimators=100)
    m0.fit(X_train[treatment == 0], y_train[treatment == 0])
    # CATE = E[Y|T=1,X] - E[Y|T=0,X]
    cate = m1.predict(X_pred) - m0.predict(X_pred)
    return cate

# Example: Uber promo targeting
np.random.seed(42)
n = 10000
X = pd.DataFrame({
    'days_since_last_ride': np.random.exponential(14, n),
    'lifetime_rides': np.random.poisson(20, n),
    'is_suburban': np.random.binomial(1, 0.4, n),
    'avg_trip_value': np.random.normal(15, 5, n),
})
treatment = np.random.binomial(1, 0.5, n)
# True CATE: high for lapsed suburban riders, low for frequent urban
true_cate = (0.02 + 0.08 * (X['days_since_last_ride'] > 14).astype(float)
             + 0.05 * X['is_suburban'] - 0.001 * X['lifetime_rides'])
y = 0.3 + treatment * true_cate + np.random.normal(0, 0.1, n)

cate_hat = t_learner(X.values, treatment, y, X.values)

# Target top 20% CATE for promo
threshold = np.percentile(cate_hat, 80)
targeted = cate_hat >= threshold
print(f"Avg CATE (all): {cate_hat.mean():.4f}")
print(f"Avg CATE (targeted): {cate_hat[targeted].mean():.4f}")
print(f"Avg CATE (others): {cate_hat[~targeted].mean():.4f}")
```
Build a Bayesian time series model of what WOULD have happened without intervention. The gap between predicted counterfactual and observed data is the causal effect.
BSTS (Brodersen et al., 2015; the method behind Google's CausalImpact) models the treated unit's time series as a state-space model with three components: (1) a local linear trend (level + slope, both evolving as random walks), (2) a seasonal component (fixed or evolving), (3) a regression on control time series (untreated units whose trajectories help predict the treated unit).
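The three components can be simulated directly to build intuition for what the model decomposes; a sketch with made-up scales (weekly data, one control series, the coefficient `beta` chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 104  # two years of weekly data

# (1) Local linear trend: slope and level both evolve as random walks
slope = 0.3 + np.cumsum(rng.normal(0, 0.02, T))
level = 100 + np.cumsum(slope + rng.normal(0, 0.5, T))

# (2) Seasonal component: a fixed 52-week cycle
season = 5 * np.sin(2 * np.pi * np.arange(T) / 52)

# (3) Regression on a control series (an untreated unit sharing macro drivers)
control = 80 + 0.25 * np.arange(T) + rng.normal(0, 1, T)
beta = 0.4  # illustrative regression coefficient

# Observed treated series = trend + seasonality + control signal + noise
y = level + season + beta * control + rng.normal(0, 1, T)
print(y.shape)
```

Fitting BSTS is the inverse of this generative story: given only `y` and `control`, infer the latent trend, seasonality, and beta, then extrapolate them past the intervention date.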
The model is fit on the pre-intervention period via Bayesian MCMC. Spike-and-slab priors on regression coefficients perform automatic variable selection among control series. Post-intervention, the model generates posterior predictive draws of the counterfactual. The causal effect at each time point is: actual − counterfactual. Because it's Bayesian, you get a full posterior distribution over the effect (not just a point estimate and CI), including P(effect > 0).
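Once you have posterior predictive draws of the counterfactual, the effect posterior, credible interval, and P(effect > 0) are simple array operations. A sketch, with simulated Gaussian draws standing in for real MCMC output (the 138 / 124.5 figures echo the billboard example below, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_post = 1000, 8            # posterior draws x post-period weeks

actual = np.full(n_post, 138.0)      # observed rides (K/week) post-intervention
# Stand-in for MCMC posterior predictive draws of the counterfactual series
counterfactual_draws = rng.normal(loc=124.5, scale=2.5, size=(n_draws, n_post))

# One effect trajectory per posterior draw
effect_draws = actual - counterfactual_draws
avg_effect = effect_draws.mean(axis=1)   # average weekly effect, per draw

lo, hi = np.percentile(avg_effect, [2.5, 97.5])
print(f"Posterior mean effect: {avg_effect.mean():.1f}K rides/week")
print(f"95% credible interval: [{lo:.1f}, {hi:.1f}]")
print(f"P(effect > 0) = {(avg_effect > 0).mean():.3f}")
```

The tail probability is just the fraction of draws above zero; no test statistic or asymptotic approximation is involved.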
Advantages over synthetic control: (1) Handles trends and seasonality explicitly. (2) Provides time-varying effects (you see the effect emerge and decay). (3) Bayesian uncertainty quantification. Disadvantage: More model assumptions (distributional, structural); synthetic control is more "nonparametric."
Imagine you started running every morning and want to know if it improved your sleep. You have 6 months of sleep data before running and 2 months after. You also track your friend's sleep (who didn't start running) and the local temperature.
BSTS builds a prediction model: Using your pre-running sleep patterns, seasonal habits, and your friend's sleep (as a "control"), it learns to predict your sleep quality. Then it asks: "Based on everything I know, what WOULD your sleep have looked like these past 2 months if you hadn't started running?"
The counterfactual: The model might say "you would have slept 6.5 hours/night." You actually slept 7.2 hours. The gap (0.7 hours) is the estimated effect of running. Your friend's data is key β if everyone's sleep improved in spring (seasonal), the model accounts for that and doesn't credit it to your running.
Why Bayesian matters: Instead of saying "the effect is 0.7 hours ± 0.3," it says "there's a 96% probability the effect is positive, and the most likely range is 0.4–1.0 hours." This is much more useful for decision-making: "I'm 96% sure running helps my sleep" is a clearer statement than "p = 0.04."
Business Context: Uber's brand marketing team ran a 4-week outdoor billboard campaign in Chicago ("Your Uber is 3 minutes away" with real-time ETAs on digital billboards). Total spend: $1.8M. The challenge: no pre-registered hold-out (marketing decided late to measure impact), so there's no clean control group. Uber needs to estimate what Chicago's ride volume would have been without the campaign.
Data Setup: 52 weeks of pre-campaign weekly ride data for Chicago + 20 control cities with no billboard campaign. Post-treatment: 4 campaign weeks + 4 post-campaign weeks. Control series used as regression predictors: Detroit, Milwaukee, Indianapolis, St. Louis, Minneapolis (selected for similar Midwest seasonality, population density, and ride patterns). Additional regressors: week-of-year seasonal dummies, Chicago temperature, Chicago events calendar.
Methodology: BSTS decomposes Chicago's ride series into: (1) local linear trend, (2) seasonal component (52-week cycle), (3) regression component (weighted combination of control cities). The model is fit on the 52 pre-campaign weeks via Bayesian MCMC (1000 posterior samples). For each posterior draw, predict the counterfactual for the 8 post-campaign weeks. The distribution of (actual − counterfactual) gives a full posterior over the causal effect, including uncertainty.
Results: During the 4-week campaign: actual = 138K rides/week, counterfactual posterior mean = 124.5K. Average weekly effect: +13.5K rides (posterior 95% CI: [8.8K, 18.2K]). Posterior probability of a positive effect: 99.7%. During the 4 post-campaign weeks, the effect decays from +11K in week 1 to +3K in week 4 (half-life ≈ 2.3 weeks). Cumulative incremental rides: 82K. At $12 average Uber take per ride: $984K incremental revenue vs. $1.8M campaign cost. Short-term ROAS = 0.55, well below break-even.
Takeaway: The BSTS analysis showed the billboard campaign was unprofitable on direct ride revenue. But the decay curve was informative: awareness effects lasted only ~5 weeks. Marketing shifted budget from billboards to retargeting digital ads (where BSTS on a separate test showed ROAS = 2.3). The Bayesian framework was critical: unlike frequentist methods, it gave a full probability distribution over ROI, allowing the finance team to make risk-adjusted budget decisions.
Business Context: TurboTax launched "Snap & File" (photograph your W-2 → auto-populated return → file in under 10 minutes) in week 4 of tax season. There's no control group: the feature is available to all users. The product team needs to separate the Snap & File lift from: (1) natural seasonal ramp-up, (2) marketing spend increases in January, (3) the IRS processing timeline (refund delays affect filing volume). Simply comparing week 4 vs. week 3 conflates all these factors.
Data Setup: Outcome: daily new TurboTax filing starts (nationwide). Pre-treatment: 3 prior tax seasons (TY21–TY23) of daily data + the first 3 weeks of TY24. Control series: (1) H&R Block online starts (public investor data, weekly), (2) IRS total e-file receipts (weekly), (3) Google Trends index for "file taxes" (daily). These control series capture the "background" forces affecting all tax prep, not just TurboTax.
Methodology: BSTS with control series as regressors. The model learns that TurboTax's daily starts are a predictable function of: seasonal pattern (day-of-season), IRS e-file volume (macro readiness), and Google Trends (intent). Spike-and-slab priors on the regression coefficients perform automatic variable selection. Post-Snap-&-File launch, the model predicts the counterfactual (what TurboTax starts would have been with the same seasonal/macro conditions but no Snap & File).
Results: Actual cumulative starts through week 8: 2.35M. Counterfactual: 2.09M. Estimated causal impact of Snap & File: +260K incremental starts (95% CI: [195K, 325K]). The effect is concentrated in the first 2 weeks post-launch (early adopters who were waiting for an easier way to file) and then settles to a steady +15K/week above counterfactual. Decomposition: 70% of Snap & File users are new to TurboTax (acquisition), 30% are returning users who filed earlier than they otherwise would have (pull-forward, which the team verified doesn't cannibalize later-season volume).
Takeaway: 260K incremental starts × 72% completion rate × $85 average revenue = $15.9M incremental revenue from Snap & File in its first season. This justified the $4M engineering investment (4:1 ROI) and made the case for expanding the concept to 1099s and state returns. The BSTS control series were critical: without them, the team would have attributed seasonal ramp-up to the feature, overstating the impact by ~40%.
```python
# BSTS / CausalImpact-style Analysis
import numpy as np
from sklearn.linear_model import BayesianRidge

def causal_impact_simple(y_treated, y_controls, intervention_idx):
    """Simple CausalImpact: fit on pre-period, predict post counterfactual."""
    # Pre-period: fit model using control series to predict treated
    y_pre = y_treated[:intervention_idx]
    X_pre = y_controls[:, :intervention_idx].T
    X_post = y_controls[:, intervention_idx:].T

    model = BayesianRidge()
    model.fit(X_pre, y_pre)

    # Predict counterfactual for post-period
    y_cf, y_std = model.predict(X_post, return_std=True)
    y_actual = y_treated[intervention_idx:]

    point_effect = y_actual - y_cf
    cumulative_effect = np.cumsum(point_effect)
    return {
        'avg_effect': point_effect.mean(),
        'cumulative': cumulative_effect[-1],
        'ci_lower': (point_effect - 1.96 * y_std).mean(),
        'ci_upper': (point_effect + 1.96 * y_std).mean(),
        'counterfactual': y_cf,
        'actual': y_actual,
    }

# Example: Uber billboard campaign in Chicago
np.random.seed(42)
weeks = 30
intervention = 18  # Campaign starts week 18
trend = np.linspace(100, 130, weeks)

# Control cities (no campaign)
detroit = trend * 0.7 + np.random.normal(0, 3, weeks)
milwaukee = trend * 0.5 + np.random.normal(0, 2, weeks)
controls = np.array([detroit, milwaukee])

# Chicago (treated): extra 13K rides/week post-campaign
chicago = trend + np.random.normal(0, 3, weeks)
chicago[intervention:] += 13  # True causal effect

result = causal_impact_simple(chicago, controls, intervention)
print(f"Avg weekly effect: {result['avg_effect']:.1f}K rides")
print(f"95% CI: [{result['ci_lower']:.1f}, {result['ci_upper']:.1f}]")
print(f"Cumulative: {result['cumulative']:.0f}K rides over post-period")
```
In experiments where not everyone complies (e.g., assigned to treatment but doesn't take it), CACE estimates the effect among those who actually would comply with their assignment.
In experiments with non-compliance, there are four latent subpopulations (in the principal strata framework): Compliers (take treatment when assigned, don't when not), Always-takers (take treatment regardless), Never-takers (don't take regardless), and Defiers (do the opposite of assignment). Under the monotonicity assumption (no defiers), the ITT is a mixture: ITT = π_c · CACE + π_a · 0 + π_n · 0, where π_c is the complier fraction; the always-taker and never-taker terms are zero because assignment doesn't change their treatment status, so under the exclusion restriction it can't change their outcomes.
The Wald estimator CACE = ITT / π_c = ITT / (E[T|Z=1] − E[T|Z=0]) is algebraically equivalent to the IV estimator using Z as an instrument for T. This connects to the LATE (Local Average Treatment Effect) framework: CACE and LATE are the same concept.
Key point: CACE ≠ ATE. It's the effect specifically for compliers, which may differ from the effect on always-takers or never-takers. In one-sided non-compliance (control can't access treatment), always-takers don't exist, and the compliance rate simplifies to E[T|Z=1].
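The Wald/IV equivalence is easy to verify numerically. For a binary instrument, the sample Wald ratio and the covariance-ratio IV estimator coincide exactly; a toy simulation with one-sided non-compliance and a true complier effect of 2.0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000
Z = rng.binomial(1, 0.5, n)          # random assignment (the instrument)
complier = rng.binomial(1, 0.6, n)   # 60% compliers; one-sided non-compliance
T = Z * complier                     # treatment actually taken
Y = 2.0 * T + rng.normal(size=n)     # true effect for compliers = 2.0

# Wald estimator: ITT on Y divided by ITT on T
itt_y = Y[Z == 1].mean() - Y[Z == 0].mean()
itt_t = T[Z == 1].mean() - T[Z == 0].mean()
wald = itt_y / itt_t

# IV estimator with Z instrumenting T: cov(Y, Z) / cov(T, Z)
iv = np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]

print(f"Wald: {wald:.3f}, IV: {iv:.3f}")  # identical values, near the true 2.0
```

Both numerator and denominator of the covariance ratio carry the same n₁n₀/n factor for a binary Z, so the two estimators agree to floating-point precision.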
You mail a gym membership coupon to 1,000 people. Only 300 of them actually go to the gym (the other 700 never use the coupon). After 3 months, the 1,000 coupon-recipients lost an average of 1.5 lbs more than people who didn't get a coupon.
But that 1.5 lbs is diluted: it averages in the 700 people who never went! The 300 who actually used the gym probably lost much more. CACE says: 1.5 lbs / 30% compliance = 5 lbs for those who actually went.
Why not just compare gym-goers to non-goers? Because that's biased: gym-goers are the motivated ones who would have lost weight anyway (selection bias). CACE avoids this by only using the RANDOM variation in gym access (the coupon), scaled up to account for the people who ignored it.
Two useful numbers from one experiment: The ITT (1.5 lbs) answers "What happens if we mail coupons to everyone?" (useful for budgeting the mailing). The CACE (5 lbs) answers "How much does actually going to the gym help?" (useful for understanding the intervention's true potency and deciding whether to invest in activation).
Business Context: Uber tested a new "Earnings Guarantee" program: drivers were promised a minimum of $25/hour during peak hours if they stayed online. The experiment randomly assigned 10K drivers to receive the offer (Z = 1). But only 62% of assigned drivers actually activated the guarantee (T = 1); the rest didn't open the email, didn't understand the terms, or drove during off-peak hours. Uber can't force compliance. The ITT understates the program's potential; the per-protocol analysis (comparing only activators) is biased by self-selection (motivated drivers both activate and drive more hours regardless).
Data Setup: Z: random assignment to receive the offer (10K treatment, 10K control). T: actually activated the guarantee (6,200 in the treatment group, 0 in control: one-sided non-compliance). Y: weekly online hours in the 4 weeks post-assignment. Pre-period: 4 weeks of baseline hours for all drivers.
Methodology: CACE = ITT_Y / ITT_T. ITT_Y: assigned drivers drove 1.85 more hours/week than control (p = 0.003). ITT_T: assignment increased activation probability by 0.62 (62% in treatment vs. 0% in control). CACE = 1.85 / 0.62 = 2.98 hours/week. Interpretation: among "compliers" (drivers who would activate if offered but wouldn't otherwise), the guarantee increases driving by ~3 hours/week.
Results: ITT = +1.85 hrs/wk (the "policy effect" if you offer the program to everyone). CACE = +2.98 hrs/wk (the actual behavioral effect on drivers who engage). Per-protocol (naive): +4.1 hrs/wk (upward biased: activators are inherently more motivated). The CACE is the right number for: (1) cost-benefit analysis ($25/hr guarantee cost × 3 extra hours = $75/week per complier, generating ~$90/week in gross bookings, so the program is profitable), and (2) deciding whether to invest in activation UX vs. the guarantee itself (since the CACE is high, the bottleneck is activation, not the incentive's effectiveness).
Takeaway: The CACE analysis separated two problems: "Does the guarantee work for those who use it?" (yes, +3 hrs/wk) and "Can we get more people to use it?" (only 62% activated). The team invested in: (1) in-app push notifications instead of email (activation rose to 78% in the next test), and (2) simplified terms ("Drive peak, earn at least $25/hr" instead of the legalistic original). The combined effect in the follow-up: ITT rose from 1.85 to 2.67 hrs/wk, driven largely by the improved compliance.
Business Context: TurboTax experimented with showing a "Talk to an Expert – Free 5-Minute Consultation" prompt at the deductions section (the point of highest abandonment). The experiment randomly assigned 200K users to see the prompt (Z = 1) or not. Of those who saw it, only 24% clicked through and completed a consultation (T = 1). Leadership wants to know two things: (1) What's the value of the prompt itself (ITT)? (2) What's the value of the actual expert consultation (CACE)? These answer different business questions: the prompt cost is nearly zero, but expert staffing costs $35/session.
Data Setup: Z: randomly assigned to see the prompt (100K treatment, 100K control). T: actually completed a consultation (24K in treatment, ~200 in control via organic discovery; near-zero, treated as 0). Y: filing completion (binary). Compliance rate: ITT_T = 0.24.
Results: ITT on completion: +2.1pp (p = 0.001). Showing the prompt to everyone increases completion by 2.1pp, regardless of whether users click through. This includes a "reassurance effect": just knowing help is available may reduce anxiety. CACE = 2.1 / 0.24 = 8.75pp. Among users who would actually consult an expert if prompted (compliers), the consultation increases completion by ~9pp.
Business Implications: The ITT (2.1pp) justifies showing the prompt to all users: it's free and lifts completion. The CACE (8.75pp) feeds the expert staffing ROI model: 8.75pp × $120 average revenue = $10.50 incremental revenue per consultation vs. $35 cost. That's negative ROI on completion alone! But including downstream retention (expert-consulted users return at 15pp higher rates) and word-of-mouth makes it ROI-positive over a 2-year LTV. The CACE analysis prevented the team from either: (a) killing the program based on session-level ROI, or (b) over-scaling it based on the inflated per-protocol estimate of +18pp.
```python
# CACE: Complier Average Causal Effect
import numpy as np

def cace_estimate(assignment, treatment_taken, outcome):
    """Wald estimator for CACE under one-sided non-compliance."""
    # ITT: effect of assignment on outcome
    itt = outcome[assignment == 1].mean() - outcome[assignment == 0].mean()
    # First stage: effect of assignment on actual treatment
    compliance = (treatment_taken[assignment == 1].mean()
                  - treatment_taken[assignment == 0].mean())
    cace = itt / compliance
    return {'ITT': itt, 'compliance_rate': compliance, 'CACE': cace}

# Example: Intuit live expert prompt experiment
np.random.seed(42)
n = 6000
Z = np.random.binomial(1, 0.5, n)  # Random assignment to see prompt
# Only 25% of assigned users click through
T = Z * np.random.binomial(1, 0.25, n)  # One-sided non-compliance
# True effect of actually talking to expert: +8pp on completion
Y = np.random.binomial(1, np.clip(0.65 + 0.08 * T, 0, 1))

result = cace_estimate(Z, T, Y)
print(f"ITT (intent-to-treat): {result['ITT']:.4f}")
print(f"Compliance rate: {result['compliance_rate']:.4f}")
print(f"CACE (compliers): {result['CACE']:.4f}")  # ~0.08
```