Interactive study guide consolidating interview-level concepts — from hypothesis testing to XGBoost, with case studies at Uber & Intuit.
| Section | Category | Key Topics |
|---|---|---|
| 1. Hypothesis Testing | Foundations | t-test, z-test, Mann-Whitney, p-values, Type I/II error |
| 2. Confidence Intervals | Foundations | t-based CI, proportion CI, Wilson score, bootstrap CI |
| 3. Power & Sample Size | Experimental Design | Power analysis, Cohen's d, MDE, sample size calculation |
| 4. Multiple Testing | Experimental Design | Bonferroni, Benjamini-Hochberg, ANOVA, chi-squared |
| 5. Sequential Testing | Advanced Experimentation | mSPRT, always-valid p-values, O'Brien-Fleming, peeking |
| 6. Variance Reduction | Advanced Experimentation | CUPED, delta method, block bootstrap |
| 7. Bayesian & Bandits | Advanced Experimentation | Posterior inference, Thompson sampling, UCB, contextual bandits |
| 8. Distributions | Foundations | Normal, binomial, Poisson, exponential, CLT, MLE |
| 9. Regression & Core ML | Machine Learning | Linear/logistic regression, Ridge, Lasso, ROC/AUC |
| 10. Trees & Ensembles | Machine Learning | Decision trees, Random Forest, XGBoost, AdaBoost |
| 11. Clustering & Special | ML & Special Topics | K-means, PCA, DBSCAN, DAGs, Simpson's paradox, imbalanced data |
Hypothesis testing, confidence intervals, power analysis, distributions — the building blocks of statistical inference that every DS interview tests.
Multiple testing correction, sequential tests, variance reduction — running experiments correctly at scale, the way Uber and Intuit do it.
Bayesian inference, multi-armed bandits, causal reasoning — going beyond basic A/B testing into adaptive and observational methods.
Regression, tree-based models, clustering, dimensionality reduction — predictive modeling from linear models to XGBoost.
The bedrock of statistical inference — from null hypotheses to non-parametric alternatives.
Hypothesis testing is a decision framework. We assume H₀ (no effect) and compute P(data | H₀). The p-value is the probability of observing results as extreme as ours if H₀ is true. If p < α (typically 0.05), we reject H₀.
The t-statistic measures the signal-to-noise ratio — how many standard errors the observed difference is from zero. Welch's t-test (`ttest_ind` with `equal_var=False` in scipy) doesn't assume equal variances and is a safe default. For non-normal data or small samples, Mann-Whitney U provides a non-parametric alternative that compares rank orderings rather than means.
Always check assumptions: independence (each observation in one group only), approximate normality (Shapiro-Wilk test, Q-Q plots), or sufficient n for CLT (n > 30). The z-test for proportions uses SE = √(p(1-p)/n) where variance is fully determined by p itself — no need to estimate it from the sample.
Imagine flipping a coin 100 times and getting 60 heads. Is the coin unfair, or did you just get lucky? That's hypothesis testing — we start by assuming nothing special is happening (the null hypothesis) and ask "how surprising is this result?"
The p-value answers: "If the coin WERE fair, what's the chance of seeing 60+ heads?" If that chance is tiny (< 5%), we conclude the coin is probably unfair.
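The coin example can be computed directly with scipy's binomial distribution — a quick sketch of the p-value calculation described above:

```python
from scipy import stats

# P(60 or more heads in 100 flips of a fair coin)
p_one_sided = stats.binom.sf(59, 100, 0.5)   # P(X >= 60) ≈ 0.028
p_two_sided = 2 * p_one_sided                # doubling, since the null is symmetric
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

At the conventional α = 0.05, the one-sided result is "surprising" but the two-sided one is borderline — which is why the sidedness of the test should be chosen before looking at the data.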
The t-test applies this logic to comparing two groups — it asks whether the difference between group averages is too large to be explained by random variation alone. Key insight: a significant p-value means "unlikely under the null." It does NOT tell you the effect is large or practically meaningful.
Business Context: Testing whether Smart Match v2 (considering driver-rider compatibility) improves ride completion rate. Two-sample t-test on completion rates between treatment and control.
Setup: n=400K riders, 2-week test. Welch's t-test used because variance differs across rider segments (power commuters vs. casual users).
Results: Treatment mean 0.31pp higher, t=1.35, p=0.18. Supplemented with Mann-Whitney U (p=0.21) — both tests agree: not significant without variance reduction. This motivated applying CUPED (Section 6), which reduced SE enough to reach significance.
Takeaway: When both parametric and non-parametric tests agree, the conclusion is robust. The effect was real but masked by noise — a common scenario in marketplace experiments.
Business Context: Testing whether a redesigned prompt increases default browser switching rate. Binary outcome (switched / didn't).
Setup: Z-test for proportions since the outcome is binary. p_treatment=12.3%, p_control=11.8%. Assumptions: np ≥ 5 and n(1-p) ≥ 5 ✓, random assignment ✓.
Results: z=1.89, p=0.059. Not significant at α=0.05 but close. Team runs Mann-Whitney as robustness check, decides to extend test 1 more week rather than ship on a borderline result.
Takeaway: Borderline results (0.05 < p < 0.10) warrant extending the test, not making a binary ship/no-ship decision. Always compute effect size alongside the p-value.
```python
# Two-Sample Hypothesis Testing Workflow
import numpy as np
from scipy import stats

# Step 1: Explore data
control = df[df['group'] == 'control']['metric'].values
treatment = df[df['group'] == 'treatment']['metric'].values
print(f"Control: mean={control.mean():.3f}, std={control.std():.3f}, n={len(control)}")
print(f"Treatment: mean={treatment.mean():.3f}, std={treatment.std():.3f}, n={len(treatment)}")

# Step 2: Check normality (Shapiro is overly sensitive at large n, so subsample)
_, p_shapiro = stats.shapiro(control[:500])
print(f"Shapiro-Wilk p={p_shapiro:.4f}")

# Step 3: Welch's t-test (equal_var=False — doesn't assume equal variance)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"Welch's t-test: t={t_stat:.3f}, p={p_value:.4f}")

# Step 4: Non-parametric alternative
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative='two-sided')
print(f"Mann-Whitney U: U={u_stat:.0f}, p={p_mw:.4f}")

# Step 5: Effect size (Cohen's d)
pooled_std = np.sqrt((control.std()**2 + treatment.std()**2) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_std
print(f"Cohen's d: {cohens_d:.3f}")

# Step 6: Z-test for proportions (binary outcomes)
from statsmodels.stats.proportion import proportions_ztest
z_stat, p_prop = proportions_ztest([conv_treat, conv_ctrl], [n_treat, n_ctrl])
```
Quantifying uncertainty — how precise is your estimate?
A confidence interval provides a range of plausible values for an unknown parameter. The frequentist interpretation: if we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter.
For means, SE = s/√n and we use the t-distribution (converges to z for large n). For proportions, SE = √(p(1-p)/n) and variance is fully determined by p, so we use z directly.
The Wilson score interval improves on the normal approximation for extreme proportions (p < 0.01 or p > 0.99) or small n — it respects the [0,1] boundary and is asymmetric. Bootstrap CIs (resample 10K times, take percentiles) require no distributional assumptions and work for any statistic.
Think of a confidence interval as a "net" you cast to catch the true value. A 95% CI means if you went fishing 100 times with the same net, you'd catch the fish 95 times.
The net's width depends on two things: how noisy your data is (more noise = wider net) and how much data you have (more data = narrower net).
A CI that doesn't include zero for a treatment effect means "we're confident there IS an effect." A wide CI means "there's probably an effect, but we're not sure how big." If the CI is entirely above zero but narrow — that's the best case: confident effect with a precise estimate.
Business Context: Estimating price elasticity of demand for rides. Point estimate: -0.35 (a 10% price increase reduces demand by 3.5%).
Results: 95% CI: [-0.42, -0.28]. CI is narrow because n=2M rides. The entire CI is negative, confirming demand falls with price.
Business Impact: At the CI lower bound (-0.42), a 20% surge still reduces demand by only 8.4%, justifying current surge levels. The narrow CI gave leadership confidence to maintain the pricing strategy.
Business Context: Measuring completion rate for a new W-2 import flow.
Results: Treatment: 73.2% (95% CI: [72.1%, 74.3%]). Control: 71.4% (95% CI: [70.3%, 72.5%]). CIs overlap slightly, but z-test on the difference shows p=0.008.
Key Lesson: Overlapping CIs do NOT mean no significant difference — the CI on the difference is [0.5pp, 3.1pp], entirely above zero. Always test the difference directly.
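The lesson above can be verified numerically. This sketch reuses the case study's rates with a hypothetical n = 5,000 per arm (the actual sample size isn't stated): the individual CIs overlap, yet the test on the difference is significant.

```python
import numpy as np
from scipy import stats

p1, p2, n = 0.732, 0.714, 5000   # hypothetical n per arm, for illustration

# Individual 95% CIs (normal approximation)
ci1 = (p1 - 1.96 * np.sqrt(p1*(1-p1)/n), p1 + 1.96 * np.sqrt(p1*(1-p1)/n))
ci2 = (p2 - 1.96 * np.sqrt(p2*(1-p2)/n), p2 + 1.96 * np.sqrt(p2*(1-p2)/n))
overlap = ci1[0] < ci2[1]        # lower end of treatment CI sits below upper end of control CI

# Pooled z-test on the difference
p_pool = (p1 * n + p2 * n) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p1 - p2) / se
p_value = 2 * stats.norm.sf(z)   # significant despite the overlapping CIs
print(f"overlap={overlap}, p={p_value:.4f}")
```

The intuition: the SE of a difference is smaller than the sum of the individual CI half-widths, so "CIs touch" is a stricter criterion than the actual test.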
```python
# Confidence Intervals — Means vs Proportions
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

# CI for a difference in means (1.96 is the z critical value; ≈ t for large n)
mean_diff = treatment.mean() - control.mean()
se_diff = np.sqrt(treatment.var()/len(treatment) + control.var()/len(control))
ci_t = (mean_diff - 1.96 * se_diff, mean_diff + 1.96 * se_diff)

# CI for proportions (normal approximation)
p1, n1 = 0.732, 5000
se1 = np.sqrt(p1 * (1 - p1) / n1)
ci_normal = (p1 - 1.96 * se1, p1 + 1.96 * se1)

# Wilson score interval (better for small n or extreme p)
ci_wilson = proportion_confint(int(p1 * n1), n1, alpha=0.05, method='wilson')

# Bootstrap CI (works for any statistic; data is a 1-D array of the metric)
boot_means = [np.random.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]
ci_boot = np.percentile(boot_means, [2.5, 97.5])
```
How big a sample do you need to detect an effect? The most important pre-experiment question.
Power = P(reject H₀ | H₁ is true) = 1 - β. It depends on four quantities: α (significance level), n (sample size), σ (noise), and δ (true effect size).
Pre-experiment, you fix α (usually 0.05), target power (usually 0.80), estimate σ from historical data, and choose MDE (minimum detectable effect) from the business — then solve for n. Cohen's d = δ/σ standardizes effect size: 0.2 = small, 0.5 = medium, 0.8 = large.
For proportions, variance is p(1-p), so metrics near 50% need larger samples than those near 5% or 95%. Variance reduction techniques like CUPED effectively increase power without more data.
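Both claims in this section — variance p(1-p) peaks at 50%, and sample size scales with 1/MDE² — can be checked with the standard normal-approximation formula (baseline rates and MDEs below are illustrative):

```python
import numpy as np
from scipy import stats

def n_per_group(p, mde, alpha=0.05, power=0.80):
    """Per-group n for a two-sample z-test on a proportion metric."""
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return int(np.ceil(2 * p * (1 - p) * (z / mde) ** 2))

# Same 2pp MDE: a 50% baseline needs far more data than a 5% baseline
print(n_per_group(0.50, 0.02))
print(n_per_group(0.05, 0.02))

# Halving the MDE quadruples the required sample
ratio = n_per_group(0.45, 0.01) / n_per_group(0.45, 0.02)
print(f"ratio: {ratio:.2f}")
```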
Imagine trying to hear someone whisper in a noisy room. Power is the chance you'll actually hear them. To increase that chance: have them whisper louder (bigger effect size), quiet the room down (less variance), listen longer (more data), or be more willing to say you heard something (higher α, at the cost of false alarms).
Power analysis tells you: "How long do we need to run this test before we can reliably detect the improvement we care about?" An underpowered test is a waste — you run it but can't detect the effect.
Business Context: New ML model expected to reduce ETA error by 3%. Historical σ = 12%, so Cohen's d = 0.25 (small effect).
Power Analysis: At α=0.05, power=0.80: need n=252 per group. With city-level clustering (design effect ≈ 1.5), effective n = 378 per group. Test duration: 3 days at current traffic.
Takeaway: Small effects require careful planning. The clustering adjustment (design effect) nearly doubled the required sample — ignoring it would have led to an underpowered test.
Business Context: Testing whether an email nudge increases filing starts by 2pp (from 45% to 47%).
Power Analysis: σ = √(0.45×0.55) ≈ 0.497, so Cohen's d = 0.02/0.497 ≈ 0.040. At α=0.05, power=0.80: n ≈ 9,700 per group → 1 week. Team considers MDE=1pp for more sensitivity → n ≈ 38,900 per group → 4 weeks.
Takeaway: Halving the MDE quadruples the sample size. The team had to trade sensitivity for speed during the 12-week tax season.
```python
# Power Analysis & Sample Size Calculation
import numpy as np
from scipy import stats

def required_sample_size(alpha, power, effect_size):
    """Required n per group for a two-sample t-test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))

# Example: detect a 2pp lift in filing completion
baseline_rate = 0.45
mde = 0.02
sigma = np.sqrt(baseline_rate * (1 - baseline_rate))
cohens_d = mde / sigma
n_required = required_sample_size(0.05, 0.80, cohens_d)
print(f"Cohen's d: {cohens_d:.3f}")
print(f"Required n per group: {n_required:,}")

# Using statsmodels (more precise: solves the t-distribution exactly)
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
n_precise = analysis.solve_power(effect_size=cohens_d, alpha=0.05, power=0.80)
print(f"Statsmodels n per group: {int(np.ceil(n_precise)):,}")
```
When you test many hypotheses, false positives accumulate — here's how to control them.
With m independent tests at α=0.05, P(≥1 false positive) = 1 - (1-α)^m. For m=20, that's 64%. Bonferroni divides α by m — simple but very conservative, especially for large m.
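The accumulation formula is worth checking once by hand:

```python
# P(at least one false positive) = 1 - (1 - alpha)^m for m independent tests
alpha = 0.05
for m in [1, 5, 20, 100]:
    print(f"m={m:3d}: FWER = {1 - (1 - alpha) ** m:.3f}")
```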
Benjamini-Hochberg (BH) controls the False Discovery Rate: the expected proportion of rejected hypotheses that are false. Procedure: rank p-values, compare each p_(i) to (i/m)·α, reject all up to the largest that passes. BH is uniformly more powerful than Bonferroni.
Use Bonferroni for few high-stakes tests (2-3 variants). Use BH for many tests (feature screening, multi-metric dashboards). Always correct across ALL tests you ran, not just the ones with small p-values.
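The BH step-up rule described above takes only a few lines from scratch. The p-values below are illustrative, chosen so that 3 are raw-significant, Bonferroni keeps 1, and BH keeps 2:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest
    rank i such that p_(i) <= (i/m) * q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= (np.arange(1, m + 1) / m) * q
    k = int(np.max(np.nonzero(passed)[0]) + 1) if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.003, 0.008, 0.032, 0.060, 0.089, 0.12,
         0.23, 0.34, 0.41, 0.55, 0.67, 0.91]
print("BH rejections:", benjamini_hochberg(pvals).sum())          # 2
print("Bonferroni rejections:", (np.array(pvals) < 0.05/12).sum())  # 1
```

Note the step-up subtlety: 0.008 is rejected because it clears its rank-2 threshold (2/12 × 0.05 ≈ 0.0083), even though it fails the Bonferroni cutoff.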
Imagine checking 20 different metrics for an A/B test. Even if the treatment does nothing, you'd expect 1 metric to show p < 0.05 by pure chance (20 × 0.05 = 1). That's the multiple testing problem.
Bonferroni is the strict parent: "You can only claim significance if a result would be surprising even accounting for ALL the doors you opened."
BH is the pragmatic boss: "I accept that some findings might be false, but I want to keep that proportion low." In practice, BH finds more true effects while keeping the false discovery rate at 5%.
Business Context: Testing 4 new UI variants for the ride request flow, measuring CTR, completion rate, and time-to-book. 4 variants × 3 metrics = 12 tests.
Results: Without correction: 3 show p < 0.05. Bonferroni (α = 0.05/12 = 0.0042): only 1 survives. BH (FDR = 0.05): 2 survive.
Decision: Team uses BH as primary analysis, confirms surviving results with a focused follow-up test. The test that passed Bonferroni gets immediate rollout.
Business Context: Launching 3 features simultaneously, each with its own A/B test. Primary metric: filing completion.
Results: With Bonferroni threshold 0.05/3 = 0.017: Feature A (p=0.003) ✓, Feature B (p=0.032) ✗, Feature C (p=0.41) ✗.
Decision: Ship A confidently. Extend B's test for more power (it would pass BH but not Bonferroni — the risk tolerance decision is the team's). Kill C.
```python
# Multiple Testing Correction Workflow
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# 4 variants × 3 metrics = 12 tests (3 raw p-values below 0.05)
p_values = [0.003, 0.008, 0.032, 0.060, 0.089, 0.12,
            0.23, 0.34, 0.41, 0.55, 0.67, 0.91]

# Bonferroni correction (FWER control): threshold 0.05/12 ≈ 0.0042 → 1 rejection
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni rejections:", sum(reject_bonf))

# Benjamini-Hochberg (FDR control) → 2 rejections
reject_bh, pvals_bh, _, _ = multipletests(p_values, method='fdr_bh')
print("BH rejections:", sum(reject_bh))

# Chi-squared omnibus test, then pairwise z-tests with BH correction
chi2, p_omnibus, dof, expected = stats.chi2_contingency(observed_table)
if p_omnibus < 0.05:
    from statsmodels.stats.proportion import proportions_ztest
    pairwise_p = []
    for variant in variants:
        z, p = proportions_ztest([variant.conv, ctrl.conv], [variant.n, ctrl.n])
        pairwise_p.append(p)
    reject, adj_p, _, _ = multipletests(pairwise_p, method='fdr_bh')
```
How to monitor experiments without inflating false positives — peeking done right.
Traditional hypothesis tests assume a fixed sample size. If you peek at results during data collection and stop when p < 0.05, the actual false positive rate can be 20-30% (far above 5%).
mSPRT (mixture Sequential Probability Ratio Test) uses a likelihood ratio that remains a valid test statistic regardless of when you look. Group sequential designs (O'Brien-Fleming, Pocock) pre-specify a small number of interim analyses with adjusted significance boundaries.
Always-valid confidence sequences provide CIs valid at every time point simultaneously. Sequential tests trade power for flexibility — typically needing 20-30% more total samples than fixed-horizon tests.
Imagine betting on coin flips. If you keep flipping until you happen to be winning and then stop, you'll appear lucky even with a fair coin. That's what "peeking" does to A/B tests.
Sequential testing is like agreeing to rules BEFORE you start: "I can check after every 100 flips, but I need stronger evidence early on." Early checks require overwhelming evidence (p < 0.001), while later checks relax the threshold.
This way, even though you're peeking, your overall false positive rate stays at 5%. O'Brien-Fleming is the most popular approach — very conservative early, lenient late.
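The "overwhelming evidence early, relaxed later" idea becomes concrete when per-look α spend is converted into z boundaries. The schedule below is the O'Brien-Fleming-style spend used in the Intuit case in this section (0.0001, 0.005, 0.043):

```python
from scipy import stats

# Convert per-look two-sided alpha spend into |z| rejection boundaries
alpha_spend = [0.0001, 0.005, 0.043]
boundaries = [stats.norm.isf(a / 2) for a in alpha_spend]
for a, z in zip(alpha_spend, boundaries):
    print(f"alpha spent {a}: reject if |z| > {z:.2f}")
```

Early looks demand |z| near 3.9; the final look needs only about 2.0 — close to the ordinary fixed-horizon threshold of 1.96.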
Business Context: Uber uses mSPRT for marketplace experiments where wrong decisions cost money in real-time.
Approach: New surge algorithm monitored daily with always-valid p-values. mSPRT allows stopping as soon as significance is reached without FPR inflation.
Results: Detected 1.2% improvement in driver utilization on day 5 (traditional test would require day 14). Estimated $2M additional revenue captured by shipping 9 days earlier.
Business Context: During the 12-week filing season, every week of testing = a week the winning variant isn't fully deployed.
Approach: Group sequential design with 3 interim looks (weeks 1, 2, 4) using O'Brien-Fleming bounds: α spent 0.0001, 0.005, 0.043.
Results: Feature showed strong signal at week 2 (p=0.002 < 0.005 boundary), shipped immediately — gaining 10 extra weeks of improved experience during peak season.
```python
# Sequential Testing: Why Peeking Inflates FPR
import numpy as np
from scipy import stats

def simulate_peeking_fpr(n_peeks, n_sims=10000, alpha=0.05):
    """Simulate the false positive rate when peeking at results under H0."""
    false_pos = 0
    n_per_peek = 100
    z_crit = stats.norm.ppf(1 - alpha / 2)
    for _ in range(n_sims):
        a = np.random.normal(0, 1, n_per_peek * n_peeks)
        b = np.random.normal(0, 1, n_per_peek * n_peeks)
        for peek in range(1, n_peeks + 1):
            n = peek * n_per_peek
            z = abs((a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n))
            if z > z_crit:
                false_pos += 1
                break
    return false_pos / n_sims

for peeks in [1, 2, 5, 10, 20, 50]:
    fpr = simulate_peeking_fpr(peeks)
    print(f"Peeks: {peeks:2d} → Actual FPR: {fpr:.1%}")

# Typical simulated output:
# Peeks:  1 → ~5.0%    Peeks:  5 → ~14.2%
# Peeks: 10 → ~19.3%   Peeks: 50 → ~32.1%
```
Make experiments more sensitive without collecting more data.
CUPED is a control variate method. By subtracting the predictable part of the outcome using a pre-experiment covariate X, we reduce variance by ρ²(X,Y). The adjustment is mean-zero (E[X − E[X]] = 0), so it doesn't bias the treatment effect.
The delta method approximates the variance of a function of random variables using a first-order Taylor expansion. For ratio metrics like revenue/trip: Var(Ȳ/X̄) ≈ (1/μ_X²)[Var(Y) − 2(μ_Y/μ_X)Cov(Y,X) + (μ_Y/μ_X)²Var(X)]/n.
Block bootstrap provides a non-parametric alternative when delta method's normality assumption is questionable. CUPED + delta method can be combined for ratio metrics with pre-period covariates.
CUPED is like adjusting for natural ability when grading students. If you know a student always scores 90%, seeing them score 92% in treatment is a 2-point surprise — that's the signal. Without CUPED, "92% vs 91%" is drowned in noise.
The delta method solves a different problem: what if your metric is a ratio, like revenue per trip? You can't just average the per-user ratios — users with 1 trip have volatile ratios. The delta method tells you how uncertain the overall ratio is by propagating uncertainty from numerator and denominator.
The better you can predict baseline performance (higher ρ), the more noise CUPED strips away. ρ = 0.7 removes 49% of variance; ρ = 0.9 removes 81%.
Business Context: Testing a new pricing algorithm. Primary metric: revenue per completed trip (a ratio). Naive per-user ratio averaging gives volatile estimates.
Approach: Delta method: treat as ratio of means (total revenue / total trips). SE reduced by 40% vs. naive averaging. Combined with CUPED (pre-period revenue as covariate, ρ=0.78): total variance reduction = 73%.
Takeaway: For ratio metrics, always use delta method. Naive per-user ratios are dominated by low-denominator noise.
Business Context: Pre-experiment covariate: prior-year filing completion + number of forms (ρ = 0.74).
Results: CUPED variance reduction: 55%. Experiment that would need 3 weeks raw now reaches significance in ~1.3 weeks. During peak tax season, this saved ~12 days of suboptimal UX.
Takeaway: In tax filing, timing is everything. 12 weeks of filing season × $2M/week of delayed features = CUPED pays for itself immediately.
```python
# CUPED & Delta Method
import numpy as np

# --- CUPED ---
def cuped_adjust(y_post, x_pre):
    """Adjust outcomes using a pre-experiment covariate (control variate)."""
    theta = np.cov(y_post, x_pre)[0, 1] / np.var(x_pre)
    return y_post - theta * (x_pre - np.mean(x_pre))

np.random.seed(42)
n = 10000
x_pre = np.random.normal(65, 20, n)
treatment = np.random.binomial(1, 0.5, n)
y_post = 0.7 * x_pre + treatment * 2.0 + np.random.normal(0, 15, n)

y_adj = cuped_adjust(y_post, x_pre)
print(f"Raw SE: {y_post.std()/np.sqrt(n/2):.3f}")
print(f"CUPED SE: {y_adj.std()/np.sqrt(n/2):.3f}")
print(f"Variance reduction: {1 - y_adj.var()/y_post.var():.1%}")

# --- Delta Method for Ratio Metrics ---
def delta_method_ratio_se(y, x):
    """SE of mean(y)/mean(x) via a first-order Taylor expansion."""
    n = len(y)
    mu_y, mu_x = y.mean(), x.mean()
    var_y, var_x = y.var(), x.var()
    cov_yx = np.cov(y, x)[0, 1]
    return np.sqrt((var_y - 2*(mu_y/mu_x)*cov_yx
                    + (mu_y/mu_x)**2 * var_x) / (mu_x**2 * n))
```
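The block bootstrap mentioned earlier isn't shown above; here is a minimal sketch on hypothetical user-clustered data. The key move is resampling whole users, not individual trips, so within-user correlation is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical clustered data: 200 users, 20 trips each,
# with a shared per-user effect creating within-user correlation.
user_ids = np.repeat(np.arange(200), 20)
user_effect = rng.normal(10, 2, 200)
y = user_effect[user_ids] + rng.normal(0, 1, user_ids.size)

def block_bootstrap_ci(y, clusters, n_boot=1000, seed=1):
    """Percentile CI for the mean, resampling whole clusters with replacement."""
    rng = np.random.default_rng(seed)
    groups = [y[clusters == c] for c in np.unique(clusters)]
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(groups), size=len(groups))
        boot.append(np.concatenate([groups[i] for i in idx]).mean())
    return np.percentile(boot, [2.5, 97.5])

lo, hi = block_bootstrap_ci(y, user_ids)
naive_halfwidth = 1.96 * y.std(ddof=1) / np.sqrt(y.size)
print(f"block CI width {hi - lo:.3f} vs naive iid width {2 * naive_halfwidth:.3f}")
```

The block CI comes out much wider than the naive i.i.d. CI — the honest answer when observations within a user are correlated.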
From posterior beliefs to adaptive allocation — when exploration meets exploitation.
Bayesian A/B testing replaces p-values with posterior distributions. For binary outcomes, Beta-Bernoulli conjugacy gives closed-form updates: prior Beta(α₀, β₀) + s successes in n trials → posterior Beta(α₀+s, β₀+n-s).
Compute P(θ_B > θ_A | data) via simulation or closed-form. Expected loss = E[max(θ_A - θ_B, 0)] measures the cost of choosing B if A is actually better. Ship when P(best) > threshold AND expected loss is small.
Thompson Sampling: sample from each arm's posterior, pull the arm with the highest sample. Naturally balances exploration (uncertain arms get sampled widely) and exploitation (good arms get pulled more). UCB picks the arm with the highest upper confidence bound. Contextual bandits condition on user features for personalized allocation.
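For contrast with Thompson sampling, here is a minimal UCB1 sketch (the arm probabilities are hypothetical, with wider gaps than the demo below so the winner emerges quickly):

```python
import numpy as np

def ucb1(true_probs, horizon=10000, seed=0):
    """UCB1: pull the arm with the highest empirical mean + exploration bonus."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    pulls = np.zeros(k)
    wins = np.zeros(k)
    for arm in range(k):                      # pull each arm once to initialize
        pulls[arm] += 1
        wins[arm] += rng.binomial(1, true_probs[arm])
    for t in range(k, horizon):
        ucb = wins / pulls + np.sqrt(2 * np.log(t) / pulls)
        arm = int(np.argmax(ucb))             # optimism in the face of uncertainty
        pulls[arm] += 1
        wins[arm] += rng.binomial(1, true_probs[arm])
    return pulls

pulls = ucb1([0.05, 0.10, 0.20])
print("pulls per arm:", pulls)
```

The bonus √(2 ln t / nᵢ) shrinks as an arm accumulates pulls, so uncertain arms keep getting tried until the evidence rules them out.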
Frequentist testing asks: "Is B better? Yes or no." Bayesian asks: "HOW MUCH better is B, and how sure am I?" You start with a belief (prior), update with data, and get a posterior that directly says "94% chance B is better by 2-5%."
Bandits go further: instead of 50/50 traffic split the whole time, they gradually send more users to the winning variant. Like a restaurant trying new dishes — you taste-test, then shift to mostly serving the better one.
Key tradeoff: A/B tests give clean causal inference. Bandits optimize cumulative reward but sacrifice clean inference — you can't compute unbiased treatment effects from bandit data.
Business Context: Uber uses contextual MABs for surge pricing optimization. Context: time of day, location, demand/supply ratio, weather. Arms: different surge multiplier levels.
Results: Thompson sampling with logistic regression on context features outperforms fixed pricing rules by 4.3% in driver utilization. Bandits adapt to local conditions — surge that works in Manhattan at 6pm differs from Austin at 2am.
Tradeoff: Bandits sacrifice clean causal inference for better cumulative outcomes. For pricing, cumulative reward matters more than a clean ATE estimate.
Business Context: Testing 3 onboarding variants with Bayesian analysis. Prior: Beta(1,1) = uniform. After 10K users: P(V2 best) = 89%.
Decision rule: Ship when P(best) > 95% AND expected loss < 0.5pp. V2 reached both thresholds at 15K users — shipped 1 week earlier than a frequentist test would have allowed.
Takeaway: Bayesian analysis naturally answers "how confident are we?" without arbitrary α thresholds. Expected loss gives a direct business-relevant decision criterion.
Three arms with hidden true probabilities. Watch Thompson sampling discover the best arm.
```python
# Bayesian A/B Testing & Thompson Sampling
import numpy as np

# --- Bayesian A/B Test (Beta-Bernoulli conjugacy) ---
alpha_A, beta_A = 1 + 450, 1 + 5000 - 450   # Control: 450/5000 = 9.0%
alpha_B, beta_B = 1 + 520, 1 + 5000 - 520   # Treatment: 520/5000 = 10.4%

samples_A = np.random.beta(alpha_A, beta_A, 100_000)
samples_B = np.random.beta(alpha_B, beta_B, 100_000)
p_b_better = (samples_B > samples_A).mean()
loss_B = np.maximum(samples_A - samples_B, 0).mean()
print(f"P(B > A) = {p_b_better:.3f}")
print(f"Expected loss of choosing B: {loss_B:.5f}")

# --- Thompson Sampling ---
class ThompsonBandit:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)

    def select_arm(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

true_probs = [0.05, 0.07, 0.10]
bandit = ThompsonBandit(3)
for _ in range(10000):
    arm = bandit.select_arm()
    reward = np.random.binomial(1, true_probs[arm])
    bandit.update(arm, reward)
```
The probability distributions and core concepts that underpin everything else.
Normal: parameterized by μ, σ. 68-95-99.7 rule. Binomial: count of successes in n Bernoulli trials. Poisson: rare event counts with rate λ (mean = variance = λ). Exponential: time between events, rate λ, memoryless.
The CLT states that sample means approach normality regardless of the underlying distribution (given independence and sufficient n). This is why t-tests work even on non-normal data.
MLE finds parameters maximizing P(data|θ). For normal: μ̂ = x̄, σ̂² = Σ(xᵢ-x̄)²/n. Correlation normalizes covariance to [-1,1]. R² = ρ² for simple regression — fraction of variance explained.
Distributions are templates for randomness. Heights → normal (bell curve). Customer calls/hour → Poisson (rare events). Time between bus arrivals → exponential (longer waits less likely).
The CLT is the most important theorem in statistics: average enough of ANY data, and the averages look like a bell curve. This is why so many methods assume normality — not because raw data is normal, but because averages are.
Correlation measures how much two things move together linearly. R² tells you what percentage of one variable's variation is explained by the other. But correlation ≠ causation, and zero correlation ≠ independence (could have non-linear relationship).
Business Context: Trip durations follow a log-normal distribution (right-skewed). Mean: 18 min, median: 14 min. Using normal naively would underestimate long trip probability.
Approach: MLE fits log-normal parameters for ETA prediction. For A/B test analysis, CLT ensures mean trip duration across thousands of trips is approximately normal, even though individual trips are skewed.
Business Context: TurboTax support calls follow a Poisson process. MLE: λ_weekday ≈ 1,200/hr, λ_peak ≈ 3,400/hr. Used for staffing models.
Validation: Poisson assumption checked by verifying mean ≈ variance in each time window. Exponential inter-arrival times: mean = 3600/3400 ≈ 1.06 seconds during peak. Deviations from Poisson signal unusual behavior (e.g., a system outage creating call bursts).
```python
# Distribution Fundamentals
import numpy as np
from scipy import stats

# Normal distribution: 68-95-99.7 rule
x = np.linspace(-4, 4, 200)
pdf = stats.norm.pdf(x, 0, 1)

# Binomial: n trials, success probability p
pmf_binom = stats.binom.pmf(np.arange(21), n=20, p=0.3)
print(f"Binomial(20, 0.3): mean={20*0.3}, var={20*0.3*0.7:.1f}")

# Poisson: λ events per unit time (mean = variance!)
pmf_pois = stats.poisson.pmf(np.arange(20), mu=5)
print("Poisson(5): mean=5, var=5")

# MLE for the normal distribution
data = np.random.normal(5, 2, 1000)
mu_mle, sigma_mle = stats.norm.fit(data)

# CLT demonstration: means of skewed (exponential) data are approximately normal
means = [np.random.exponential(2, 50).mean() for _ in range(10_000)]

# Correlation and R² on two correlated variables
a = np.random.normal(0, 1, 1000)
b = 0.8 * a + np.random.normal(0, 0.6, 1000)
r = np.corrcoef(a, b)[0, 1]
print(f"Correlation: {r:.3f}, R²: {r**2:.3f}")
```
From linear models to regularization — the workhorses of predictive analytics.
Linear regression minimizes squared residuals. Assumptions: linearity, independence, homoscedasticity, normal residuals. Check via: residual plots, Q-Q plots, VIF (> 10 = multicollinearity), Cook's distance (outliers).
Logistic regression models log-odds as linear: log(p/(1-p)) = Xβ. Uses MLE. Output is probability via sigmoid σ(z) = 1/(1+e^{-z}).
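A tiny illustration of the log-odds-to-probability mapping (the intercept and coefficient below are hypothetical, not from any fitted model):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical model: log-odds = -2.0 + 0.8 * feature
log_odds = -2.0 + 0.8 * 1.5    # linear in the features
p = sigmoid(log_odds)          # squashed back into (0, 1)
odds = p / (1 - p)             # recovering odds: log(odds) equals log_odds
print(f"p = {p:.3f}")
```

This is why logistic coefficients are read multiplicatively: a one-unit increase in the feature multiplies the odds by e^0.8 ≈ 2.23.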
Ridge (L2) shrinks all coefficients toward zero — good when all features are relevant. Lasso (L1) sets some coefficients exactly to zero — performs feature selection. Elastic Net combines both. Tune λ via cross-validation.
Evaluate regression with R² (adjusted R²); classification with ROC/AUC, precision, recall, F1. k-fold cross-validation gives honest out-of-sample performance.
Linear regression = drawing the best straight line through a scatter plot. "Best" = minimizes total prediction error squared.
Logistic regression predicts probabilities using an S-shaped curve (sigmoid) that squishes any number into 0-1. "What's the chance this customer churns?"
Regularization = adding a penalty for complexity. Ridge says "keep all coefficients small" (budget for each feature). Lasso says "some features don't matter — set them to zero." Use Ridge when all features contribute; Lasso when you suspect many are noise.
Model: Logistic regression with L1 (Lasso) to select top predictors from 50+ candidates. AUC = 0.81.
Top predictors: Declining trip frequency (β=-1.2), earnings below city median (β=0.8), account age < 90 days (β=0.6). 12 features retained out of 50+.
Usage: Model triggers retention interventions (bonus offers, personalized messages) for high-risk drivers.
Model: Ridge regression (L2) because all features are meaningful and correlated. R² = 0.73.
Diagnostics: Residual analysis revealed heteroscedasticity (variance increases with return complexity) — fixed with log-transform of revenue.
Usage: LTV modeling and staffing allocation. Knowing expected revenue per return informs which users to prioritize for expert assistance.
```python
# Regression Diagnostics Pipeline
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Linear regression with diagnostics
model = LinearRegression().fit(X_train, y_train)
residuals = y_test - model.predict(X_test)
_, p_normal = stats.shapiro(residuals[:500])  # normality check on residuals

# Multicollinearity (VIF > 10 = problem)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X_train, i) for i in range(X_train.shape[1])]

# Logistic regression with L1 penalty (feature selection)
log_model = LogisticRegression(penalty='l1', C=1.0, solver='saga')
log_model.fit(X_train, y_train)
auc = roc_auc_score(y_test, log_model.predict_proba(X_test)[:, 1])

# Regularization comparison
for name, m in [("Ridge", Ridge(alpha=1.0)), ("Lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(m, X_train, y_train, cv=5, scoring='r2')
    m.fit(X_train, y_train)
    n_zero = (np.abs(m.coef_) < 1e-6).sum()
    print(f"{name}: R²={scores.mean():.3f}, zero coefs={n_zero}")
```
From single decision trees to XGBoost — the most powerful off-the-shelf predictors.
Decision trees partition features via greedy recursive binary splitting on Gini impurity or information gain. Pruning (cost-complexity) prevents overfitting.
Random Forest: bootstrap samples + random feature subsets → many decorrelated trees → average. Reduces variance via bagging. OOB error gives free cross-validation.
Gradient Boosting: sequentially fit trees to negative gradient (residuals). Learning rate η controls step size. XGBoost adds L1/L2 regularization on leaf weights, second-order gradients, efficient sparse handling. AdaBoost: sequential stumps where misclassified samples get upweighted.
A decision tree plays 20 questions: "Income > $50K? → Age > 35? → Predict: will buy." Simple but overfits.
Random Forest is "wisdom of crowds" — 500 mediocre trees vote, and their average is excellent because errors cancel out.
Gradient Boosting is an iterative coach: first model predicts, second corrects the first's mistakes, third corrects what's still wrong. Each is weak but the team is strong. XGBoost = gradient boosting with all the engineering optimizations — the Kaggle king.
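The "each model corrects the previous one's mistakes" loop fits in a few lines. A regression sketch on synthetic data: for squared loss, the negative gradient is just the residual, so each round fits a shallow tree to the residuals and takes a small step of size η:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)

eta = 0.1                          # learning rate: shrink each tree's contribution
pred = np.full_like(y, y.mean())   # start from a constant prediction
for _ in range(100):
    residuals = y - pred                          # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += eta * tree.predict(X)                 # small corrective step

print(f"train MSE after boosting: {np.mean((y - pred) ** 2):.4f}")
```

XGBoost replaces this first-order loop with second-order (Hessian-weighted) steps plus L1/L2 penalties on leaf weights, but the residual-fitting skeleton is the same.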
Features: Device fingerprint, GPS jitter, payment method age, trip pattern anomaly score. XGBoost with class weights (fraud ≈ 0.5% of trips).
Results: AUC=0.96, precision=0.82 at recall=0.90. Top feature importance: GPS jitter (21%), device mismatch (18%), payment velocity (15%). Beats logistic regression (AUC=0.88) and RF (AUC=0.93).
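With fraud at ~0.5% of trips, unweighted training collapses to "predict not-fraud." A minimal sketch of class weighting on synthetic data with the same base rate — using scikit-learn's `class_weight='balanced'`; in XGBoost the analogous knob is `scale_pos_weight = n_negative / n_positive` (the case-study features and numbers above are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# ~0.5% positives, mimicking the fraud base rate (synthetic data)
X, y = make_classification(n_samples=20000, weights=[0.995], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' reweights each class inversely to its frequency
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=0)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC: {auc:.3f}")
```

Note that AUC is threshold-free; for the precision-at-recall numbers quoted above you would sweep the decision threshold on the predicted probabilities.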
Task: Predict which TurboTax tier (Free/Deluxe/Premier/SE) each user needs. Random Forest chosen for interpretability.
Results: OOB accuracy: 78%. Feature importance: prior year forms (32%), W-2 count (18%), investment income flag (15%). Routes users to the right product during onboarding — reducing downgrades by 12%.
| Model | Approach | Strengths | Weaknesses | When to Use |
|---|---|---|---|---|
| Decision Tree | Single recursive split | Interpretable, fast | Overfits, high variance | Explainability needed |
| Random Forest | Bagging + feature randomness | Robust, parallel, OOB error | Less interpretable, memory | Default first try |
| Gradient Boosting | Sequential residual fitting | Most accurate | Slow, overfits if not tuned | Accuracy matters most |
| XGBoost | GB + regularization | Fast, regularized, sparse | Many hyperparams | Production, large data |
| AdaBoost | Sequential weighted stumps | Simple, less overfitting | Sensitive to noise | Quick baseline |
```python
# Tree-Based Models Comparison
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=10),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f"{name}: AUC={scores.mean():.3f} ± {scores.std():.3f}")

# Feature importance (impurity-based; biased toward high-cardinality features)
rf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda x: -x[1])[:5]:
    print(f"  {name}: {imp:.3f}")
```
Unsupervised learning, anomaly detection, and common interview gotchas.
K-means: iteratively assign points to nearest centroid, recompute centroids. Sensitive to init (use k-means++) and k (elbow/silhouette). Hierarchical: agglomerative merging via linkage criteria (Ward's, complete). DBSCAN: density-based, handles arbitrary shapes, no k needed but sensitive to eps/minPts.
PCA: eigendecomposition of the covariance matrix → orthogonal directions of maximum variance. The first PC captures the most variance; each subsequent PC captures the most of what remains, subject to orthogonality with the previous ones. Always standardize first.
Special topics: Simpson's paradox (aggregate trend reverses at segment level). DAGs: confounders SHOULD be conditioned on (closes the backdoor path); colliders should NOT be (conditioning on one induces a spurious association). KNN: non-parametric, lazy learner. Naive Bayes: assumes conditional feature independence. SVM: kernel trick for non-linear boundaries.
K-means = assigning students to study groups: find group centers, assign each to nearest, move centers, repeat. DBSCAN = finding crowds in a park: dense areas are clusters, loners are outliers.
PCA = rotating your head to find the angle that shows the most variation in a 3D scatter plot, then taking a photo from that angle for a 2D picture with maximum information.
Simpson's Paradox: Berkeley admitted more men overall, but within each department women had equal or higher rates — women applied more to competitive departments. Always segment before concluding.
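A toy version of the Berkeley pattern (hypothetical counts, chosen only to reproduce the reversal) makes the mechanism concrete: women apply mostly to the hard department, so aggregation favors men even though women win within every department:

```python
# dept: ((men applied, men admitted), (women applied, women admitted))
data = {
    "Easy dept": ((800, 560), (100, 80)),   # women: 80% vs men: 70%
    "Hard dept": ((200, 40), (900, 270)),   # women: 30% vs men: 20%
}

tm = ta = wm = wa = 0
for dept, ((m_app, m_adm), (w_app, w_adm)) in data.items():
    print(f"{dept}: men {m_adm / m_app:.0%}, women {w_adm / w_app:.0%}")
    tm += m_app; ta += m_adm
    wm += w_app; wa += w_adm

# Aggregation reverses the picture: men look favored overall
print(f"Overall: men {ta / tm:.0%}, women {wa / wm:.0%}")
```

Women's rate is higher in both departments, yet the pooled rate is 60% for men vs 35% for women — the segment sizes, not the treatment, drive the aggregate.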
Approach: 5M riders. 15 features → PCA to 5 components (82% variance). K-means k=6 via silhouette score.
Segments: "Power Commuters" (high freq, short distance), "Weekend Warriors" (low freq, long distance), "Price Sensitives" (cancel at surge > 1.5×), etc. Each gets different marketing.
Anomaly Detection: Rolling z-scores on filing speed, deduction amounts, refund-to-income ratios. |z| > 3 flags for review. DBSCAN clusters suspicious returns sharing patterns.
Simpson's Paradox: Overall CSAT dropped QoQ, but within each tier it went UP — more users shifted to Free tier (lower baseline CSAT). Segmentation revealed the true story.
Confounder Z → X and Z → Y: CONTROL for Z (include in regression).
Mediator X → M → Y: Do NOT control for M if you want total effect of X.
Collider X → C ← Y: Do NOT condition on C — creates spurious association. Example: "Disease → Hospitalization ← Injury." In hospital data, disease and injury appear correlated — but only because both cause hospitalization.
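Collider bias is easy to simulate. Below, disease and injury are generated independently, and a hypothetical rule sends anyone with either to the hospital; restricting to hospitalized patients manufactures a strong negative correlation out of nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
disease = rng.random(n) < 0.1   # two independent causes
injury = rng.random(n) < 0.1
hospital = disease | injury     # collider: either cause triggers hospitalization

# Unconditional: disease and injury are (nearly) uncorrelated
corr_all = np.corrcoef(disease, injury)[0, 1]

# Condition on the collider: spurious negative association appears
corr_hosp = np.corrcoef(disease[hospital], injury[hospital])[0, 1]
print(f"corr overall: {corr_all:+.3f}, corr among hospitalized: {corr_hosp:+.3f}")
```

Intuitively: among hospitalized patients, knowing someone has no disease makes injury more likely — something had to put them there. This is exactly why hospital-only datasets can show "protective" effects that vanish in the general population.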
Painted Door Experiments: Show a button for a feature that doesn't exist. Measure clicks. If high → demand exists. Intuit: "AI Tax Advisor" button shown to 5%, 23% clicked → validated demand before 3-month build.
Learning Effects: Novelty (users try new thing, engagement inflates then fades) vs. change aversion (engagement dips then recovers). Test on new users only to separate effects.
Imbalanced Datasets: SMOTE (oversample minority in feature space), Tomek links (undersample majority near boundary), cost-sensitive learning. Never evaluate on resampled data.
```python
# Clustering, PCA & Special Topics
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced {X.shape[1]} → {X_pca.shape[1]} components")

# K-Means: pick k via silhouette
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10).fit(X_pca)
    print(f"k={k}: silhouette={silhouette_score(X_pca, km.labels_):.3f}")

# DBSCAN (label -1 marks outliers, not a cluster)
db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_pca)
n_clusters = len(set(db)) - (1 if -1 in db else 0)
print(f"Clusters: {n_clusters}, Outliers: {(db == -1).sum()}")

# Imbalanced dataset handling
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)
# Never evaluate on resampled data!
```