Interactive study guide consolidating interview-level concepts — from hypothesis testing to XGBoost, with case studies at Uber & Intuit.
| Section | Category | Key Topics |
|---|---|---|
| 1. Hypothesis Testing | Foundations | t-test, z-test, Mann-Whitney, p-values, Type I/II error |
| 2. Confidence Intervals | Foundations | t-based CI, proportion CI, Wilson score, bootstrap CI |
| 3. Power & Sample Size | Experimental Design | Power analysis, Cohen's d, MDE, sample size calculation |
| 4. Multiple Testing | Experimental Design | Bonferroni, Benjamini-Hochberg, ANOVA, chi-squared |
| 5. Sequential Testing | Advanced Experimentation | mSPRT, always-valid p-values, O'Brien-Fleming, peeking |
| 6. Variance Reduction | Advanced Experimentation | CUPED, delta method, block bootstrap |
| 7. Bayesian & Bandits | Advanced Experimentation | Posterior inference, Thompson sampling, UCB, contextual bandits |
| 8. Distributions | Foundations | Normal, binomial, Poisson, exponential, CLT, MLE |
| 9. Regression & Core ML | Machine Learning | Linear/logistic regression, Ridge, Lasso, ROC/AUC |
| 10. Trees & Ensembles | Machine Learning | Decision trees, Random Forest, XGBoost, AdaBoost |
| 11. Clustering & Special | ML & Special Topics | K-means, PCA, DBSCAN, DAGs, Simpson's paradox, imbalanced data |
Hypothesis testing, confidence intervals, power analysis, distributions — the building blocks of statistical inference that every DS interview tests.
Multiple testing correction, sequential tests, variance reduction — running experiments correctly at scale, the way Uber and Intuit do it.
Bayesian inference, multi-armed bandits, causal reasoning — going beyond basic A/B testing into adaptive and observational methods.
Regression, tree-based models, clustering, dimensionality reduction — predictive modeling from linear models to XGBoost.
The bedrock of statistical inference — from null hypotheses to non-parametric alternatives.
Hypothesis testing is a decision framework. We assume H₀ (no effect) and compute P(data | H₀). The p-value is the probability of observing results as extreme as ours if H₀ is true. If p < α (typically 0.05), we reject H₀.
The t-statistic measures the signal-to-noise ratio — how many standard errors the observed difference is from zero. Welch's t-test (`ttest_ind` with `equal_var=False` in scipy) doesn't assume equal variances and is a safe default. For non-normal data or small samples, Mann-Whitney U provides a non-parametric alternative that compares rank orderings rather than means.
Always check assumptions: independence (each observation in one group only), approximate normality (Shapiro-Wilk test, Q-Q plots), or sufficient n for CLT (n > 30). The z-test for proportions uses SE = √(p(1-p)/n) where variance is fully determined by p itself — no need to estimate it from the sample.
Imagine flipping a coin 100 times and getting 60 heads. Is the coin unfair, or did you just get lucky? That's hypothesis testing — we start by assuming nothing special is happening (the null hypothesis) and ask "how surprising is this result?"
The p-value answers: "If the coin WERE fair, what's the chance of seeing 60+ heads?" If that chance is tiny (< 5%), we conclude the coin is probably unfair.
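The coin example can be computed directly with scipy's binomial distribution — a quick sketch of the p-value calculation described above:

```python
from scipy import stats

# P(60 or more heads in 100 flips of a fair coin)
p_one_sided = stats.binom.sf(59, 100, 0.5)   # P(X >= 60) ≈ 0.028
p_two_sided = 2 * p_one_sided                # doubling, since the null is symmetric
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

At the conventional α = 0.05, the one-sided result is "surprising" but the two-sided one is borderline — which is why the sidedness of the test should be chosen before looking at the data.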
The t-test applies this logic to comparing two groups — it asks whether the difference between group averages is too large to be explained by random variation alone. Key insight: a significant p-value means "unlikely under the null." It does NOT tell you the effect is large or practically meaningful.
Business Context: Testing whether Smart Match v2 (considering driver-rider compatibility) improves ride completion rate. Two-sample t-test on completion rates between treatment and control.
Setup: n=400K riders, 2-week test. Welch's t-test used because variance differs across rider segments (power commuters vs. casual users).
Results: Treatment mean 0.31pp higher, t=1.35, p=0.18. Supplemented with Mann-Whitney U (p=0.21) — both tests agree: not significant without variance reduction. This motivated applying CUPED (Section 6), which reduced SE enough to reach significance.
Takeaway: When both parametric and non-parametric tests agree, the conclusion is robust. The effect was real but masked by noise — a common scenario in marketplace experiments.
Business Context: Testing whether a redesigned prompt increases default browser switching rate. Binary outcome (switched / didn't).
Setup: Z-test for proportions since the outcome is binary. p_treatment=12.3%, p_control=11.8%. Assumptions: np ≥ 5 and n(1-p) ≥ 5 ✓, random assignment ✓.
Results: z=1.89, p=0.059. Not significant at α=0.05 but close. Team runs Mann-Whitney as robustness check, decides to extend test 1 more week rather than ship on a borderline result.
Takeaway: Borderline results (0.05 < p < 0.10) warrant extending the test, not making a binary ship/no-ship decision. Always compute effect size alongside the p-value.
```python
# Two-Sample Hypothesis Testing Workflow
import numpy as np
from scipy import stats

# Step 1: Explore data
control = df[df['group'] == 'control']['metric'].values
treatment = df[df['group'] == 'treatment']['metric'].values
print(f"Control: mean={control.mean():.3f}, std={control.std():.3f}, n={len(control)}")
print(f"Treatment: mean={treatment.mean():.3f}, std={treatment.std():.3f}, n={len(treatment)}")

# Step 2: Check normality (Shapiro is overly sensitive at large n, so subsample)
_, p_shapiro = stats.shapiro(control[:500])
print(f"Shapiro-Wilk p={p_shapiro:.4f}")

# Step 3: Welch's t-test (equal_var=False — doesn't assume equal variance)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"Welch's t-test: t={t_stat:.3f}, p={p_value:.4f}")

# Step 4: Non-parametric alternative
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative='two-sided')
print(f"Mann-Whitney U: U={u_stat:.0f}, p={p_mw:.4f}")

# Step 5: Effect size (Cohen's d)
pooled_std = np.sqrt((control.std()**2 + treatment.std()**2) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_std
print(f"Cohen's d: {cohens_d:.3f}")

# Step 6: Z-test for proportions (binary outcomes)
from statsmodels.stats.proportion import proportions_ztest
z_stat, p_prop = proportions_ztest([conv_treat, conv_ctrl], [n_treat, n_ctrl])
```
Quantifying uncertainty — how precise is your estimate?
A confidence interval provides a range of plausible values for an unknown parameter. The frequentist interpretation: if we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter.
For means, SE = s/√n and we use the t-distribution (converges to z for large n). For proportions, SE = √(p(1-p)/n) and variance is fully determined by p, so we use z directly.
The Wilson score interval improves on the normal approximation for extreme proportions (p < 0.01 or p > 0.99) or small n — it respects the [0,1] boundary and is asymmetric. Bootstrap CIs (resample 10K times, take percentiles) require no distributional assumptions and work for any statistic.
Think of a confidence interval as a "net" you cast to catch the true value. A 95% CI means if you went fishing 100 times with the same net, you'd catch the fish 95 times.
The net's width depends on two things: how noisy your data is (more noise = wider net) and how much data you have (more data = narrower net).
A CI that doesn't include zero for a treatment effect means "we're confident there IS an effect." A wide CI means "there's probably an effect, but we're not sure how big." If the CI is entirely above zero but narrow — that's the best case: confident effect with a precise estimate.
Business Context: Estimating price elasticity of demand for rides. Point estimate: -0.35 (a 10% price increase reduces demand by 3.5%).
Results: 95% CI: [-0.42, -0.28]. CI is narrow because n=2M rides. The entire CI is negative, confirming demand falls with price.
Business Impact: At the CI lower bound (-0.42), a 20% surge still reduces demand by only 8.4%, justifying current surge levels. The narrow CI gave leadership confidence to maintain the pricing strategy.
Business Context: Measuring completion rate for a new W-2 import flow.
Results: Treatment: 73.2% (95% CI: [72.1%, 74.3%]). Control: 71.4% (95% CI: [70.3%, 72.5%]). CIs overlap slightly, but z-test on the difference shows p=0.008.
Key Lesson: Overlapping CIs do NOT mean no significant difference — the CI on the difference is [0.5pp, 3.1pp], entirely above zero. Always test the difference directly.
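The lesson above can be verified numerically. This sketch reuses the case study's rates with a hypothetical n = 5,000 per arm (the actual sample size isn't stated): the individual CIs overlap, yet the test on the difference is significant.

```python
import numpy as np
from scipy import stats

p1, p2, n = 0.732, 0.714, 5000   # hypothetical n per arm, for illustration

# Individual 95% CIs (normal approximation)
ci1 = (p1 - 1.96 * np.sqrt(p1*(1-p1)/n), p1 + 1.96 * np.sqrt(p1*(1-p1)/n))
ci2 = (p2 - 1.96 * np.sqrt(p2*(1-p2)/n), p2 + 1.96 * np.sqrt(p2*(1-p2)/n))
overlap = ci1[0] < ci2[1]        # lower end of treatment CI sits below upper end of control CI

# Pooled z-test on the difference
p_pool = (p1 * n + p2 * n) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p1 - p2) / se
p_value = 2 * stats.norm.sf(z)   # significant despite the overlapping CIs
print(f"overlap={overlap}, p={p_value:.4f}")
```

The intuition: the SE of a difference is smaller than the sum of the individual CI half-widths, so "CIs touch" is a stricter criterion than the actual test.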
```python
# Confidence Intervals — Means vs Proportions
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportion_confint

# CI for a difference in means (1.96 is the z critical value; ≈ t for large n)
mean_diff = treatment.mean() - control.mean()
se_diff = np.sqrt(treatment.var()/len(treatment) + control.var()/len(control))
ci_t = (mean_diff - 1.96 * se_diff, mean_diff + 1.96 * se_diff)

# CI for proportions (normal approximation)
p1, n1 = 0.732, 5000
se1 = np.sqrt(p1 * (1 - p1) / n1)
ci_normal = (p1 - 1.96 * se1, p1 + 1.96 * se1)

# Wilson score interval (better for small n or extreme p)
ci_wilson = proportion_confint(int(p1 * n1), n1, alpha=0.05, method='wilson')

# Bootstrap CI (works for any statistic; data is a 1-D array of the metric)
boot_means = [np.random.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]
ci_boot = np.percentile(boot_means, [2.5, 97.5])
```
How big a sample do you need to detect an effect? The most important pre-experiment question.
Power = P(reject H₀ | H₁ is true) = 1 - β. It depends on four quantities: α (significance level), n (sample size), σ (noise), and δ (true effect size).
Pre-experiment, you fix α (usually 0.05), target power (usually 0.80), estimate σ from historical data, and choose MDE (minimum detectable effect) from the business — then solve for n. Cohen's d = δ/σ standardizes effect size: 0.2 = small, 0.5 = medium, 0.8 = large.
For proportions, variance is p(1-p), so metrics near 50% need larger samples than those near 5% or 95%. Variance reduction techniques like CUPED effectively increase power without more data.
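Both claims in this section — variance p(1-p) peaks at 50%, and sample size scales with 1/MDE² — can be checked with the standard normal-approximation formula (baseline rates and MDEs below are illustrative):

```python
import numpy as np
from scipy import stats

def n_per_group(p, mde, alpha=0.05, power=0.80):
    """Per-group n for a two-sample z-test on a proportion metric."""
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return int(np.ceil(2 * p * (1 - p) * (z / mde) ** 2))

# Same 2pp MDE: a 50% baseline needs far more data than a 5% baseline
print(n_per_group(0.50, 0.02))
print(n_per_group(0.05, 0.02))

# Halving the MDE quadruples the required sample
ratio = n_per_group(0.45, 0.01) / n_per_group(0.45, 0.02)
print(f"ratio: {ratio:.2f}")
```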
Imagine trying to hear someone whisper in a noisy room. Power is the chance you'll actually hear them. To increase that chance: have them whisper louder (bigger effect size), quiet the room down (less variance), listen longer (more data), or be more willing to say you heard something (higher α, at the cost of false alarms).
Power analysis tells you: "How long do we need to run this test before we can reliably detect the improvement we care about?" An underpowered test is a waste — you run it but can't detect the effect.
Business Context: New ML model expected to reduce ETA error by 3%. Historical σ = 12%, so Cohen's d = 0.25 (small effect).
Power Analysis: At α=0.05, power=0.80: need n=252 per group. With city-level clustering (design effect ≈ 1.5), effective n = 378 per group. Test duration: 3 days at current traffic.
Takeaway: Small effects require careful planning. The clustering adjustment (design effect) nearly doubled the required sample — ignoring it would have led to an underpowered test.
Business Context: Testing whether an email nudge increases filing starts by 2pp (from 45% to 47%).
Power Analysis: σ = √(0.45×0.55) ≈ 0.497, so Cohen's d = 0.02/0.497 ≈ 0.040. At α=0.05, power=0.80: n ≈ 9,700 per group → 1 week. Team considers MDE=1pp for more sensitivity → n ≈ 38,900 per group → 4 weeks.
Takeaway: Halving the MDE quadruples the sample size. The team had to trade sensitivity for speed during the 12-week tax season.
```python
# Power Analysis & Sample Size Calculation
import numpy as np
from scipy import stats

def required_sample_size(alpha, power, effect_size):
    """Required n per group for a two-sample t-test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))

# Example: detect a 2pp lift in filing completion
baseline_rate = 0.45
mde = 0.02
sigma = np.sqrt(baseline_rate * (1 - baseline_rate))
cohens_d = mde / sigma
n_required = required_sample_size(0.05, 0.80, cohens_d)
print(f"Cohen's d: {cohens_d:.3f}")
print(f"Required n per group: {n_required:,}")

# Using statsmodels (more precise: solves the t-distribution exactly)
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
n_precise = analysis.solve_power(effect_size=cohens_d, alpha=0.05, power=0.80)
print(f"Statsmodels n per group: {int(np.ceil(n_precise)):,}")
```
When you test many hypotheses, false positives accumulate — here's how to control them.
With m independent tests at α=0.05, P(≥1 false positive) = 1 - (1-α)^m. For m=20, that's 64%. Bonferroni divides α by m — simple but very conservative, especially for large m.
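The accumulation formula is worth checking once by hand:

```python
# P(at least one false positive) = 1 - (1 - alpha)^m for m independent tests
alpha = 0.05
for m in [1, 5, 20, 100]:
    print(f"m={m:3d}: FWER = {1 - (1 - alpha) ** m:.3f}")
```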
Benjamini-Hochberg (BH) controls the False Discovery Rate: the expected proportion of rejected hypotheses that are false. Procedure: rank p-values, compare each p_(i) to (i/m)·α, reject all up to the largest that passes. BH is uniformly more powerful than Bonferroni.
Use Bonferroni for few high-stakes tests (2-3 variants). Use BH for many tests (feature screening, multi-metric dashboards). Always correct across ALL tests you ran, not just the ones with small p-values.
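The BH step-up rule described above takes only a few lines from scratch. The p-values below are illustrative, chosen so that 3 are raw-significant, Bonferroni keeps 1, and BH keeps 2:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest
    rank i such that p_(i) <= (i/m) * q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= (np.arange(1, m + 1) / m) * q
    k = int(np.max(np.nonzero(passed)[0]) + 1) if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.003, 0.008, 0.032, 0.060, 0.089, 0.12,
         0.23, 0.34, 0.41, 0.55, 0.67, 0.91]
print("BH rejections:", benjamini_hochberg(pvals).sum())          # 2
print("Bonferroni rejections:", (np.array(pvals) < 0.05/12).sum())  # 1
```

Note the step-up subtlety: 0.008 is rejected because it clears its rank-2 threshold (2/12 × 0.05 ≈ 0.0083), even though it fails the Bonferroni cutoff.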
Imagine checking 20 different metrics for an A/B test. Even if the treatment does nothing, you'd expect 1 metric to show p < 0.05 by pure chance (20 × 0.05 = 1). That's the multiple testing problem.
Bonferroni is the strict parent: "You can only claim significance if a result would be surprising even accounting for ALL the doors you opened."
BH is the pragmatic boss: "I accept that some findings might be false, but I want to keep that proportion low." In practice, BH finds more true effects while keeping the false discovery rate at 5%.
Business Context: Testing 4 new UI variants for the ride request flow, measuring CTR, completion rate, and time-to-book. 4 variants × 3 metrics = 12 tests.
Results: Without correction: 3 show p < 0.05. Bonferroni (α = 0.05/12 = 0.0042): only 1 survives. BH (FDR = 0.05): 2 survive.
Decision: Team uses BH as primary analysis, confirms surviving results with a focused follow-up test. The test that passed Bonferroni gets immediate rollout.
Business Context: Launching 3 features simultaneously, each with its own A/B test. Primary metric: filing completion.
Results: With Bonferroni threshold 0.05/3 = 0.017: Feature A (p=0.003) ✓, Feature B (p=0.032) ✗, Feature C (p=0.41) ✗.
Decision: Ship A confidently. Extend B's test for more power (it would pass BH but not Bonferroni — the risk tolerance decision is the team's). Kill C.
```python
# Multiple Testing Correction Workflow
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# 4 variants × 3 metrics = 12 tests (3 raw p-values below 0.05)
p_values = [0.003, 0.008, 0.032, 0.060, 0.089, 0.12,
            0.23, 0.34, 0.41, 0.55, 0.67, 0.91]

# Bonferroni correction (FWER control): threshold 0.05/12 ≈ 0.0042 → 1 rejection
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni rejections:", sum(reject_bonf))

# Benjamini-Hochberg (FDR control) → 2 rejections
reject_bh, pvals_bh, _, _ = multipletests(p_values, method='fdr_bh')
print("BH rejections:", sum(reject_bh))

# Chi-squared omnibus test, then pairwise z-tests with BH correction
chi2, p_omnibus, dof, expected = stats.chi2_contingency(observed_table)
if p_omnibus < 0.05:
    from statsmodels.stats.proportion import proportions_ztest
    pairwise_p = []
    for variant in variants:
        z, p = proportions_ztest([variant.conv, ctrl.conv], [variant.n, ctrl.n])
        pairwise_p.append(p)
    reject, adj_p, _, _ = multipletests(pairwise_p, method='fdr_bh')
```
How to monitor experiments without inflating false positives — peeking done right.
Traditional hypothesis tests assume a fixed sample size. If you peek at results during data collection and stop when p < 0.05, the actual false positive rate can be 20-30% (far above 5%).
mSPRT (mixture Sequential Probability Ratio Test) uses a likelihood ratio that remains a valid test statistic regardless of when you look. Group sequential designs (O'Brien-Fleming, Pocock) pre-specify a small number of interim analyses with adjusted significance boundaries.
Always-valid confidence sequences provide CIs valid at every time point simultaneously. Sequential tests trade power for flexibility — typically needing 20-30% more total samples than fixed-horizon tests.
Imagine betting on coin flips. If you keep flipping until you happen to be winning and then stop, you'll appear lucky even with a fair coin. That's what "peeking" does to A/B tests.
Sequential testing is like agreeing to rules BEFORE you start: "I can check after every 100 flips, but I need stronger evidence early on." Early checks require overwhelming evidence (p < 0.001), while later checks relax the threshold.
This way, even though you're peeking, your overall false positive rate stays at 5%. O'Brien-Fleming is the most popular approach — very conservative early, lenient late.
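The "overwhelming evidence early, relaxed later" idea becomes concrete when per-look α spend is converted into z boundaries. The schedule below is the O'Brien-Fleming-style spend used in the Intuit case in this section (0.0001, 0.005, 0.043):

```python
from scipy import stats

# Convert per-look two-sided alpha spend into |z| rejection boundaries
alpha_spend = [0.0001, 0.005, 0.043]
boundaries = [stats.norm.isf(a / 2) for a in alpha_spend]
for a, z in zip(alpha_spend, boundaries):
    print(f"alpha spent {a}: reject if |z| > {z:.2f}")
```

Early looks demand |z| near 3.9; the final look needs only about 2.0 — close to the ordinary fixed-horizon threshold of 1.96.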
Business Context: Uber uses mSPRT for marketplace experiments where wrong decisions cost money in real-time.
Approach: New surge algorithm monitored daily with always-valid p-values. mSPRT allows stopping as soon as significance is reached without FPR inflation.
Results: Detected 1.2% improvement in driver utilization on day 5 (traditional test would require day 14). Estimated $2M additional revenue captured by shipping 9 days earlier.
Business Context: During the 12-week filing season, every week of testing = a week the winning variant isn't fully deployed.
Approach: Group sequential design with 3 interim looks (weeks 1, 2, 4) using O'Brien-Fleming bounds: α spent 0.0001, 0.005, 0.043.
Results: Feature showed strong signal at week 2 (p=0.002 < 0.005 boundary), shipped immediately — gaining 10 extra weeks of improved experience during peak season.
```python
# Sequential Testing: Why Peeking Inflates FPR
import numpy as np
from scipy import stats

def simulate_peeking_fpr(n_peeks, n_sims=10000, alpha=0.05):
    """Simulate the false positive rate when peeking at results under H0."""
    false_pos = 0
    n_per_peek = 100
    z_crit = stats.norm.ppf(1 - alpha / 2)
    for _ in range(n_sims):
        a = np.random.normal(0, 1, n_per_peek * n_peeks)
        b = np.random.normal(0, 1, n_per_peek * n_peeks)
        for peek in range(1, n_peeks + 1):
            n = peek * n_per_peek
            z = abs((a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n))
            if z > z_crit:
                false_pos += 1
                break
    return false_pos / n_sims

for peeks in [1, 2, 5, 10, 20, 50]:
    fpr = simulate_peeking_fpr(peeks)
    print(f"Peeks: {peeks:2d} → Actual FPR: {fpr:.1%}")

# Typical simulated output:
# Peeks:  1 → ~5.0%    Peeks:  5 → ~14.2%
# Peeks: 10 → ~19.3%   Peeks: 50 → ~32.1%
```
Make experiments more sensitive without collecting more data.
CUPED is a control variate method. By subtracting the predictable part of the outcome using a pre-experiment covariate X, we reduce variance by ρ²(X,Y). The adjustment is mean-zero (E[X − E[X]] = 0), so it doesn't bias the treatment effect.
The delta method approximates the variance of a function of random variables using a first-order Taylor expansion. For ratio metrics like revenue/trip: Var(Ȳ/X̄) ≈ (1/μ_X²)[Var(Y) − 2(μ_Y/μ_X)Cov(Y,X) + (μ_Y/μ_X)²Var(X)]/n.
Block bootstrap provides a non-parametric alternative when delta method's normality assumption is questionable. CUPED + delta method can be combined for ratio metrics with pre-period covariates.
CUPED is like adjusting for natural ability when grading students. If you know a student always scores 90%, seeing them score 92% in treatment is a 2-point surprise — that's the signal. Without CUPED, "92% vs 91%" is drowned in noise.
The delta method solves a different problem: what if your metric is a ratio, like revenue per trip? You can't just average the per-user ratios — users with 1 trip have volatile ratios. The delta method tells you how uncertain the overall ratio is by propagating uncertainty from numerator and denominator.
The better you can predict baseline performance (higher ρ), the more noise CUPED strips away. ρ = 0.7 removes 49% of variance; ρ = 0.9 removes 81%.
Business Context: Testing a new pricing algorithm. Primary metric: revenue per completed trip (a ratio). Naive per-user ratio averaging gives volatile estimates.
Approach: Delta method: treat as ratio of means (total revenue / total trips). SE reduced by 40% vs. naive averaging. Combined with CUPED (pre-period revenue as covariate, ρ=0.78): total variance reduction = 73%.
Takeaway: For ratio metrics, always use delta method. Naive per-user ratios are dominated by low-denominator noise.
Business Context: Pre-experiment covariate: prior-year filing completion + number of forms (ρ = 0.74).
Results: CUPED variance reduction: 55%. Experiment that would need 3 weeks raw now reaches significance in ~1.3 weeks. During peak tax season, this saved ~12 days of suboptimal UX.
Takeaway: In tax filing, timing is everything. 12 weeks of filing season × $2M/week of delayed features = CUPED pays for itself immediately.
```python
# CUPED & Delta Method
import numpy as np

# --- CUPED ---
def cuped_adjust(y_post, x_pre):
    """Adjust outcomes using a pre-experiment covariate (control variate)."""
    theta = np.cov(y_post, x_pre)[0, 1] / np.var(x_pre)
    return y_post - theta * (x_pre - np.mean(x_pre))

np.random.seed(42)
n = 10000
x_pre = np.random.normal(65, 20, n)
treatment = np.random.binomial(1, 0.5, n)
y_post = 0.7 * x_pre + treatment * 2.0 + np.random.normal(0, 15, n)

y_adj = cuped_adjust(y_post, x_pre)
print(f"Raw SE: {y_post.std()/np.sqrt(n/2):.3f}")
print(f"CUPED SE: {y_adj.std()/np.sqrt(n/2):.3f}")
print(f"Variance reduction: {1 - y_adj.var()/y_post.var():.1%}")

# --- Delta Method for Ratio Metrics ---
def delta_method_ratio_se(y, x):
    """SE of mean(y)/mean(x) via a first-order Taylor expansion."""
    n = len(y)
    mu_y, mu_x = y.mean(), x.mean()
    var_y, var_x = y.var(), x.var()
    cov_yx = np.cov(y, x)[0, 1]
    return np.sqrt((var_y - 2*(mu_y/mu_x)*cov_yx
                    + (mu_y/mu_x)**2 * var_x) / (mu_x**2 * n))
```
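The block bootstrap mentioned earlier isn't shown above; here is a minimal sketch on hypothetical user-clustered data. The key move is resampling whole users, not individual trips, so within-user correlation is preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical clustered data: 200 users, 20 trips each,
# with a shared per-user effect creating within-user correlation.
user_ids = np.repeat(np.arange(200), 20)
user_effect = rng.normal(10, 2, 200)
y = user_effect[user_ids] + rng.normal(0, 1, user_ids.size)

def block_bootstrap_ci(y, clusters, n_boot=1000, seed=1):
    """Percentile CI for the mean, resampling whole clusters with replacement."""
    rng = np.random.default_rng(seed)
    groups = [y[clusters == c] for c in np.unique(clusters)]
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(groups), size=len(groups))
        boot.append(np.concatenate([groups[i] for i in idx]).mean())
    return np.percentile(boot, [2.5, 97.5])

lo, hi = block_bootstrap_ci(y, user_ids)
naive_halfwidth = 1.96 * y.std(ddof=1) / np.sqrt(y.size)
print(f"block CI width {hi - lo:.3f} vs naive iid width {2 * naive_halfwidth:.3f}")
```

The block CI comes out much wider than the naive i.i.d. CI — the honest answer when observations within a user are correlated.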
From posterior beliefs to adaptive allocation — when exploration meets exploitation.
Bayesian A/B testing replaces p-values with posterior distributions. For binary outcomes, Beta-Bernoulli conjugacy gives closed-form updates: prior Beta(α₀, β₀) + s successes in n trials → posterior Beta(α₀+s, β₀+n-s).
Compute P(θ_B > θ_A | data) via simulation or closed-form. Expected loss = E[max(θ_A - θ_B, 0)] measures the cost of choosing B if A is actually better. Ship when P(best) > threshold AND expected loss is small.
Thompson Sampling: sample from each arm's posterior, pull the arm with the highest sample. Naturally balances exploration (uncertain arms get sampled widely) and exploitation (good arms get pulled more). UCB picks the arm with the highest upper confidence bound. Contextual bandits condition on user features for personalized allocation.
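For contrast with Thompson sampling, here is a minimal UCB1 sketch (the arm probabilities are hypothetical, with wider gaps than the demo below so the winner emerges quickly):

```python
import numpy as np

def ucb1(true_probs, horizon=10000, seed=0):
    """UCB1: pull the arm with the highest empirical mean + exploration bonus."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    pulls = np.zeros(k)
    wins = np.zeros(k)
    for arm in range(k):                      # pull each arm once to initialize
        pulls[arm] += 1
        wins[arm] += rng.binomial(1, true_probs[arm])
    for t in range(k, horizon):
        ucb = wins / pulls + np.sqrt(2 * np.log(t) / pulls)
        arm = int(np.argmax(ucb))             # optimism in the face of uncertainty
        pulls[arm] += 1
        wins[arm] += rng.binomial(1, true_probs[arm])
    return pulls

pulls = ucb1([0.05, 0.10, 0.20])
print("pulls per arm:", pulls)
```

The bonus √(2 ln t / nᵢ) shrinks as an arm accumulates pulls, so uncertain arms keep getting tried until the evidence rules them out.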
Frequentist testing asks: "Is B better? Yes or no." Bayesian asks: "HOW MUCH better is B, and how sure am I?" You start with a belief (prior), update with data, and get a posterior that directly says "94% chance B is better by 2-5%."
Bandits go further: instead of 50/50 traffic split the whole time, they gradually send more users to the winning variant. Like a restaurant trying new dishes — you taste-test, then shift to mostly serving the better one.
Key tradeoff: A/B tests give clean causal inference. Bandits optimize cumulative reward but sacrifice clean inference — you can't compute unbiased treatment effects from bandit data.
Business Context: Uber uses contextual MABs for surge pricing optimization. Context: time of day, location, demand/supply ratio, weather. Arms: different surge multiplier levels.
Results: Thompson sampling with logistic regression on context features outperforms fixed pricing rules by 4.3% in driver utilization. Bandits adapt to local conditions — surge that works in Manhattan at 6pm differs from Austin at 2am.
Tradeoff: Bandits sacrifice clean causal inference for better cumulative outcomes. For pricing, cumulative reward matters more than a clean ATE estimate.
Business Context: Testing 3 onboarding variants with Bayesian analysis. Prior: Beta(1,1) = uniform. After 10K users: P(V2 best) = 89%.
Decision rule: Ship when P(best) > 95% AND expected loss < 0.5pp. V2 reached both thresholds at 15K users — shipped 1 week earlier than a frequentist test would have allowed.
Takeaway: Bayesian analysis naturally answers "how confident are we?" without arbitrary α thresholds. Expected loss gives a direct business-relevant decision criterion.
Three arms with hidden true probabilities. Watch Thompson sampling discover the best arm.
```python
# Bayesian A/B Testing & Thompson Sampling
import numpy as np

# --- Bayesian A/B Test (Beta-Bernoulli conjugacy) ---
alpha_A, beta_A = 1 + 450, 1 + 5000 - 450   # Control: 450/5000 = 9.0%
alpha_B, beta_B = 1 + 520, 1 + 5000 - 520   # Treatment: 520/5000 = 10.4%

samples_A = np.random.beta(alpha_A, beta_A, 100_000)
samples_B = np.random.beta(alpha_B, beta_B, 100_000)
p_b_better = (samples_B > samples_A).mean()
loss_B = np.maximum(samples_A - samples_B, 0).mean()
print(f"P(B > A) = {p_b_better:.3f}")
print(f"Expected loss of choosing B: {loss_B:.5f}")

# --- Thompson Sampling ---
class ThompsonBandit:
    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)

    def select_arm(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alpha, self.beta)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

true_probs = [0.05, 0.07, 0.10]
bandit = ThompsonBandit(3)
for _ in range(10000):
    arm = bandit.select_arm()
    reward = np.random.binomial(1, true_probs[arm])
    bandit.update(arm, reward)
```
The probability distributions and core concepts that underpin everything else.
Normal: parameterized by μ, σ. 68-95-99.7 rule. Binomial: count of successes in n Bernoulli trials. Poisson: rare event counts with rate λ (mean = variance = λ). Exponential: time between events, rate λ, memoryless.
The CLT states that sample means approach normality regardless of the underlying distribution (given independence and sufficient n). This is why t-tests work even on non-normal data.
MLE finds parameters maximizing P(data|θ). For normal: μ̂ = x̄, σ̂² = Σ(xᵢ-x̄)²/n. Correlation normalizes covariance to [-1,1]. R² = ρ² for simple regression — fraction of variance explained.
Distributions are templates for randomness. Heights → normal (bell curve). Customer calls/hour → Poisson (rare events). Time between bus arrivals → exponential (longer waits less likely).
The CLT is the most important theorem in statistics: average enough of ANY data, and the averages look like a bell curve. This is why so many methods assume normality — not because raw data is normal, but because averages are.
Correlation measures how much two things move together linearly. R² tells you what percentage of one variable's variation is explained by the other. But correlation ≠ causation, and zero correlation ≠ independence (could have non-linear relationship).
Business Context: Trip durations follow a log-normal distribution (right-skewed). Mean: 18 min, median: 14 min. Using normal naively would underestimate long trip probability.
Approach: MLE fits log-normal parameters for ETA prediction. For A/B test analysis, CLT ensures mean trip duration across thousands of trips is approximately normal, even though individual trips are skewed.
Business Context: TurboTax support calls follow a Poisson process. MLE: λ_weekday ≈ 1,200/hr, λ_peak ≈ 3,400/hr. Used for staffing models.
Validation: Poisson assumption checked by verifying mean ≈ variance in each time window. Exponential inter-arrival times: mean = 3600/3400 ≈ 1.06 seconds during peak. Deviations from Poisson signal unusual behavior (e.g., a system outage creating call bursts).
```python
# Distribution Fundamentals
import numpy as np
from scipy import stats

# Normal distribution: 68-95-99.7 rule
x = np.linspace(-4, 4, 200)
pdf = stats.norm.pdf(x, 0, 1)

# Binomial: n trials, success probability p
pmf_binom = stats.binom.pmf(np.arange(21), n=20, p=0.3)
print(f"Binomial(20, 0.3): mean={20*0.3}, var={20*0.3*0.7:.1f}")

# Poisson: λ events per unit time (mean = variance!)
pmf_pois = stats.poisson.pmf(np.arange(20), mu=5)
print("Poisson(5): mean=5, var=5")

# MLE for the normal distribution
data = np.random.normal(5, 2, 1000)
mu_mle, sigma_mle = stats.norm.fit(data)

# CLT demonstration: means of skewed (exponential) data are approximately normal
means = [np.random.exponential(2, 50).mean() for _ in range(10_000)]

# Correlation and R² on two correlated variables
a = np.random.normal(0, 1, 1000)
b = 0.8 * a + np.random.normal(0, 0.6, 1000)
r = np.corrcoef(a, b)[0, 1]
print(f"Correlation: {r:.3f}, R²: {r**2:.3f}")
```
From linear models to regularization — the workhorses of predictive analytics.
Linear regression minimizes squared residuals. Assumptions: linearity, independence, homoscedasticity, normal residuals. Check via: residual plots, Q-Q plots, VIF (> 10 = multicollinearity), Cook's distance (outliers).
Logistic regression models log-odds as linear: log(p/(1-p)) = Xβ. Uses MLE. Output is probability via sigmoid σ(z) = 1/(1+e^{-z}).
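A tiny illustration of the log-odds-to-probability mapping (the intercept and coefficient below are hypothetical, not from any fitted model):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical model: log-odds = -2.0 + 0.8 * feature
log_odds = -2.0 + 0.8 * 1.5    # linear in the features
p = sigmoid(log_odds)          # squashed back into (0, 1)
odds = p / (1 - p)             # recovering odds: log(odds) equals log_odds
print(f"p = {p:.3f}")
```

This is why logistic coefficients are read multiplicatively: a one-unit increase in the feature multiplies the odds by e^0.8 ≈ 2.23.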
Ridge (L2) shrinks all coefficients toward zero — good when all features are relevant. Lasso (L1) sets some coefficients exactly to zero — performs feature selection. Elastic Net combines both. Tune λ via cross-validation.
Evaluate regression with R² (adjusted R²); classification with ROC/AUC, precision, recall, F1. k-fold cross-validation gives honest out-of-sample performance.
Linear regression = drawing the best straight line through a scatter plot. "Best" = minimizes total prediction error squared.
Logistic regression predicts probabilities using an S-shaped curve (sigmoid) that squishes any number into 0-1. "What's the chance this customer churns?"
Regularization = adding a penalty for complexity. Ridge says "keep all coefficients small" (budget for each feature). Lasso says "some features don't matter — set them to zero." Use Ridge when all features contribute; Lasso when you suspect many are noise.
Model: Logistic regression with L1 (Lasso) to select top predictors from 50+ candidates. AUC = 0.81.
Top predictors: Declining trip frequency (β=-1.2), earnings below city median (β=0.8), account age < 90 days (β=0.6). 12 features retained out of 50+.
Usage: Model triggers retention interventions (bonus offers, personalized messages) for high-risk drivers.
Model: Ridge regression (L2) because all features are meaningful and correlated. R² = 0.73.
Diagnostics: Residual analysis revealed heteroscedasticity (variance increases with return complexity) — fixed with log-transform of revenue.
Usage: LTV modeling and staffing allocation. Knowing expected revenue per return informs which users to prioritize for expert assistance.
```python
# Regression Diagnostics Pipeline
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Linear regression with diagnostics
model = LinearRegression().fit(X_train, y_train)
residuals = y_test - model.predict(X_test)
_, p_normal = stats.shapiro(residuals[:500])  # normality check on residuals

# Multicollinearity (VIF > 10 = problem)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X_train, i) for i in range(X_train.shape[1])]

# Logistic regression with L1 penalty (feature selection)
log_model = LogisticRegression(penalty='l1', C=1.0, solver='saga')
log_model.fit(X_train, y_train)
auc = roc_auc_score(y_test, log_model.predict_proba(X_test)[:, 1])

# Regularization comparison
for name, m in [("Ridge", Ridge(alpha=1.0)), ("Lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(m, X_train, y_train, cv=5, scoring='r2')
    m.fit(X_train, y_train)
    n_zero = (np.abs(m.coef_) < 1e-6).sum()
    print(f"{name}: R²={scores.mean():.3f}, zero coefs={n_zero}")
```
From single decision trees to XGBoost — the most powerful off-the-shelf predictors.
Decision trees partition features via greedy recursive binary splitting on Gini impurity or information gain. Pruning (cost-complexity) prevents overfitting.
Random Forest: bootstrap samples + random feature subsets → many decorrelated trees → average. Reduces variance via bagging. OOB error gives free cross-validation.
Gradient Boosting: sequentially fit trees to negative gradient (residuals). Learning rate η controls step size. XGBoost adds L1/L2 regularization on leaf weights, second-order gradients, efficient sparse handling. AdaBoost: sequential stumps where misclassified samples get upweighted.
A decision tree plays 20 questions: "Income > $50K? → Age > 35? → Predict: will buy." Simple but overfits.
Random Forest is "wisdom of crowds" — 500 mediocre trees vote, and their average is excellent because errors cancel out.
Gradient Boosting is an iterative coach: first model predicts, second corrects the first's mistakes, third corrects what's still wrong. Each is weak but the team is strong. XGBoost = gradient boosting with all the engineering optimizations — the Kaggle king.
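The "each model corrects the previous one's mistakes" loop fits in a few lines. A regression sketch on synthetic data: for squared loss, the negative gradient is just the residual, so each round fits a shallow tree to the residuals and takes a small step of size η:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)

eta = 0.1                          # learning rate: shrink each tree's contribution
pred = np.full_like(y, y.mean())   # start from a constant prediction
for _ in range(100):
    residuals = y - pred                          # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += eta * tree.predict(X)                 # small corrective step

print(f"train MSE after boosting: {np.mean((y - pred) ** 2):.4f}")
```

XGBoost replaces this first-order loop with second-order (Hessian-weighted) steps plus L1/L2 penalties on leaf weights, but the residual-fitting skeleton is the same.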
Features: Device fingerprint, GPS jitter, payment method age, trip pattern anomaly score. XGBoost with class weights (fraud ≈ 0.5% of trips).
Results: AUC=0.96, precision=0.82 at recall=0.90. Top feature importance: GPS jitter (21%), device mismatch (18%), payment velocity (15%). Beats logistic regression (AUC=0.88) and RF (AUC=0.93).
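With fraud at ~0.5% of trips, unweighted training collapses to "predict not-fraud." A minimal sketch of class weighting on synthetic data with the same base rate — using scikit-learn's `class_weight='balanced'`; in XGBoost the analogous knob is `scale_pos_weight = n_negative / n_positive` (the case-study features and numbers above are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# ~0.5% positives, mimicking the fraud base rate (synthetic data)
X, y = make_classification(n_samples=20000, weights=[0.995], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' reweights each class inversely to its frequency
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=0)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC: {auc:.3f}")
```

Note that AUC is threshold-free; for the precision-at-recall numbers quoted above you would sweep the decision threshold on the predicted probabilities.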
Task: Predict which TurboTax tier (Free/Deluxe/Premier/SE) each user needs. Random Forest chosen for interpretability.
Results: OOB accuracy: 78%. Feature importance: prior year forms (32%), W-2 count (18%), investment income flag (15%). Routes users to the right product during onboarding — reducing downgrades by 12%.
| Model | Approach | Strengths | Weaknesses | When to Use |
|---|---|---|---|---|
| Decision Tree | Single recursive split | Interpretable, fast | Overfits, high variance | Explainability needed |
| Random Forest | Bagging + feature randomness | Robust, parallel, OOB error | Less interpretable, memory | Default first try |
| Gradient Boosting | Sequential residual fitting | Most accurate | Slow, overfits if not tuned | Accuracy matters most |
| XGBoost | GB + regularization | Fast, regularized, sparse | Many hyperparams | Production, large data |
| AdaBoost | Sequential weighted stumps | Simple, less overfitting | Sensitive to noise | Quick baseline |
```python
# Tree-Based Models Comparison
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=10),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f"{name}: AUC={scores.mean():.3f} ± {scores.std():.3f}")

# Feature importance (impurity-based; biased toward high-cardinality features)
rf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda x: -x[1])[:5]:
    print(f"  {name}: {imp:.3f}")
```
Unsupervised learning, anomaly detection, and common interview gotchas.
K-means: iteratively assign points to nearest centroid, recompute centroids. Sensitive to init (use k-means++) and k (elbow/silhouette). Hierarchical: agglomerative merging via linkage criteria (Ward's, complete). DBSCAN: density-based, handles arbitrary shapes, no k needed but sensitive to eps/minPts.
PCA: eigendecomposition of the covariance matrix → orthogonal directions of maximum variance. The first PC captures the most variance; each subsequent PC captures the most of what remains, subject to orthogonality with the previous ones. Always standardize first.
Special topics: Simpson's paradox (aggregate trend reverses at segment level). DAGs: confounders SHOULD be conditioned on (closes the backdoor path); colliders should NOT be (conditioning on one induces a spurious association). KNN: non-parametric, lazy learner. Naive Bayes: assumes conditional feature independence. SVM: kernel trick for non-linear boundaries.
K-means = assigning students to study groups: find group centers, assign each to nearest, move centers, repeat. DBSCAN = finding crowds in a park: dense areas are clusters, loners are outliers.
PCA = rotating your head to find the angle that shows the most variation in a 3D scatter plot, then taking a photo from that angle for a 2D picture with maximum information.
Simpson's Paradox: Berkeley admitted more men overall, but within each department women had equal or higher rates — women applied more to competitive departments. Always segment before concluding.
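A toy version of the Berkeley pattern (hypothetical counts, chosen only to reproduce the reversal) makes the mechanism concrete: women apply mostly to the hard department, so aggregation favors men even though women win within every department:

```python
# dept: ((men applied, men admitted), (women applied, women admitted))
data = {
    "Easy dept": ((800, 560), (100, 80)),   # women: 80% vs men: 70%
    "Hard dept": ((200, 40), (900, 270)),   # women: 30% vs men: 20%
}

tm = ta = wm = wa = 0
for dept, ((m_app, m_adm), (w_app, w_adm)) in data.items():
    print(f"{dept}: men {m_adm / m_app:.0%}, women {w_adm / w_app:.0%}")
    tm += m_app; ta += m_adm
    wm += w_app; wa += w_adm

# Aggregation reverses the picture: men look favored overall
print(f"Overall: men {ta / tm:.0%}, women {wa / wm:.0%}")
```

Women's rate is higher in both departments, yet the pooled rate is 60% for men vs 35% for women — the segment sizes, not the treatment, drive the aggregate.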
Approach: 5M riders. 15 features → PCA to 5 components (82% variance). K-means k=6 via silhouette score.
Segments: "Power Commuters" (high freq, short distance), "Weekend Warriors" (low freq, long distance), "Price Sensitives" (cancel at surge > 1.5×), etc. Each gets different marketing.
Anomaly Detection: Rolling z-scores on filing speed, deduction amounts, refund-to-income ratios. |z| > 3 flags for review. DBSCAN clusters suspicious returns sharing patterns.
Simpson's Paradox: Overall CSAT dropped QoQ, but within each tier it went UP — more users shifted to Free tier (lower baseline CSAT). Segmentation revealed the true story.
Confounder Z → X and Z → Y: CONTROL for Z (include in regression).
Mediator X → M → Y: Do NOT control for M if you want total effect of X.
Collider X → C ← Y: Do NOT condition on C — creates spurious association. Example: "Disease → Hospitalization ← Injury." In hospital data, disease and injury appear correlated — but only because both cause hospitalization.
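Collider bias is easy to simulate. Below, disease and injury are generated independently, and a hypothetical rule sends anyone with either to the hospital; restricting to hospitalized patients manufactures a strong negative correlation out of nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
disease = rng.random(n) < 0.1   # two independent causes
injury = rng.random(n) < 0.1
hospital = disease | injury     # collider: either cause triggers hospitalization

# Unconditional: disease and injury are (nearly) uncorrelated
corr_all = np.corrcoef(disease, injury)[0, 1]

# Condition on the collider: spurious negative association appears
corr_hosp = np.corrcoef(disease[hospital], injury[hospital])[0, 1]
print(f"corr overall: {corr_all:+.3f}, corr among hospitalized: {corr_hosp:+.3f}")
```

Intuitively: among hospitalized patients, knowing someone has no disease makes injury more likely — something had to put them there. This is exactly why hospital-only datasets can show "protective" effects that vanish in the general population.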
Painted Door Experiments: Show a button for a feature that doesn't exist. Measure clicks. If high → demand exists. Intuit: "AI Tax Advisor" button shown to 5%, 23% clicked → validated demand before 3-month build.
Learning Effects: Novelty (users try new thing, engagement inflates then fades) vs. change aversion (engagement dips then recovers). Test on new users only to separate effects.
Imbalanced Datasets: SMOTE (oversample minority in feature space), Tomek links (undersample majority near boundary), cost-sensitive learning. Never evaluate on resampled data.
```python
# Clustering, PCA & Special Topics
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced {X.shape[1]} → {X_pca.shape[1]} components")

# K-Means: pick k via silhouette
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10).fit(X_pca)
    print(f"k={k}: silhouette={silhouette_score(X_pca, km.labels_):.3f}")

# DBSCAN (label -1 marks outliers, not a cluster)
db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_pca)
n_clusters = len(set(db)) - (1 if -1 in db else 0)
print(f"Clusters: {n_clusters}, Outliers: {(db == -1).sum()}")

# Imbalanced dataset handling
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)
# Never evaluate on resampled data!
```