Hypothesis Testing & p-values Flashcards

(38 cards)

1
Q

What is the basic goal of hypothesis testing?

A

To assess whether observed data are compatible with a specified null hypothesis about a population or model parameter.

2
Q

What is a null hypothesis H₀?

A

A default assumption or claim about a parameter or distribution, such as ‘no effect’ or ‘no difference’ between groups.

3
Q

What is an alternative hypothesis H₁?

A

A competing claim that would be supported if the evidence is strong enough against the null, such as ‘there is a difference’.

4
Q

What is a test statistic in hypothesis testing?

A

A function of the data that measures the degree of discrepancy between the observed data and what would be expected under H₀.

5
Q

What is a p-value?

A

The probability, under the null hypothesis, of observing a test statistic at least as extreme as the one computed from the data.
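The definition above can be made concrete with a small sketch. Assuming the test statistic is approximately standard normal under H₀ (e.g., a z-test), a two-sided p-value follows directly from the normal CDF; the z value here is illustrative:

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) when the test statistic Z is N(0, 1) under H0."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# An observed z of 1.96 sits right at the conventional 5% threshold.
print(round(two_sided_p_value(1.96), 3))  # 0.05
```

A one-sided test would use a single tail, 1 − Φ(z) or Φ(z), instead of doubling it.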

6
Q

Does a small p-value prove that the null hypothesis is false?

A

No; it indicates that the observed data would be unusual under H₀, but does not provide a direct probability that H₀ is true or false.

7
Q

What is a significance level α (alpha)?

A

A chosen threshold such as 0.05, representing the maximum tolerable probability of rejecting H₀ when it is actually true (Type I error).

8
Q

What is a Type I error?

A

Rejecting a true null hypothesis (a false positive).

9
Q

What is a Type II error?

A

Failing to reject a false null hypothesis (a false negative).

10
Q

What is statistical power?

A

The probability of correctly rejecting H₀ when it is false; i.e., 1 − (Type II error rate).

11
Q

Why is power important in experimental design?

A

Low power means that even real effects are unlikely to be detected, leading to wasted experiments and misleading ‘no effect’ conclusions.

12
Q

What factors increase the power of a test?

A

Larger sample size, larger true effect size, lower variance, and higher significance level (all else equal).
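These levers show up clearly in a quick Monte Carlo sketch (assumed setup: a two-sample z-test with known variance and normal data; all numbers are illustrative):

```python
import random
import statistics
from math import sqrt

def simulated_power(n: int, effect: float, sigma: float = 1.0,
                    trials: int = 2000) -> float:
    """Fraction of simulated experiments in which a two-sided z-test
    at alpha = 0.05 rejects H0 when the true mean difference is `effect`."""
    z_crit = 1.96
    rejections = 0
    for _ in range(trials):
        a = [random.gauss(0.0, sigma) for _ in range(n)]
        b = [random.gauss(effect, sigma) for _ in range(n)]
        se = sigma * sqrt(2.0 / n)
        z = (statistics.mean(b) - statistics.mean(a)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

random.seed(0)
for n in (20, 80, 320):
    print(n, simulated_power(n, effect=0.5))  # power rises with n
```

Varying `effect` or `sigma` instead of `n` illustrates the other levers in the same way.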

13
Q

Why is ‘p<0.05’ not a magic threshold?

A

It is a convention; real decisions should consider effect size, uncertainty, costs of errors, and context, not just a binary cutoff.

14
Q

What is multiple testing or multiple comparisons?

A

Running many hypothesis tests, which increases the chance of getting at least one small p-value by random chance even when all nulls are true.
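The inflation is easy to quantify for independent tests: if every null is true, the chance of at least one false positive at level α across m tests is 1 − (1 − α)^m. A minimal sketch:

```python
alpha = 0.05
for m in (1, 10, 100):
    # Familywise error rate when all m independent nulls are true.
    fwer = 1.0 - (1.0 - alpha) ** m
    print(f"{m:3d} tests -> P(at least one false positive) = {fwer:.3f}")
```

With 100 independent tests the familywise error rate exceeds 99%, which is why corrections such as Bonferroni (testing each hypothesis at α/m) are used.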

15
Q

Why is multiple testing a concern in ML and analytics?

A

Trying many models, features, or metrics and only reporting the best results can lead to overoptimistic conclusions and p-hacking.

16
Q

What is an A/B test at a high level?

A

A controlled experiment comparing outcomes between a control group (A) and a treatment group (B) to assess the effect of a change.

17
Q

What is randomization in A/B testing?

A

Assigning units to groups at random so that, on average, groups are comparable and confounders are balanced.

18
Q

Why is randomization critical for causal interpretation?

A

It breaks systematic links between treatment assignment and other variables, making differences in outcomes attributable to the treatment with fewer assumptions.

19
Q

What are typical units of randomization in online A/B tests?

A

Users, sessions, or requests, depending on the product and outcome being measured.

20
Q

Why is the ‘stable unit treatment value’ assumption (SUTVA) important in experiments?

A

It assumes one unit’s outcome is not affected by another unit’s treatment, simplifying interpretation; interference can complicate A/B test analysis.

21
Q

What is a lift in A/B testing?

A

The relative change in a metric between treatment and control, often expressed as a percentage increase or decrease.
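As a tiny sketch (the conversion rates below are hypothetical):

```python
def relative_lift(control: float, treatment: float) -> float:
    """Relative change of the treatment metric vs. the control metric."""
    return (treatment - control) / control

# A hypothetical move from a 4.0% to a 4.4% conversion rate is a +10% lift.
print(f"{relative_lift(0.040, 0.044):+.1%}")  # +10.0%
```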

22
Q

What is the difference between statistical significance and practical significance?

A

Statistical significance is about whether an effect is unlikely under H₀; practical significance is whether the effect size is large enough to matter in practice.

23
Q

Why is it dangerous to stop an experiment as soon as p<0.05 without planning?

A

Repeated peeking at results inflates Type I error; you effectively perform multiple tests without correction.
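A simulation makes the inflation visible. Under H₀ (no true effect), an experimenter who peeks after every batch and stops at the first |z| > 1.96 rejects far more often than the nominal 5% (setup and numbers are illustrative):

```python
import random
import statistics

def peeking_type_i_rate(n_max: int = 500, peek_every: int = 50,
                        trials: int = 1000) -> float:
    """Fraction of no-effect experiments that ever cross |z| > 1.96
    at any interim peek, i.e., the inflated Type I error rate."""
    false_positives = 0
    for _ in range(trials):
        a, b = [], []
        for _ in range(n_max // peek_every):
            a += [random.gauss(0.0, 1.0) for _ in range(peek_every)]
            b += [random.gauss(0.0, 1.0) for _ in range(peek_every)]
            se = (2.0 / len(a)) ** 0.5
            z = (statistics.mean(b) - statistics.mean(a)) / se
            if abs(z) > 1.96:
                false_positives += 1
                break
    return false_positives / trials

random.seed(0)
print(peeking_type_i_rate())  # well above the nominal 0.05
```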

24
Q

What is a pre-analysis plan or fixed-horizon test?

A

A plan that specifies sample size, metrics, and decision rules in advance and analyzes data only after reaching the planned sample size.

25
Q

What is sequential testing or online testing?

A

Approaches that allow monitoring results over time with statistical corrections to maintain valid error rates when checking repeatedly.

26
Q

Why is sample size planning important before running an A/B test?

A

To ensure that the experiment has enough power to detect the minimum effect size of interest without running excessively long or short.
27
Q

What inputs are typically needed for rough sample size calculations?

A

Baseline metric level, minimum detectable effect size, desired power, and significance level.
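Those inputs plug into a standard back-of-the-envelope formula for comparing two means, n ≈ 2(z₁₋α/₂ + z_power)² σ² / δ² per group. A rough sketch (real power tools refine this):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect: float, sigma: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group n to detect a mean difference `effect`
    with the given power in a two-sided two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2.0 * (z_alpha + z_power) ** 2 * sigma ** 2 / effect ** 2)

# Detecting a 0.2-sigma effect at 80% power needs roughly 400 units per group.
print(sample_size_per_group(effect=0.2, sigma=1.0))
```

For a proportion metric, σ² is replaced by p(1 − p) at the baseline rate.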
28
Q

What role does variance play in experiment design?

A

Higher variance in the metric requires larger sample size to detect the same effect with the same power.

29
Q

Why can observational differences between groups be misleading without randomization?

A

Groups may differ in many ways besides the treatment, so differences in outcomes may be due to confounders rather than the treatment itself.

30
Q

What is a confounder?

A

A variable that influences both the treatment and the outcome, potentially creating a spurious association.

31
Q

What is a simple way to reduce confounding in observational analysis?

A

Control for known confounders via stratification, matching, or regression, while recognizing that unknown confounding may remain.

32
Q

Why are experiments still preferred when feasible, even with advanced observational methods?

A

Experiments provide cleaner identification of causal effects under simpler, more transparent assumptions.

33
Q

In ML model evaluation, what is an informal 'hypothesis test'?

A

Assessing whether one model’s performance metric is substantially better than another’s beyond what could be expected from random variation in the test set.

34
Q

Why is it useful to think of model comparisons as noisy estimates rather than exact ranks?

A

Performance metrics on finite data have sampling variability; small differences may be indistinguishable from noise.
35
Q

What is a good practice when comparing models using metrics?

A

Report point estimates with confidence intervals or uncertainty estimates, and consider business impact, not just statistical significance.
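One way to get such an interval without distributional assumptions is a percentile bootstrap over the test set. A sketch with synthetic data and a made-up helper name:

```python
import random

def bootstrap_ci_for_accuracy_diff(correct_a, correct_b,
                                   n_boot=2000, alpha=0.05):
    """Percentile-bootstrap CI for accuracy(model B) - accuracy(model A),
    given paired 0/1 per-example correctness lists on the same test set."""
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_b - acc_a)
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Synthetic correctness for two models on the same 200 test examples.
random.seed(0)
correct_a = [1 if random.random() < 0.80 else 0 for _ in range(200)]
correct_b = [1 if random.random() < 0.83 else 0 for _ in range(200)]
lo, hi = bootstrap_ci_for_accuracy_diff(correct_a, correct_b)
print(f"95% CI for accuracy difference: [{lo:+.3f}, {hi:+.3f}]")
```

If the interval comfortably contains 0, the observed ranking may be noise; the interval narrows as the test set grows.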
36
Q

What is p-hacking?

A

Manipulating analysis choices or repeatedly testing until a desired p-value is obtained, inflating false discovery rates.

37
Q

How can teams reduce p-hacking risk?

A

Pre-registering analysis plans, limiting post-hoc choices, correcting for multiple comparisons, and emphasizing effect sizes and uncertainty.

38
Q

In one sentence, what should an ML engineer remember about hypothesis tests and p-values?

A

They are tools for quantifying evidence under a model, not definitive answers; combine them with effect sizes, uncertainty, and domain context when making decisions.