What is data leakage in ML and statistical modeling?
Using information during training or feature creation that would not be available at prediction time, leading to overly optimistic evaluation and poor deployment performance.
What is target leakage specifically?
A type of data leakage where features directly or indirectly encode the target variable using information from the future or post-outcome events.
Why is target leakage particularly dangerous?
It can make models appear extremely accurate in offline evaluation, only to collapse when deployed because the leaked information is absent.
What is an example of target leakage in a credit risk model?
Including a feature like ‘loan written off’ or ‘days delinquent after default date’ when predicting default at application time.
How can cross-validation be misused to create leakage?
By computing feature transformations, scaling, or imputations on the full dataset before splitting, so information from validation folds influences training.
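A minimal numpy sketch of this pitfall (variable names are illustrative): standardizing with full-dataset statistics lets validation rows shape the training features, while fitting the transformation on the training split alone does not.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)
train, valid = data[:80], data[80:]

# Leaky: mean/std computed on ALL rows, so validation values influence training features.
leaky_train = (train - data.mean()) / data.std()

# Correct: fit the transformation on the training split only, then apply it everywhere.
mu, sigma = train.mean(), train.std()
clean_train = (train - mu) / sigma
clean_valid = (valid - mu) / sigma
```

In scikit-learn, wrapping preprocessing and model in a `Pipeline` and cross-validating the pipeline enforces this separation automatically.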
What is look-ahead bias in time-series modeling?
Using data from the future in training or evaluation when simulating predictions that would have been made in the past.
How do you avoid look-ahead bias in time-series evaluation?
Use chronological (rolling-origin) splits in which training uses only past data and the validation/test sets cover strictly later periods.
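A pure-Python sketch of a rolling-origin split (the function name and fold sizes are illustrative): each fold trains on everything before a cutoff and validates on the block immediately after it.

```python
import numpy as np

def chronological_splits(n, n_folds=3, test_size=10):
    """Yield (train_idx, test_idx) pairs whose test block lies strictly after the train block."""
    for k in range(1, n_folds + 1):
        cutoff = n - (n_folds - k + 1) * test_size
        yield np.arange(cutoff), np.arange(cutoff, cutoff + test_size)

for train_idx, test_idx in chronological_splits(100):
    assert train_idx.max() < test_idx.min()  # never train on the future
```

scikit-learn's `TimeSeriesSplit` implements this same expanding-window pattern.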
Why is using the test set repeatedly for model tuning a pitfall?
It effectively turns the test set into another validation set, causing optimistic bias in reported performance.
What is the correct role of the test set in ML experiments?
To provide a final, unbiased estimate of performance after all model selection and tuning decisions are complete.
What is selection bias in datasets?
Bias introduced when the observed data are not a random sample from the target population, often due to the way data are collected or filtered.
How can selection bias affect model performance in production?
Models trained on biased samples may perform poorly when deployed to a broader or different population.
What is covariate shift?
A change in the distribution of input features, P(x), between training and deployment, while the conditional distribution of outputs given inputs, P(y|x), stays the same.
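A toy numpy illustration (the distributions are invented): the labeling rule P(y|x) is identical in both environments, but because the inputs move, the model faces a very different mixture at deployment.

```python
import numpy as np

rng = np.random.default_rng(5)
label = lambda x: (x > 0).astype(int)        # same rule y = 1[x > 0] everywhere

x_train = rng.normal(-1.0, 1.0, size=5000)   # training inputs centered at -1
x_deploy = rng.normal(+1.0, 1.0, size=5000)  # deployment inputs centered at +1

# P(y | x) is unchanged, yet the positive base rate the model sees flips.
train_pos_rate = label(x_train).mean()       # roughly 0.16
deploy_pos_rate = label(x_deploy).mean()     # roughly 0.84
```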
What is label shift?
A change in the distribution of labels, P(y), across environments, while the conditional distribution of inputs given labels, P(x|y), stays approximately the same.
Why is ignoring distribution shift a pitfall?
Models evaluated only under the training distribution may fail when real-world conditions change, leading to unexpected degradation.
What is class imbalance and why is it problematic?
When one class is much more frequent than others; naive models can achieve high accuracy by predicting the majority class while failing on the minority.
What metric-related mistake is common with imbalanced data?
Relying on accuracy instead of metrics like precision, recall, F1, or PR-AUC that focus on the minority class.
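A small numpy illustration (the 95/5 split is made up): on imbalanced data, a constant majority-class predictor scores high accuracy while its minority-class recall is zero.

```python
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)   # 5% positive class
y_pred = np.zeros_like(y_true)          # degenerate model: always predict the majority class

accuracy = (y_pred == y_true).mean()                 # 0.95, looks impressive
true_pos = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_pos / (y_true == 1).sum()              # 0.0, every positive is missed
```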
What is overfitting in the context of model complexity?
When a model fits noise and idiosyncrasies in the training data, achieving low training error but high test error.
Why is evaluating a very flexible model on a tiny validation set a pitfall?
Random noise in the small validation set can mislead model selection, making unstable models look best by chance.
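A quick simulation of this effect (the sizes are arbitrary): 50 models that are all truly no better than chance, scored on a 20-example validation set, still produce a "best" model that appears clearly above chance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_models, n_val = 50, 20
# Every model's true accuracy is 0.5; observed scores vary only through validation noise.
val_scores = rng.binomial(n_val, 0.5, size=n_models) / n_val
best_score = val_scores.max()   # the selected "winner" beats 0.5 purely by luck
```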
What is the danger of p-hacking in statistical analysis?
Testing many hypotheses or analysis variants and only reporting significant ones inflates the false positive rate and undermines trust in results.
Why is ‘p<0.05’ not adequate evidence on its own?
It ignores effect size, uncertainty, prior plausibility, multiple testing, and costs/benefits; context is essential.
What is a common misinterpretation of a 95% confidence interval?
Believing there is a 95% probability that the true parameter lies in this specific interval, rather than understanding it as a long-run coverage property.
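The long-run reading can be checked by simulation (a sketch; sample sizes are arbitrary): across many repeated samples, roughly 95% of intervals constructed this way contain the true mean, but any single interval either does or does not.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu, n, trials = 0.0, 50, 2000
covered = 0
for _ in range(trials):
    sample = rng.normal(true_mu, 1.0, size=n)
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)
    covered += sample.mean() - half_width <= true_mu <= sample.mean() + half_width

coverage = covered / trials   # close to 0.95: a property of the procedure, not of one interval
```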
Why is extrapolating far beyond the range of training data risky?
Model relationships that hold within the observed range may not hold outside it, leading to wildly inaccurate predictions.
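A minimal numpy example (the quadratic is invented): a straight line fitted on x in [0, 1] tracks y = x² reasonably inside that range but is wildly wrong at x = 10.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
y = x ** 2                                   # true relationship is quadratic
slope, intercept = np.polyfit(x, y, 1)       # linear fit over the observed range

inside_error = abs((slope * 0.5 + intercept) - 0.5 ** 2)   # small within [0, 1]
outside_error = abs((slope * 10 + intercept) - 10 ** 2)    # enormous at x = 10
```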
What is label noise?
Errors or inconsistencies in target labels, such as misclassifications or ambiguous outcomes.
How can heavy label noise affect model performance and evaluation?
It can cap achievable accuracy, cause models to overfit spurious patterns, and distort metrics if not accounted for.
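A short simulation (the 20% flip rate is arbitrary): if a fifth of the recorded labels are wrong, even a model that recovers the true labels perfectly is measured at roughly 80% accuracy.

```python
import numpy as np

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=1000)       # ground-truth labels
flipped = rng.random(1000) < 0.2             # 20% of records get the wrong label
y_recorded = np.where(flipped, 1 - y_true, y_true)

# A perfect model predicts y_true, yet is scored against the noisy y_recorded.
measured_accuracy = (y_true == y_recorded).mean()   # near 0.8: noise caps the observable score
```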