What is data leakage in ML and statistical modeling?
Using information during training or feature creation that would not be available at prediction time, leading to overly optimistic evaluation and poor deployment performance.
What is target leakage specifically?
A type of data leakage where features directly or indirectly encode the target variable using information from the future or post-outcome events.
Why is target leakage particularly dangerous?
It can make models appear extremely accurate in offline evaluation, only to collapse when deployed because the leaked information is absent.
What is an example of target leakage in a credit risk model?
Including a feature like ‘loan written off’ or ‘days delinquent after default date’ when predicting default at application time.
How can cross-validation be misused to create leakage?
By computing feature transformations, scaling, or imputations on the full dataset before splitting, so information from validation folds influences training.
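A minimal numpy sketch of this pitfall (variable names are illustrative): standardizing with full-dataset statistics lets validation rows shape the training features, while fitting the transformation on the training split alone does not.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)
train, valid = data[:80], data[80:]

# Leaky: mean/std computed on ALL rows, so validation values influence training features.
leaky_train = (train - data.mean()) / data.std()

# Correct: fit the transformation on the training split only, then apply it everywhere.
mu, sigma = train.mean(), train.std()
clean_train = (train - mu) / sigma
clean_valid = (valid - mu) / sigma
```

In scikit-learn, wrapping preprocessing and model in a `Pipeline` and cross-validating the pipeline enforces this separation automatically.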
What is look-ahead bias in time-series modeling?
Using data from the future in training or evaluation when simulating predictions that would have been made in the past.
How do you avoid look-ahead bias in time-series evaluation?
Use chronological (rolling-origin) splits in which training uses only past data and the validation/test sets cover strictly later periods.
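A pure-Python sketch of a rolling-origin split (the function name and fold sizes are illustrative): each fold trains on everything before a cutoff and validates on the block immediately after it.

```python
import numpy as np

def chronological_splits(n, n_folds=3, test_size=10):
    """Yield (train_idx, test_idx) pairs whose test block lies strictly after the train block."""
    for k in range(1, n_folds + 1):
        cutoff = n - (n_folds - k + 1) * test_size
        yield np.arange(cutoff), np.arange(cutoff, cutoff + test_size)

for train_idx, test_idx in chronological_splits(100):
    assert train_idx.max() < test_idx.min()  # never train on the future
```

scikit-learn's `TimeSeriesSplit` implements this same expanding-window pattern.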
Why is using the test set repeatedly for model tuning a pitfall?
It effectively turns the test set into another validation set, causing optimistic bias in reported performance.
What is the correct role of the test set in ML experiments?
To provide a final, unbiased estimate of performance after all model selection and tuning decisions are complete.
What is selection bias in datasets?
Bias introduced when the observed data are not a random sample from the target population, often due to the way data are collected or filtered.
How can selection bias affect model performance in production?
Models trained on biased samples may perform poorly when deployed to a broader or different population.
What is covariate shift?
A change in the distribution of input features, P(x), between training and deployment, while the conditional distribution of outputs given inputs, P(y|x), stays the same.
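A toy numpy illustration (the distributions are invented): the labeling rule P(y|x) is identical in both environments, but because the inputs move, the model faces a very different mixture at deployment.

```python
import numpy as np

rng = np.random.default_rng(5)
label = lambda x: (x > 0).astype(int)        # same rule y = 1[x > 0] everywhere

x_train = rng.normal(-1.0, 1.0, size=5000)   # training inputs centered at -1
x_deploy = rng.normal(+1.0, 1.0, size=5000)  # deployment inputs centered at +1

# P(y | x) is unchanged, yet the positive base rate the model sees flips.
train_pos_rate = label(x_train).mean()       # roughly 0.16
deploy_pos_rate = label(x_deploy).mean()     # roughly 0.84
```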
What is label shift?
A change in the distribution of labels, P(y), across environments, while the conditional distribution of inputs given labels, P(x|y), stays approximately the same.
Why is ignoring distribution shift a pitfall?
Models evaluated only under the training distribution may fail when real-world conditions change, leading to unexpected degradation.
What is class imbalance and why is it problematic?
When one class is much more frequent than others; naive models can achieve high accuracy by predicting the majority class while failing on the minority.
What metric-related mistake is common with imbalanced data?
Relying on accuracy instead of metrics like precision, recall, F1, or PR-AUC that focus on the minority class.
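A small numpy illustration (the 95/5 split is made up): on imbalanced data, a constant majority-class predictor scores high accuracy while its minority-class recall is zero.

```python
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)   # 5% positive class
y_pred = np.zeros_like(y_true)          # degenerate model: always predict the majority class

accuracy = (y_pred == y_true).mean()                 # 0.95, looks impressive
true_pos = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_pos / (y_true == 1).sum()              # 0.0, every positive is missed
```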
What is overfitting in the context of model complexity?
When a model fits noise and idiosyncrasies in the training data, achieving low training error but high test error.
Why is evaluating a very flexible model on a tiny validation set a pitfall?
Random noise in the small validation set can mislead model selection, making unstable models look best by chance.
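A quick simulation of this effect (the sizes are arbitrary): 50 models that are all truly no better than chance, scored on a 20-example validation set, still produce a "best" model that appears clearly above chance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_models, n_val = 50, 20
# Every model's true accuracy is 0.5; observed scores vary only through validation noise.
val_scores = rng.binomial(n_val, 0.5, size=n_models) / n_val
best_score = val_scores.max()   # the selected "winner" beats 0.5 purely by luck
```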
What is the danger of p-hacking in statistical analysis?
Testing many hypotheses or analysis variants and only reporting significant ones inflates the false positive rate and undermines trust in results.
Why is ‘p<0.05’ not adequate evidence on its own?
It ignores effect size, uncertainty, prior plausibility, multiple testing, and costs/benefits; context is essential.
What is a common misinterpretation of a 95% confidence interval?
Believing there is a 95% probability that the true parameter lies in this specific interval, rather than understanding it as a long-run coverage property.
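The long-run reading can be checked by simulation (a sketch; sample sizes are arbitrary): across many repeated samples, roughly 95% of intervals constructed this way contain the true mean, but any single interval either does or does not.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu, n, trials = 0.0, 50, 2000
covered = 0
for _ in range(trials):
    sample = rng.normal(true_mu, 1.0, size=n)
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)
    covered += sample.mean() - half_width <= true_mu <= sample.mean() + half_width

coverage = covered / trials   # close to 0.95: a property of the procedure, not of one interval
```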
Why is extrapolating far beyond the range of training data risky?
Model relationships that hold within the observed range may not hold outside it, leading to wildly inaccurate predictions.
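A minimal numpy example (the quadratic is invented): a straight line fitted on x in [0, 1] tracks y = x² reasonably inside that range but is wildly wrong at x = 10.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
y = x ** 2                                   # true relationship is quadratic
slope, intercept = np.polyfit(x, y, 1)       # linear fit over the observed range

inside_error = abs((slope * 0.5 + intercept) - 0.5 ** 2)   # small within [0, 1]
outside_error = abs((slope * 10 + intercept) - 10 ** 2)    # enormous at x = 10
```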
What is label noise?
Errors or inconsistencies in target labels, such as misclassifications or ambiguous outcomes.
How can heavy label noise affect model performance and evaluation?
It can cap achievable accuracy, cause models to overfit spurious patterns, and distort metrics if not accounted for.
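A short simulation (the 20% flip rate is arbitrary): if a fifth of the recorded labels are wrong, even a model that recovers the true labels perfectly is measured at roughly 80% accuracy.

```python
import numpy as np

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=1000)       # ground-truth labels
flipped = rng.random(1000) < 0.2             # 20% of records get the wrong label
y_recorded = np.where(flipped, 1 - y_true, y_true)

# A perfect model predicts y_true, yet is scored against the noisy y_recorded.
measured_accuracy = (y_true == y_recorded).mean()   # near 0.8: noise caps the observable score
```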