What is overfitting?
The tendency of data mining procedures to tailor models to the training data at the expense of generalization to previously unseen data points
What is generalisation?
Generalisation is the property of a model or modeling process whereby the model applies to data that were not used to build it.
Goal: build models that generalize beyond the training data to the general population, not just fit the training data.
What is a table model?
A model that simply memorizes the training data: it looks up each new case in a table of previously seen cases and can only predict for exact matches. It performs no generalization at all, making it an extreme example of overfitting.
Why is overfitting bad?
An overfit model captures noise and idiosyncrasies of the training data rather than general patterns, so its performance on new, unseen data suffers.
What is a fitting graph and why is it useful for overfitting analysis?
A graph that shows the accuracy of a model as a function of its complexity.
The fitting graph plots a modeling procedure’s accuracy on the training data and on holdout data as model complexity changes. Generally, there will be more overfitting as the model is allowed to become more complex: training accuracy keeps improving while holdout accuracy peaks and then declines.
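A minimal numeric sketch of a fitting graph (the data, the polynomial-degree complexity axis, and the noise level are all made up for illustration): training error can only fall as complexity grows, while holdout error eventually turns back up.

```python
import numpy as np

# Fit polynomials of increasing degree (the complexity axis of a
# fitting graph) to noisy samples of a sine curve, then compare
# training vs. holdout mean-squared error.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0.025, 0.975, 20)
true_f = lambda x: np.sin(2 * np.pi * x)
y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = true_f(x_test) + rng.normal(0, 0.2, x_test.size)

train_err, test_err = [], []
for degree in range(1, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_err.append(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_err.append(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))

# Plotting train_err and test_err against degree gives the fitting graph:
# train_err decreases monotonically; test_err bottoms out and rises again.
```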

What is the base rate?
The base rate is the accuracy you get by always predicting the most common (majority) class; a classifier that always selects the majority class is called a base rate classifier.
The table model from the churn example behaves this way on new cases: since it predicts no churn for every new case it has not seen before, it gets every no-churn case right and every churn case wrong, so its accuracy on new data equals the proportion of no-churn cases, i.e. the base rate.
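The base rate calculation can be sketched in a few lines (the label counts below are made up for illustration):

```python
from collections import Counter

# The base rate is the frequency of the majority class, which is also
# the accuracy of a classifier that always predicts that class.
labels = ["no churn"] * 85 + ["churn"] * 15

majority_class, majority_count = Counter(labels).most_common(1)[0]
base_rate = majority_count / len(labels)

print(majority_class, base_rate)  # -> no churn 0.85
```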
Overfitting in classification trees / tree induction
What is overfitting in mathematical functions?
What is overfitting in linear functions (SVM and logistic regression)?
What is cross validation?
Cross-validation estimates generalization performance by splitting the data into k partitions (folds); each fold in turn serves as the test set while the model is trained on the remaining k-1 folds, and the k performance estimates are averaged (their spread also gives a sense of the variance of the estimate).
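The fold-splitting step can be sketched from scratch (function and variable names are my own, not from any particular library):

```python
# k-fold splitting: every data point lands in exactly one test fold.
def k_fold_splits(data, k):
    """Yield (train, test) pairs for k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    print(len(train), len(test))  # -> 8 2 on every fold
```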
What is holdout data?
Data withheld from model building ("held out") and used only to estimate the model's generalization performance on unseen cases.
What is a learning curve?
What is the difference between a fitting graph and a learning curve?
A learning curve shows the generalization performance (the performance on testing data only) plotted against the amount of training data used.
A fitting graph shows the generalization performance as well as the performance on the training data, plotted against model complexity.
A fitting graph is generally shown for a fixed amount of training data.
How do the learning curves of classification trees and logistic regression compare?
Given the same set of features, classification trees are a more flexible model representation than logistic regression, so with small amounts of training data logistic regression tends to perform better, while with enough data trees often overtake it (the tree's learning curve keeps rising for longer).
How can you mitigate/avoid overfitting in tree induction?
Stop growing the tree before it gets too complex (e.g., require a minimum number of instances per leaf, or use a hypothesis test to check whether a split's improvement could be due to chance), or grow a large tree and then prune it back.
What is nested holdout testing?
Split the training data further into a sub-training set and a validation set; use the validation set to choose model complexity or features, then rebuild the model on the whole training set and evaluate it on the final holdout (test) data, which is never used for any modeling decision.
What is Sequential Forward Selection (SFS)?
What is Sequential Backward Elimination?
Sequential forward selection (SFS) uses a nested holdout procedure to first pick the best individual feature by evaluating all models built with just one feature. After choosing the first feature, SFS tests all models that add a second feature, and so on, adding one feature at a time. The process ends when adding a feature no longer improves classification accuracy.
Sequential backward elimination works in the opposite direction: it starts with all features and discards them one at a time, continuing as long as there is no loss in performance.
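The SFS loop can be sketched generically; `score` stands for any evaluation of a feature subset (e.g., nested-holdout accuracy), and the toy scorer below is made up purely for illustration:

```python
# Sequential forward selection: greedily add the feature that most
# improves the score; stop when no addition improves it.
def sfs(features, score):
    selected, best = [], score([])
    while True:
        candidates = [(score(selected + [f]), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max(candidates)
        if top_score <= best:   # adding a feature no longer helps
            break
        selected.append(top_feature)
        best = top_score
    return selected

# Toy scorer: only features "a" and "c" carry signal.
useful = {"a": 0.1, "c": 0.2}
toy_score = lambda subset: 0.5 + sum(useful.get(f, 0.0) for f in subset)

print(sfs(["a", "b", "c", "d"], toy_score))  # -> ['c', 'a']
```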
What is nested cross-validation?
Nested cross-validation → for each fold of the outer cross-validation, a separate inner cross-validation is run first to find the complexity parameter C.
Before building the model for each outer fold, an entire cross-validation is run on just that fold's training set to find the value of C that gives the best accuracy.
That value of C is then used to build a model on the entire training fold, which is tested on the corresponding test fold.
The only difference from regular cross-validation is that for each fold we first run this experiment to find C, using another, smaller, cross-validation.
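The whole procedure can be sketched compactly. The "model" below is a toy threshold rule (predict 1 when x > C) with no fitting step, so the inner loop reduces to scoring each candidate C on held-out inner folds; the data and candidate values are invented for illustration.

```python
import statistics

def folds(data, k):
    return [data[i::k] for i in range(k)]

def accuracy(c, data):
    return sum((x > c) == y for x, y in data) / len(data)

def pick_c(train, candidates, k=3):
    # Inner cross-validation: average each candidate's accuracy over
    # the inner folds and keep the best one.
    inner = folds(train, k)
    cv_acc = lambda c: statistics.mean(accuracy(c, f) for f in inner)
    return max(candidates, key=cv_acc)

data = [(x / 10, x / 10 > 0.5) for x in range(10)]
candidates = [0.2, 0.5, 0.8]
outer = folds(data, 5)
scores = []
for i in range(5):
    test = outer[i]
    train = [p for j, f in enumerate(outer) if j != i for p in f]
    c = pick_c(train, candidates)     # inner CV chooses C
    scores.append(accuracy(c, test))  # outer fold evaluates that choice

print(statistics.mean(scores))  # -> 1.0 (C=0.5 separates this toy data)
```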
Why do decision trees overfit? Why do logistic regression and SVMs overfit? How can you counteract overfitting?
If unbounded, a decision tree can reach 100% training accuracy by continuing to make its leaves more specific, basing them on ever smaller subsets of the data, and thereby learning dataset-specific patterns that don't generalize across the population.
Logistic regression finds a separating hyperplane if there is one; with too many degrees of freedom (dimensions/attributes) it can easily fit the training data exactly. An SVM overfits for the same reason, when it has too many free variables/dimensions.
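The "too many degrees of freedom" point can be demonstrated numerically: with more dimensions than training points, a linear model can fit even purely random labels perfectly, which is memorization rather than learning (the sizes and seed below are arbitrary).

```python
import numpy as np

# 10 points in 20 dimensions: a linear model has enough freedom to
# hit random labels exactly on the training set.
rng = np.random.default_rng(0)
n, d = 10, 20
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)   # labels carry no real signal

w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_acc = np.mean(np.sign(X @ w) == y)
print(train_acc)  # -> 1.0, despite the labels being noise
```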
You can counteract overfitting by reducing model complexity, or by increasing complexity while measuring accuracy on both a training set and a holdout set and stopping when the two start to diverge. For trees, specify a minimum number of instances that must be present in a leaf; a second strategy for trees is 'pruning': cutting off leaves and branches and replacing them with leaves. You can also use hypothesis testing to check whether an observed improvement could be due to chance. More generally: culling feature sets, and regularisation.
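Regularisation can be sketched with ridge regression, solved directly via the normal equations (the data and penalty values are made up; a real workflow would tune the penalty with nested cross-validation as above):

```python
import numpy as np

# Ridge regression shrinks coefficients, trading a little training fit
# for stability and less capacity to overfit.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))
y = X[:, 0] + rng.normal(0, 0.1, 30)  # only the first feature matters

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_weak = ridge(X, y, 0.01)
w_strong = ridge(X, y, 10.0)

# A stronger penalty yields smaller coefficients overall.
print(np.linalg.norm(w_strong) < np.linalg.norm(w_weak))  # -> True
```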
In k-fold cross-validation, what is the maximum number of folds k that can be used? In what cases would this strategy be useful?
The maximum number of folds equals the number of observations; this is called leave-one-out cross-validation (LOOCV). LOOCV gives an approximately unbiased estimate of the test error, but the estimate has high variance, since each test set is a single observation. It is useful for smaller datasets, when data are naturally limited or too costly to gather, such as for rare medical conditions.
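LOOCV is just k-fold splitting with k equal to the dataset size, which a short sketch makes concrete:

```python
# Leave-one-out cross-validation: every observation is the test set
# exactly once.
def loocv_splits(data):
    for i in range(len(data)):
        test = [data[i]]
        train = data[:i] + data[i + 1:]
        yield train, test

data = [3, 1, 4, 1, 5]
splits = list(loocv_splits(data))
print(len(splits))                          # -> 5, one fold per observation
print(all(len(t) == 1 for _, t in splits))  # -> True
```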
How can cross validation and fitting graphs help in checking for over-fitting?
Cross-validation provides a better estimate of generalisation accuracy; fitting graphs help you find the sweet spot in terms of complexity.
Combined, using cross-validation to compute the accuracies plotted in your fitting graph, you can be more confident that the conclusion is valid. Doing so yields a fitting graph with a confidence interval around each point.
Explain in detail the steps to build a fitting graph for a decision tree and for a logistic regression classification model.
For a decision tree, parametrise the complexity of the tree as the number of nodes or the depth of the tree. For each complexity value: build a tree limited to that complexity on the training data, measure its accuracy on both the training set and a holdout set, and plot both accuracies against complexity. Repeat with increasing complexity until overfitting is clearly visible (training accuracy keeps rising while holdout accuracy falls).
For logistic regression the same procedure applies, except complexity is parametrised by the number of variables the model is allowed to use rather than by depth or node count.
Sketch a learning curve for a decision tree with holdout and training.
Both curves are plotted against the amount of training data: holdout accuracy starts low and rises steeply, then flattens out with diminishing returns; training accuracy starts near 100% on tiny training sets (which are easy to memorize) and declines toward the holdout curve as the data grow. The gap between the two curves shrinks with more data.
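A deterministic toy version of the holdout curve (the "model" learns a 1-D decision threshold; all numbers are invented for illustration):

```python
# Predict class 1 when x exceeds the largest x seen for class 0; more
# training data pushes that threshold toward the true boundary at 0.5.
train = [(0.05, 0), (0.95, 1), (0.3, 0), (0.7, 1), (0.45, 0), (0.55, 1)]
holdout = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

curve = []
for n in (2, 4, 6):                  # increasing training-set sizes
    seen = train[:n]
    thr = max(x for x, y in seen if y == 0)
    acc = sum((x > thr) == bool(y) for x, y in holdout) / len(holdout)
    curve.append(acc)

print(curve)  # rises with training size, then flattens at 1.0
```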
In Session 4’s Assignment:
If you were to create a new classification tree based on information from Table 2,
could you then test/validate with the data on Table 1? Explain your reasoning.
Hint: check the ratio of digital subscriptions vs non-digital subscriptions.
No; the class proportions with respect to the target variable differ significantly between the two tables, so a model trained on Table 2 would be validated against data with a different base rate.