How can learning be viewed as Optimisation?
Learning can be framed as searching for the model parameters (e.g. regression weights) that minimise an error function over the training data.
How to decompose errors into bias and variance?
Error = bias^2 + variance + noise
What is bias?
How far, on average, the model's predictions are from the true values; high bias means the model's assumptions are too restrictive (underfitting).
What is variance?
How sensitive the model is to fluctuations in the training data, i.e. how much its predictions change when trained on a different sample; high variance corresponds to overfitting.
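The decomposition above can be checked numerically. A minimal sketch (all values here are illustrative choices, not from the notes): fit a trivial model (the sample mean) to many independent training sets drawn from a known constant-plus-noise process, then compare bias² + variance + noise against the measured expected squared error on fresh noisy observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# True function is a constant; observations add Gaussian noise.
true_y = 2.0
noise_sd = 0.5
n_train, n_trials = 10, 20000

# Model: predict the mean of a small training sample.
preds = np.array([
    rng.normal(true_y, noise_sd, n_train).mean()
    for _ in range(n_trials)
])

bias_sq = (preds.mean() - true_y) ** 2   # ~0: the sample mean is unbiased
variance = preds.var()                   # ~noise_sd**2 / n_train
noise = noise_sd ** 2                    # irreducible error

# Expected squared error on a fresh noisy observation:
test_y = rng.normal(true_y, noise_sd, n_trials)
mse = ((preds - test_y) ** 2).mean()

print(bias_sq + variance + noise, mse)   # the two should be close
```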
How to reduce overfitting?
Dampen the model's complexity, e.g. by regularisation, which smooths out the learned function.
What is L1 Regularisation?
L1 weight regularisation penalises weight values by adding the sum of their absolute values to the error term
L1 regularisation encourages solutions where many parameters are zero
e.g. Lasso algorithm
What is L2 Regularisation?
L2 weight regularisation penalises weight values by adding the sum of their squared values to the error term
L2 regularisation encourages solutions where most parameter values are small.
e.g. Ridge regression (linear regression with an L2 penalty)
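A sketch of both penalties on toy data (data, regularisation strength, and iteration counts are all hypothetical choices): ridge has a closed-form solution, while the L1 fit here uses proximal gradient descent (ISTA) with soft-thresholding, a standard way to solve the Lasso. Only the first feature actually matters, so L1 should zero out the rest while L2 merely shrinks them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression: only the first of five features matters.
n, d = 200, 5
X = rng.normal(size=(n, d))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

lam = 5.0  # regularisation strength (illustrative value)

# L2 (ridge): closed form; shrinks all weights towards zero.
w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# L1 (lasso): proximal gradient descent (ISTA) with soft-thresholding.
w_l1 = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2  # step size from the Lipschitz constant
for _ in range(2000):
    grad = X.T @ (X @ w_l1 - y)
    z = w_l1 - step * grad
    w_l1 = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print("ridge:", np.round(w_l2, 3))  # all small, none exactly zero
print("lasso:", np.round(w_l1, 3))  # irrelevant weights driven to (near) zero
```

This illustrates the notes' claim directly: the L1 solution is sparse, the L2 solution is merely small.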
Batch vs Stochastic Gradient Descent
Batch: the error gradient is evaluated over the entire data set at each iteration, then one weight update is made
Stochastic: a weight update is performed after each individual training instance
How to find the minimum error for regularisation/optimisation?
Use Gradient Descent to approximate the minimum iteratively, rather than calculating it analytically (which is often infeasible).
Why is it that parameter tuning might lead to overfitting?
Because hyper-parameters are chosen to minimise error on a particular validation set, repeated tuning can fit the quirks of that set rather than the underlying distribution, so validation performance stops reflecting generalisation.
What is the Gradient Descent method, and why is it important?
Gradient Descent is a mechanism for finding the minimum of a (convex) multivariate function where we can find its partial derivatives.
This is important because it allows us to determine the regression weights which minimise an error function over some training data set.
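A minimal sketch of the mechanism on a convex function whose minimum is known (the function, starting point, and learning rate are illustrative choices): repeatedly step against the gradient and converge to the analytic minimiser.

```python
import numpy as np

# Convex function f(w) = (w0 - 1)^2 + 2*(w1 + 3)^2, minimised at (1, -3).
def grad(w):
    """Vector of partial derivatives of f at w."""
    return np.array([2 * (w[0] - 1), 4 * (w[1] + 3)])

w = np.zeros(2)      # arbitrary starting point
lr = 0.1             # learning rate (illustrative choice)
for _ in range(500):
    w -= lr * grad(w)

print(w)  # → approximately [1, -3]
```

In regression the role of f is played by the (regularised) error over the training set, and w by the regression weights.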