These are exam preparation notes, subpar in quality and certainly not of divine quality.
Choice of error function
Usually squared error is used.
:= different classes (classification / symbol representation)
In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution q, rather than the “true” distribution p.
The delta between Cross-Entropy and True entropy can be measured by The KL-divergence
We have the problem that we do not know the true distribution, and also not . (After all that is what we try to figure out in the first place)
Our expectation is that the following is true:
where is already the squared error.
Empirical Risk Minimization (ERM)
Based on what we learned from Cross-Entropy, to reduce , we need to reduce .
If your model complexity exceeds the complexity of the data, the only thing you start to fit to is the noise in the data.
In general you want to minimize the generalization error which is usually approximated using a test error . In order to not overfit, you can have a separate validation dataset. or do things like cross validation.
Overfitting <–> Underfitting
Cross Validation is usually done in an n-fold way. That means that the data is dividided into random similarly sized buckets. The training is then done on all but one buckets (e.g. 9/10) and the testing on the remaining one (1/10). This is done for all combinations.
Averaging the 10 errors, it is possible to estimate the generalization error without leaving training data out. The final model is then often trained on the entire data set.
Bias Variance Trade-Off
The data points are the underlying model + noise.
Stochastic approximation Online learning / Gradient Descent
Where is the individual (squared) error for the data point .
Often gradient descent is done in mini batches.
Adapt step size: Big step size, when error decreases, small step size when error increases.
The brings some prior knowledge into the solution.
To paraphrase this, the pulls the solution into a direction that will likely reduce regularization error (or otherwise reduce the risk of errors)
- Weight decay always subtract a part of your from your w
- Different Norms (0.1,.5,1,2,4) General form:
Parametric vs. Nonparametric Algorithms
Adapted from Parametric Algorithms have a fixed number of parameters (and therefore model complexity). Examples:
Logistic Regression Linear Discriminant Analysis Perceptron Naive Bayes Simple Neural Networks
Nonparametric algorithms make few assumptions about the underlying data. Examples:
k-Nearest Neighbors Decision Trees like CART and C4.5 Support Vector Machines