These are exam preparation notes, subpar in quality and certainly not of divine quality.

See the index with all articles in this series

Choice of error function

Usually squared error is used.


\(c\) := different classes (classification / symbol representation)

From Wikipedia:

In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution q, rather than the “true” distribution p.

The delta between Cross-Entropy and True entropy can be measured by The KL-divergence

We have the problem that we do not know the true distribution, and also not \(P(y_T|\underline x)\) . (After all that is what we try to figure out in the first place)

\[E^G \equiv -\sum^c_{k=1}\int{d\underline x}\underset{unknown}{P_{(\underline x )}P_{C_k|\underline x )}} ln( P_{C_k|\underline x ; \underline w)})\]

Our expectation is that the following is true:

\[\hat{E}^G = E^T = \frac{1}{q}\sum^q_{\beta=1}e^{(\beta)}\]

where \(e^{(\beta)}\) is already the squared error.

Empirical Risk Minimization (ERM)

Based on what we learned from Cross-Entropy, to reduce \(E_G\), we need to reduce \(E_T\).


If your model complexity exceeds the complexity of the data, the only thing you start to fit to is the noise in the data.

In general you want to minimize the generalization error \(E_G\) which is usually approximated using a test error \(E_T\). In order to not overfit, you can have a separate validation dataset. or do things like cross validation.

Overfitting <–> Underfitting

Cross Validation

Cross Validation is usually done in an n-fold way. That means that the data is dividided into random similarly sized buckets. The training is then done on all but one buckets (e.g. 9/10) and the testing on the remaining one (1/10). This is done for all combinations.

Averaging the 10 errors, it is possible to estimate the generalization error without leaving training data out. The final model is then often trained on the entire data set.

Aka Resampling.

Bias Variance Trade-Off

Good Explanation of concept

The data points are the underlying model + noise.

\[y_T = y^*_{(\underline x)} + \eta\] \[Err(x)=\text{Bias}^2+\text{Variance}+\text{Irreducible Error}\]

Stochastic approximation Online learning / Gradient Descent

\[\Delta w^{v'v}_{ij} = -\eta \frac{\partial E ^T _{[\underline w]}}{\partial w ^{v'v}_{ij}} = -\eta \frac{1}{p}\sum^p_{\alpha=1} {\frac{\partial e ^{(\alpha)} _{[\underline w]}}{\partial w ^{v'v}_{ij}}}\]

Where \(e^{(\alpha)}\) is the individual (squared) error for the data point \(x^{(\alpha)}\).

Often gradient descent is done in mini batches.


Adapt step size: Big step size, when error decreases, small step size when error increases.


\[R_{[\underline w]}=E^T _{[\underline w]} + \lambda E^R_{[\underline w]} \overset{!}{=} min\]

The \(E^R\) brings some prior knowledge into the solution.

To paraphrase this, the \(E^R\) pulls the solution into a direction that will likely reduce regularization error (or otherwise reduce the risk of errors)


  • Weight decay always subtract a part of your \(\lambda w\) from your w
  • Different Norms (0.1,.5,1,2,4) General form: \(E^R=\sum_{ijv'v} | w_{ij}^{v'v}|^q\)

Parametric vs. Nonparametric Algorithms

Adapted from Parametric Algorithms have a fixed number of parameters (and therefore model complexity). Examples:

Logistic Regression
Linear Discriminant Analysis
Naive Bayes
Simple Neural Networks

Nonparametric algorithms make few assumptions about the underlying data. Examples:

k-Nearest Neighbors
Decision Trees like CART and C4.5
Support Vector Machines