Neural Networks

These are exam preparation notes. They are rough and may contain errors.

See the index with all articles in this series

Connectionist Neurons

A neural network generally has a number of inputs \(x_1 \dots x_N\), which are collected into the vector \(\underline x\). Each node \(i\) has a transfer function \(f\) that turns the inputs, weighted by \(\underline w\), into the node's own output \(y_i\).

A typical function would look like this. The part in the brackets is later referred to as \(h\):

\[y_i = f\left(\sum_j w_{ij}x_j-\theta_i\right)\]

Some of the functions being used for \(f\):

Logistic Function:

\[f(h)=\frac{1}{1+\exp(-\beta h)}\]

Hyperbolic tangent:

\[f(h)=\tanh(\beta h)\]

Linear neuron:

\[f(h)=\beta h\]

Binary neuron:

\[f(h)=\operatorname{sign}(h)\]

The \(\beta\) is a slope parameter.
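The transfer functions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example weights are made up for this sketch, and mapping \(sign(0)\) to \(+1\) is an arbitrary choice.

```python
import math

def logistic(h, beta=1.0):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-beta * h))

def tanh_unit(h, beta=1.0):
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(beta * h)

def linear(h, beta=1.0):
    """Linear neuron."""
    return beta * h

def binary(h):
    """Binary neuron sign(h); sign(0) is mapped to +1 here (arbitrary choice)."""
    return 1.0 if h >= 0 else -1.0

def neuron_output(weights, inputs, theta, f=logistic):
    """y_i = f(sum_j w_ij * x_j - theta_i)."""
    h = sum(w * x for w, x in zip(weights, inputs)) - theta
    return f(h)

# Example: h = 0.5 * 2.0 + (-1.0) * 1.0 - 0.0 = 0, and logistic(0) = 0.5
y = neuron_output([0.5, -1.0], [2.0, 1.0], theta=0.0)
```

Passing a different `f` (for example `tanh_unit` or `binary`) changes only the last step, the weighted sum \(h\) stays the same.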

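Transformation between the logistic function and the hyperbolic tangent:

\[\frac{1}{1+\exp(-\beta h)} = \frac{1}{2}\left(\tanh\left(\frac{\beta h}{2}\right) + 1\right)\]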

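The threshold can be written as a bias node: the first input is always 1, and its weight takes over the role of \(-\theta_i\). The bias is not always written out in equations as part of \(\underline w\) or \(\underline x\). The bias node could possibly be seen as the length of \(\underline w\) (unverified).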

Reasons for nonlinear transfer functions:

multiple layers with linear transfer functions could be expressed as a single layer (main reason)
sign function for classification problems (two classes)
logistic sigmoid for probabilities (0..1)
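The main reason in the list above can be checked numerically. The following sketch uses two made-up weight matrices `W1` and `W2` (with \(\beta = 1\) and no thresholds, both simplifying assumptions) and shows that two linear layers give the same output as a single layer whose weight matrix is the product of the two.

```python
def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # layer 1: R^2 -> R^3
W2 = [[1.0, 0.5, -2.0]]                      # layer 2: R^3 -> R^1
x = [0.5, -1.5]

two_layers = matvec(W2, matvec(W1, x))       # apply the layers one after another
one_layer = matvec(matmul(W2, W1), x)        # apply the single combined layer
```

Both `two_layers` and `one_layer` hold the same value, so the extra linear layer adds no expressive power.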

Important variables:

\(\theta\): threshold
\(v \text{ and } v'\): layer indices

Types of Neural Networks

Recurrent Neural Networks: there can be loops in the graph
Feedforward Neural Networks (DAG): no loops
Radial Basis Function Networks

Typical use case: prediction of attributes. MLPs are universal approximators.

Always from \(\mathbb{R}^N\) to anything, really.

NN in Regression

The Hessian matrix holds the second derivatives of a function \(\mathbb{R}^N\mapsto \mathbb{R}\).

The Jacobian matrix holds the first derivatives of a function \(\mathbb{R}^{N}\mapsto \mathbb{R}^M\).

The Hessian is often too expensive to compute, and therefore gradient descent (with gradients obtained by backpropagation) is often used instead of Newton’s method.

Generalization Error

ERM

Training Error

The training error is reduced using gradient descent.

\[w_{ij}^{v'v}(t+1)=w_{ij}^{v'v}(t) - \hat\eta \frac{\partial E^T}{\partial w_{ij}^{v'v}}\]

where

\[\frac{\partial E^T}{\partial w_{ij}^{v'v}} = \frac{1}{p}\sum^p_{\alpha=1}{\frac{\partial e^{(\alpha)}_{[\underline w]}}{\partial w_{ij}^{v'v}}}\]

The error is usually the quadratic error:

\[e(y_T,\underline x) = \frac{1}{2}(y_T-y(\underline x))^2\]

The derivative is trivially:

\[\frac{\partial e^{(\alpha)}}{\partial y(\underline x ^{(\alpha)}; \underline w )} = y(\underline x^{(\alpha)}; \underline w)-y_T^{(\alpha)}\]

and is later used in backpropagation.
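As a rough illustration of the update rule, here is gradient descent for a single linear neuron with quadratic error. Everything concrete in it is an assumption made for the sketch: the toy data, the learning rate `eta`, the number of steps, and the choice to model the threshold as a weight on a constant input of 1.

```python
def predict(w, x):
    """Linear neuron: y = sum_j w_j * x_j."""
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, data):
    """Mean quadratic error over all p training points."""
    return sum(0.5 * (y_t - predict(w, x)) ** 2 for x, y_t in data) / len(data)

def gradient(w, data):
    """Average over all points of de/dw_j = (y - y_T) * x_j."""
    grad = [0.0] * len(w)
    for x, y_t in data:
        de_dy = predict(w, x) - y_t          # derivative of the quadratic error
        for j, xj in enumerate(x):
            grad[j] += de_dy * xj / len(data)
    return grad

def gradient_descent(w, data, eta=0.1, steps=500):
    """Repeat w(t+1) = w(t) - eta * dE/dw."""
    for _ in range(steps):
        g = gradient(w, data)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Toy data generated from y = 2 * x1 - 1; the second input is the constant bias input.
data = [([x1, 1.0], 2.0 * x1 - 1.0) for x1 in (-1.0, 0.0, 1.0, 2.0)]
w = gradient_descent([0.0, 0.0], data)
```

On this toy problem `w` approaches the generating weights (2, -1). For a multi-layer network only the computation of the gradient changes, the update step stays the same.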

Backpropagation

In backpropagation the weights of the neural network are adjusted so that the training error is reduced. This is achieved by:

Calculating the prediction
Calculating the training error
Going back layer by layer and calculating the delta each time

It would be possible to compute each partial derivative separately by applying the chain rule to every weight. But that is a lot more computationally expensive than backpropagation, which reuses the deltas already computed for the later layers.
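The three steps can be sketched for a very small network. The architecture here is an assumption chosen for brevity: one hidden layer with tanh units, a single linear output neuron, no thresholds, and quadratic error. The weights and the training point are made up. A finite-difference check at the end compares one backpropagated gradient entry against a numerical estimate.

```python
import math

def forward(W1, w2, x):
    """Return hidden outputs s and the network output y."""
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]   # hidden pre-activations
    s = [math.tanh(hk) for hk in h]                            # hidden outputs
    y = sum(w * sk for w, sk in zip(w2, s))                    # linear output neuron
    return s, y

def error(W1, w2, x, y_t):
    """Quadratic error for one training point."""
    return 0.5 * (y_t - forward(W1, w2, x)[1]) ** 2

def backprop(W1, w2, x, y_t):
    s, y = forward(W1, w2, x)                 # 1. calculate the prediction
    delta_out = y - y_t                       # 2. error derivative de/dy
    grad_w2 = [delta_out * sk for sk in s]
    # 3. go back one layer; tanh'(h) = 1 - tanh(h)^2
    delta_hidden = [delta_out * w * (1.0 - sk ** 2) for w, sk in zip(w2, s)]
    grad_W1 = [[dk * xi for xi in x] for dk in delta_hidden]
    return grad_W1, grad_w2

W1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [0.5, -0.6]
x, y_t = [1.0, 2.0], 0.25
grad_W1, grad_w2 = backprop(W1, w2, x, y_t)

# Numerical check of dE/dW1[0][1] with central differences.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (error(W1_plus, w2, x, y_t) - error(W1_minus, w2, x, y_t)) / (2 * eps)
```

Note that `delta_out` is computed once and reused for every hidden unit. That reuse is what makes backpropagation cheaper than differentiating each weight from scratch.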

Regularization in Deep Learning

Dropout randomly ignores neurons during training.
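A minimal sketch of dropout on one layer's activations follows. The rescaling of the surviving activations by \(1/(1-p)\) (often called inverted dropout) is an assumption of this sketch and is not stated in these notes; the function name and parameters are made up as well.

```python
import random

def dropout(activations, p_drop=0.5, rng=random):
    """Set each activation to 0 with probability p_drop.

    Survivors are divided by (1 - p_drop) so that the expected value of each
    activation stays unchanged (assumption: inverted dropout).
    """
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                 # seeded generator for reproducibility
hidden = [0.2, -1.0, 0.7, 1.5]
dropped = dropout(hidden, p_drop=0.5, rng=rng)
```

At test time the layer would be used without dropping anything, which here corresponds to `p_drop=0.0`.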

Architectures

Convolutional layer

A layer in which each neuron is only connected to selected neurons of the previous layer. For example, in image recognition each neuron can be connected to only a few adjacent pixels of the previous layer (a tensor).
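The local connectivity can be illustrated in one dimension. In this sketch every output neuron sees only `len(kernel)` adjacent inputs. Sharing the same weights across all positions is an additional assumption here (it is the usual convention for convolutional layers, but it is not stated above), and the signal and kernel values are made up.

```python
def conv1d(signal, kernel):
    """Each output is a weighted sum over a small window of adjacent inputs.

    Strictly speaking this is a cross-correlation (the kernel is not flipped).
    """
    k = len(kernel)
    return [sum(w * signal[i + j] for j, w in enumerate(kernel))
            for i in range(len(signal) - k + 1)]

# A kernel that responds to changes between neighbouring values.
edges = conv1d([0, 0, 1, 1, 1, 0], [-1, 1])
```

The result is nonzero only where neighbouring input values differ, that is at the rising and the falling edge of the signal.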

Spatial/Feature pooling

The goal is to detect features in an image even though the image may be rotated, translated, etc. For example, there are three different detection units for a specific pattern, one per variant, and their outputs are aggregated by a neuron with a max() function, so that the feature is recognized whichever unit matches its orientation.
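The aggregation with max() can be sketched as follows. The three templates are hypothetical detectors for variants of one pattern on a flattened 2x2 patch, and the dot product as detector response is an assumption of this sketch.

```python
def detector_response(patch, template):
    """Response of one detection unit: dot product of patch and template."""
    return sum(p * t for p, t in zip(patch, template))

def pooled_response(patch, templates):
    """Pooling neuron: maximum over all detection units."""
    return max(detector_response(patch, t) for t in templates)

# Hypothetical detectors for the same bar pattern in three positions/orientations.
templates = [
    [1, 1, 0, 0],   # bar in the top row
    [1, 0, 1, 0],   # bar in the left column
    [0, 1, 0, 1],   # bar in the right column
]

right_bar = [0, 1, 0, 1]
top_bar = [1, 1, 0, 0]
```

`pooled_response(right_bar, templates)` and `pooled_response(top_bar, templates)` give the same value, although different detection units fire. That is the invariance the pooling neuron provides.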

Auto-Encoders

Unfortunately excluded from the exam, therefore neglected here

Basically you take an image of what you want to recognize and push it through your network. What you get is a “compressed” version of the image (there is a lot less information in the final layers). At the beginning of training this will be just noise. You then have a second network (the decoder) reconstruct the original image from the compressed version.

It is then possible to compare the reconstruction to the original image and generate error values from the difference.

You can thereby train the two networks to meaningfully abstract from images without needing labelled images.

Time Series

In a time series it is often assumed that \(y\) depends only on a short time window. Therefore convolutions are used, in which some neurons can look “back” in time.

Recurrent NN

The neural network is “shifted” through time. All previous inputs are summarized in a state vector, and the weights \(\underline W\) contain the mapping of this state onto itself.

Cost function, with \(n_\alpha\) the number of timesteps of sequence \(\alpha\):

\[E^T = \frac{1}{p}\sum^p_{\alpha = 1}\left(\frac{1}{n_\alpha}\sum^{n_\alpha}_{t=1}e^{(\alpha,t)}\right)\]

The weights \(\underline W\) measure how much of the previous state is carried over into the next timestep.
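A rough sketch of such a network and its cost function is given below. The concrete form is an assumption: a tanh state update \(h_t = \tanh(W h_{t-1} + U x_t)\) with scalar inputs, a linear readout with hypothetical weights `v`, an initial state of zero, and quadratic error per timestep.

```python
import math

def rnn_forward(W, U, v, xs):
    """Run the network over a sequence of scalar inputs; return one output per timestep."""
    state = [0.0] * len(W)
    outputs = []
    for x in xs:
        # The new state depends on the previous state (via W) and the current input (via U).
        state = [math.tanh(sum(W[i][j] * state[j] for j in range(len(state))) + U[i] * x)
                 for i in range(len(W))]
        outputs.append(sum(vi * si for vi, si in zip(v, state)))
    return outputs

def sequence_cost(W, U, v, sequences):
    """E^T = 1/p * sum_alpha (1/n_alpha * sum_t e^(alpha,t)) with quadratic error."""
    total = 0.0
    for xs, targets in sequences:
        ys = rnn_forward(W, U, v, xs)
        total += sum(0.5 * (t - y) ** 2 for t, y in zip(targets, ys)) / len(xs)
    return total / len(sequences)

# Tiny example with a one-dimensional state.
W, U, v = [[0.5]], [1.0], [1.0]
outputs = rnn_forward(W, U, v, [1.0, 0.0, 0.0])
```

After the first timestep the input is zero, so the remaining outputs are driven only by the state that `W` carries forward. Note that the same `W` is applied at every timestep in this sketch.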

TODO: Are the weights \(\underline W\) different for each timestep?

Backpropagation through time

Works just like regular backpropagation:

Assume all \(\underline W^{(t)}\) are independent
Compute the gradients with backpropagation
Average all computed gradients for the weight update

\[\Delta \underline W = -\eta \frac{\partial E^T}{\partial \underline W}\]

Exploding / Vanishing gradient

\[\underline W = \underline U \underline \Lambda \underline U ^\intercal\]

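One problem of RNNs is that activity often either vanishes or explodes over time: applying \(\underline W\) repeatedly scales each component by a power of the corresponding eigenvalue \(\lambda_i\) in \(\underline \Lambda\), so it vanishes for \(|\lambda_i| < 1\) and explodes for \(|\lambda_i| > 1\).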
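With the decomposition \(\underline W = \underline U \underline \Lambda \underline U^\intercal\), applying \(\underline W\) for \(t\) timesteps gives \(\underline U \underline \Lambda^t \underline U^\intercal\), so each component is scaled by the \(t\)-th power of an eigenvalue. The sketch below looks only at this scalar factor, for a linear network and three made-up eigenvalues.

```python
def repeated_scaling(eigenvalue, steps):
    """Factor by which a component along one eigenvector is scaled after `steps` timesteps."""
    return eigenvalue ** steps

# Scaling factors after 100 timesteps for eigenvalues below, at, and above 1.
factors = {lam: repeated_scaling(lam, 100) for lam in (0.9, 1.0, 1.1)}
```

The factor for 0.9 is close to zero (vanishing), the factor for 1.1 is very large (exploding), and only an eigenvalue of magnitude 1 keeps the component stable.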

Echo State Networks

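Echo state networks set \(\underline W\) and \(\underline U\) so that the magnitude of the largest eigenvalue \(|\lambda_i|\) (the spectral radius) is almost equal to a chosen value \(r\). (TODO: why is \(r\) in the range 1.3 to 3?)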

Leaky Units:

There are units that specialize in long or short term memory. This depends on a factor \(\alpha\)
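One common way to write such a unit is as a running average, \(h_t = \alpha h_{t-1} + (1-\alpha) x_t\). This exact form is an assumption of the sketch below and is not taken from these notes. With it, \(\alpha\) close to 1 gives long memory and \(\alpha\) close to 0 gives short memory.

```python
def leaky_unit(inputs, alpha, state=0.0):
    """Leaky integration: h_t = alpha * h_(t-1) + (1 - alpha) * x_t (assumed form)."""
    trace = []
    for x in inputs:
        state = alpha * state + (1.0 - alpha) * x
        trace.append(state)
    return trace

# A single impulse followed by silence.
impulse = [1.0] + [0.0] * 9
long_memory = leaky_unit(impulse, alpha=0.9)    # forgets slowly
short_memory = leaky_unit(impulse, alpha=0.1)   # forgets quickly
```

At the last timestep the unit with \(\alpha = 0.9\) still holds a noticeable trace of the impulse, while the unit with \(\alpha = 0.1\) has almost completely forgotten it.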

LSTM

Delay update of hidden layer
Special transfer function (only retrieve state in certain cases)

Radial Basis Function Networks

Also see Wikipedia.

3-layered Radial basis function network

A radial basis function is a function that depends only on the distance from its center (usually the Euclidean distance).

\[\phi_i(\underline x) = \tilde{\phi}_i(D[\underline x, \underline t_i])\]

Gaussian function often used:

\[\phi_i(\underline x) = \exp\left(-\frac{||\underline x -\underline t_i||^2}{2\sigma^2_i}\right)\]
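The Gaussian basis function translates directly into code. The function and parameter names are chosen for this sketch.

```python
import math

def gaussian_rbf(x, center, sigma):
    """phi_i(x) = exp(-||x - t_i||^2 / (2 * sigma_i^2))."""
    sq_dist = sum((xi - ti) ** 2 for xi, ti in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

at_center = gaussian_rbf([1.0, 2.0], [1.0, 2.0], sigma=0.5)    # distance 0
far_away = gaussian_rbf([10.0, 10.0], [1.0, 2.0], sigma=0.5)   # large distance
```

The response is 1 at the center and drops towards 0 with growing distance. Two points at the same distance from the center always give the same value, which is what makes the function radial.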

Learning with RBFs

Three different parameters:

\(\underline t_i\) centroid (center of basis function)
\(\sigma_i\) range of influence
\(w_i\) weights of the output layer

A 2-step learning procedure is an alternative to learning all parameters jointly:

Find centroids and variances \(\sigma_i\)
Determine output weights \(\underline w\)

Find centroids and variances

Use k-means clustering to find centroids

Choose \(\sigma_i\) so that it is about double the distance from centroid \(\underline t_i\) to its closest neighbouring centroid.

\[\sigma_i= \lambda \min_{j\neq i} ||\underline t _i -\underline t_j||, \quad \lambda \approx 2\]
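The first step can be sketched with a plain k-means (Lloyd's algorithm) followed by the width rule. The data points, the initial centroids and the fixed number of iterations are all assumptions made for this illustration.

```python
import math

def dist(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_means(points, centroids, iterations=20):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid.
        centroids = [
            [sum(coord) / len(cluster) for coord in zip(*cluster)] if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids

def rbf_widths(centroids, lam=2.0):
    """sigma_i = lam * min_{j != i} ||t_i - t_j||."""
    return [lam * min(dist(t_i, t_j) for j, t_j in enumerate(centroids) if j != i)
            for i, t_i in enumerate(centroids)]

points = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
centroids = k_means(points, centroids=[[0.0, 0.0], [4.0, 0.0], [10.0, 0.0]])
sigmas = rbf_widths(centroids)
```

The isolated centroid on the right gets a larger \(\sigma_i\) than the two centroids that lie close together, so its basis function covers more of the empty space around it.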

Determine output weights

Output weights are found by minimizing the quadratic error, with \(M\) the number of RBFs:

\[E^T = \frac{1}{2p}\sum^p_{\alpha=1}\left(y_T^{(\alpha)} -\sum^M_{i=1} w_i\phi_i(\underline x^{(\alpha)})\right)^2\]

Pseudo-inverse

\[(\underline \Phi^T \underline \Phi ) \underline w = \underline \Phi ^T \underline y _T \implies \underline w = (\underline \Phi ^T \underline \Phi )^{-1} \underline \Phi ^T \underline y _T\]
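The normal equations can be solved directly for small problems. The sketch below builds \(\underline \Phi^T \underline \Phi\) and \(\underline \Phi^T \underline y_T\) and solves the linear system by Gaussian elimination instead of forming the inverse explicitly. The design matrix `Phi` (with `Phi[alpha][i]` standing for \(\phi_i(\underline x^{(\alpha)})\)) and the targets are made-up values, and the sketch assumes that \(\underline \Phi^T \underline \Phi\) is invertible.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]          # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):                          # back substitution
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

def output_weights(Phi, y_T):
    """Solve the normal equations (Phi^T Phi) w = Phi^T y_T."""
    cols = list(zip(*Phi))
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * y for a, y in zip(ci, y_T)) for ci in cols]
    return solve(A, rhs)

# p = 3 training points, M = 2 basis functions; targets generated from w = (2, -1).
Phi = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
y_T = [2.0, 0.5, -1.0]
w = output_weights(Phi, y_T)
```

Because the error is quadratic in \(\underline w\), this direct solution gives the minimum in one step. Gradient descent on the same error would approach the same weights iteratively.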

TODO: Do we now use Gradient Descent or invertible matrix?

MLP vs RBF

RBFs converge fast: only a few parameters need to be changed per training point, because each basis function has negligible influence on far-away points.

RBFs suffer from the curse of dimensionality: they need \(n^d\) basis functions (\(n\): number of data points along one dimension, \(d\): number of dimensions).

RBFs are kernel functions that make it possible to map non-linear data into a space in which it becomes linear, and then do regression there.

RBFs are useful for low-dimensional data.