These are exam preparation notes, subpar in quality and certainly not of divine quality.
A neural network generally has a number of inputs, which are aggregated into an output. At each node a transfer function turns the node's inputs, weighted by its weights, into the node's own output.
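A typical function would look like this, with inputs $x_i$, weights $w_i$ and transfer function $f$:
$$y = f\left(\sum_i w_i x_i\right)$$
The part in the brackets is later referred to as the activation $a$.
Some of the functions being used for $f$:
- linear: $f(a) = a$
- sign: $f(a) = \operatorname{sign}(a)$
- logistic sigmoid: $f(a) = \frac{1}{1 + e^{-\beta a}}$
Here $\beta$ is a slope parameter.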
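The first input is a bias node whose value is always 1, so its weight $w_0$ acts as the bias. It is not always written out in equations as $x_0$ or $w_0$. Geometrically, the bias shifts the decision boundary away from the origin: the hyperplane $w^\top x + w_0 = 0$ lies at distance $|w_0| / \lVert w \rVert$ from the origin.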
Reasons for nonlinear transfer functions:
- multiple layers with linear transfer functions could be expressed as a single layer (main reason)
- sign function for classification problems (0,1)
- logistic sigmoidal for probabilities (0..1)
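The main reason above can be checked numerically. The following is a minimal sketch, assuming NumPy and arbitrary layer sizes (3 inputs, 4 hidden units, 2 outputs, all chosen only for illustration): two layers with a linear transfer function give exactly the same output as one layer with the product of the weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with linear (identity) transfer functions; sizes are arbitrary.
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

two_layer_output = W2 @ (W1 @ x)

# The same mapping written as a single layer with a combined weight matrix.
W_combined = W2 @ W1
one_layer_output = W_combined @ x

print(np.allclose(two_layer_output, one_layer_output))

# With a nonlinear transfer function in between, no such single matrix exists in general.
nonlinear_output = W2 @ np.tanh(W1 @ x)
```

The extra layer therefore adds no expressive power unless a nonlinearity sits between the layers.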
Types of Neural Networks
- Recurrent Neural Networks (there can be loops in the graph)
- Feedforward Neural Networks (DAG)
- Radial Basis Function Networks
Typical use case: prediction of attributes. MLPs are universal approximators.
They always map from a vector of real-valued inputs to anything, really.
NN in Regression
The Hessian matrix is the matrix of second derivatives of the error function with respect to the weights.
The Jacobian matrix is the matrix of first derivatives.
The Hessian is often too expensive to compute, so gradient descent with backpropagation is usually used instead of Newton’s method.
The training error is reduced using gradient descent.
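The error is usually the quadratic error, with prediction $y$ and target $t$:
$$E = \frac{1}{2} \sum_k (y_k - t_k)^2$$
The derivative with respect to the prediction is trivially:
$$\frac{\partial E}{\partial y_k} = y_k - t_k$$
and is later used in backpropagation.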
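As a small sanity check, here is a sketch of the quadratic error and its derivative, assuming the common form $E = \frac{1}{2}\sum_k (y_k - t_k)^2$ with prediction `y` and target `t` (both names and the example values are assumptions for illustration). The analytic derivative is compared against a finite-difference estimate.

```python
import numpy as np

def quadratic_error(y, t):
    """E = 1/2 * sum_k (y_k - t_k)^2 for prediction y and target t."""
    return 0.5 * np.sum((y - t) ** 2)

def quadratic_error_grad(y, t):
    """dE/dy_k = y_k - t_k."""
    return y - t

y = np.array([0.8, 0.1, 0.3])   # made-up prediction
t = np.array([1.0, 0.0, 0.0])   # made-up target

analytic = quadratic_error_grad(y, t)

# Central finite differences, one output component at a time.
eps = 1e-6
numeric = np.array([
    (quadratic_error(y + eps * e, t) - quadratic_error(y - eps * e, t)) / (2 * eps)
    for e in np.eye(len(y))
])
```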
In backpropagation the weights of the neural network are adjusted so that the training error is reduced. This is achieved by:
- Calculating the prediction (forward pass)
- Calculating the error of the prediction
- Going back layer by layer and calculating the delta of each layer, which gives the weight updates
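It would be possible to compute the gradient for every weight separately by applying the chain rule from scratch each time. But that is a lot more computationally expensive than backpropagation, which reuses the deltas already computed for the later layers.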
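The steps above can be sketched for a tiny network. This is a minimal illustration, not the lecture's exact formulation: it assumes one hidden layer with a logistic sigmoid, a linear output layer, quadratic error, and made-up sizes, data and learning rate `eta`. One gradient entry is checked against a finite-difference estimate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    a1 = W1 @ x          # hidden activations (weighted sums)
    z1 = sigmoid(a1)     # hidden outputs
    y = W2 @ z1          # linear output layer
    return a1, z1, y

def error(x, t, W1, W2):
    y = forward(x, W1, W2)[2]
    return 0.5 * np.sum((y - t) ** 2)

def backprop(x, t, W1, W2):
    _, z1, y = forward(x, W1, W2)
    delta2 = y - t                               # delta of the output layer
    grad_W2 = np.outer(delta2, z1)
    delta1 = (W2.T @ delta2) * z1 * (1.0 - z1)   # delta passed back through the sigmoid
    grad_W1 = np.outer(delta1, x)
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.3])   # first entry plays the role of the bias input
t = np.array([0.2])
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(1, 4))

grad_W1, grad_W2 = backprop(x, t, W1, W2)

# Finite-difference check for a single weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (error(x, t, W1_plus, W2) - error(x, t, W1_minus, W2)) / (2 * eps)

# A few gradient descent steps on this single training example.
eta = 0.1
error_before = error(x, t, W1, W2)
for _ in range(50):
    g1, g2 = backprop(x, t, W1, W2)
    W1 -= eta * g1
    W2 -= eta * g2
error_after = error(x, t, W1, W2)
```

The finite-difference check needs two forward passes per weight, while backpropagation obtains all gradients from one forward and one backward pass.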
Regularization in Deep Learning
- Dropout randomly ignores neurons during training
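A minimal sketch of dropout, assuming the common "inverted dropout" variant in which the kept activations are rescaled so that their expected value stays the same. The rescaling, the function name and the parameters are assumptions and not taken from the notes.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training; do nothing at test time."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # True = neuron is kept
    return activations * mask / keep_prob              # rescale the kept neurons

rng = np.random.default_rng(0)
hidden = np.ones(10_000)
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```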
A layer that is only connected to selected neurons of the previous layer. For example, in image recognition each neuron can be connected only to a few adjacent pixels of the previous layer (a small tensor).
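A rough sketch of such a locally connected layer, assuming a 3x3 neighbourhood and the same weights at every position (weight sharing as in a convolution, which is an assumption beyond what the notes state). Each output neuron only sees its own small patch of the image.

```python
import numpy as np

def local_layer(image, kernel):
    """Each output neuron is connected only to one kernel-sized patch (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                     # averages a 3x3 neighbourhood
feature_map = local_layer(image, kernel)
```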
The goal is to detect features in an image even though the image may be rotated, translated, etc. There are then, for example, three different detection units for a specific pattern, whose outputs are aggregated by a neuron with a max() function, so that the feature is recognized in whichever orientation it appears.
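The max() aggregation can be illustrated with a toy example. This sketch assumes a hypothetical 2x2 pattern and uses four detection units (one per 90 degree rotation) instead of three; the pattern and the helper names are made up for illustration.

```python
import numpy as np

def detect(patch, template):
    """Response of one detection unit: overlap between patch and template."""
    return float(np.sum(patch * template))

# Hypothetical pattern: an L-shaped corner.
template = np.array([[1.0, 0.0],
                     [1.0, 1.0]])

# One detection unit per rotation of the pattern.
templates = [np.rot90(template, k) for k in range(4)]

def invariant_response(patch):
    """Aggregate all detection units with max()."""
    return max(detect(patch, tpl) for tpl in templates)

upright = template.copy()
rotated = np.rot90(template, 1)
```

A single unit responds more weakly to the rotated pattern, while the max over all units gives the same response for both orientations.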
Unfortunately excluded from the exam, therefore neglected here
Basically you take an image of what you want to recognize and push it through your network. What you get is a “compressed” version of the image (there is a lot less information in the final layers). At the beginning of your training this will be just noise / randomness. A second network (the decoder, usually trained jointly with the first one as a single network) then reconstructs the original image from the compressed version.
The reconstruction can then be compared to the original image to generate error values.
This way you can train both parts to abstract meaningfully from images without needing labelled images.
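The idea can be sketched with a deliberately simplified autoencoder. Assumptions: both parts are single linear layers, the "images" are random 8-dimensional vectors, and training is plain gradient descent on the quadratic reconstruction error. The sizes, the learning rate `eta` and the step count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # unlabelled toy data, 8 "pixels" each

W_enc = rng.normal(scale=0.1, size=(8, 3))    # encoder: 8 -> 3 (compression)
W_dec = rng.normal(scale=0.1, size=(3, 8))    # decoder: 3 -> 8 (reconstruction)

def reconstruction_error(X, W_enc, W_dec):
    X_hat = (X @ W_enc) @ W_dec
    return 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))

error_before = reconstruction_error(X, W_enc, W_dec)

eta = 0.05
for _ in range(500):
    code = X @ W_enc                    # compressed version
    X_hat = code @ W_dec                # reconstruction
    diff = (X_hat - X) / len(X)         # error signal, no labels needed
    grad_dec = code.T @ diff
    grad_enc = X.T @ (diff @ W_dec.T)
    W_dec -= eta * grad_dec
    W_enc -= eta * grad_enc

error_after = reconstruction_error(X, W_enc, W_dec)
```

The only training signal is the difference between input and reconstruction, which is why no labels are required.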
In a time series it is often assumed that y depends on a short time window. Therefore there are convolutions, where some neurons can look “back” in time.
The neural network is “shifted” through time. All previous inputs are summarized in a state vector, together with weights that map this vector onto itself.
$n$: number of timesteps
Cost function with:
There is a vector which contains the weights that measure how much of the previous input should be considered in the next timestep.
TODO: Are the weights different for each timestep?
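A minimal sketch of the forward pass, assuming the common formulation $h_t = \tanh(W h_{t-1} + U x_t)$, in which the same weights `W` and `U` are reused at every timestep. The tanh transfer function, the sizes and the name `h` for the state are assumptions for illustration.

```python
import numpy as np

def rnn_forward(inputs, W, U, h0):
    """Run the same cell over all timesteps; W and U are shared across time."""
    h = h0
    states = []
    for x_t in inputs:
        h = np.tanh(W @ h + U @ x_t)   # new state from previous state and current input
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n, input_dim, hidden_dim = 5, 2, 3     # n timesteps
inputs = rng.normal(size=(n, input_dim))
W = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))   # state -> state
U = rng.normal(scale=0.5, size=(hidden_dim, input_dim))    # input -> state

states = rnn_forward(inputs, W, U, h0=np.zeros(hidden_dim))
```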
Backpropagation through time
Works just like regular backpropagation, applied to the network unrolled over the timesteps.
- Assume the weights of all timesteps are independent
- compute gradients with backpropagation
- All computed gradients are averaged for weight update.
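The three steps can be sketched for a scalar toy RNN. Assumptions: a linear recurrence $h_t = w h_{t-1} + u x_t$, quadratic error on the last state only, and made-up numbers. The sum of the per-timestep gradients is the exact derivative with respect to the shared weight; averaging instead only rescales it by $1/n$.

```python
def forward(w, u, xs, h0=0.0):
    """Return all states h_0 .. h_n of the scalar recurrence."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + u * x)
    return hs

def error(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grads_w(w, u, xs, target):
    """Gradient for each timestep's copy of w, treated as independent weights."""
    hs = forward(w, u, xs)
    delta = hs[-1] - target                  # dE/dh_n
    per_step = []
    for t in range(len(xs), 0, -1):
        per_step.append(delta * hs[t - 1])   # gradient for the copy of w used at step t
        delta = delta * w                    # pass the delta back to step t-1
    return per_step

xs = [1.0, 0.5, -0.5, 2.0]
w, u, target = 0.8, 0.3, 1.0

per_step = bptt_grads_w(w, u, xs, target)
summed = sum(per_step)
averaged = summed / len(per_step)

# Finite-difference check of the derivative with respect to the shared w.
eps = 1e-6
numeric = (error(w + eps, u, xs, target) - error(w - eps, u, xs, target)) / (2 * eps)
```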
Exploding / Vanishing gradient
One problem of RNNs is that activity often either vanishes or explodes over time, depending on whether the eigenvalues of the recurrent weight matrix have a magnitude smaller or larger than 1.
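A rough illustration, assuming a purely linear recurrence $h_t = W h_{t-1}$ without inputs or transfer function (a simplification). The random matrix is rescaled so that its largest eigenvalue magnitude (spectral radius) is below or above 1; the helper names are made up.

```python
import numpy as np

def scale_to_spectral_radius(W, radius):
    """Rescale W so that its largest eigenvalue magnitude equals radius."""
    current = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (radius / current)

def norm_after(W, h, steps):
    """Norm of the state after applying W repeatedly."""
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))
h0 = rng.normal(size=5)

vanishing = norm_after(scale_to_spectral_radius(W, 0.5), h0, steps=50)
exploding = norm_after(scale_to_spectral_radius(W, 1.5), h0, steps=50)
```

The same repeated multiplication appears in the backward pass, which is why the gradients show the same behaviour.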
Echo State Networks
Echo state networks set W and U so that their ||y_i|| is almost equal to r. (TODO: why is r in the range 1.3 to 3?)
There are units that specialize in long or short term memory. This depends on a factor
- Delay update of hidden layer
- Special transfer function (only retrieve state in certain cases)
Radial Basis Function Networks
Also see Wikipedia.
A radial basis function is a function that depends only on the distance from a center (usually the Euclidean distance).
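The Gaussian function is often used, with center $c$ and width $\sigma$:
$$\phi(x) = \exp\left(-\frac{\lVert x - c \rVert^2}{2\sigma^2}\right)$$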
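A short sketch of a Gaussian radial basis function, assuming the usual form $\exp(-\lVert x - c \rVert^2 / (2\sigma^2))$ with center `center` and width `sigma` (parameter names chosen for illustration). The value depends only on the distance from the center.

```python
import numpy as np

def gaussian_rbf(x, center, sigma):
    """Gaussian RBF: 1 at the center, decaying with Euclidean distance."""
    distance_sq = np.sum((np.asarray(x) - np.asarray(center)) ** 2)
    return np.exp(-distance_sq / (2.0 * sigma ** 2))

center = [1.0, 2.0]
at_center = gaussian_rbf([1.0, 2.0], center, sigma=1.0)
left = gaussian_rbf([0.0, 2.0], center, sigma=1.0)    # distance 1 from the center
above = gaussian_rbf([1.0, 3.0], center, sigma=1.0)   # also distance 1
```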
Learning with RBFs
Three different parameters:
- centroid (center of basis function)
- range of influence
- weights of the output layer
A two-step learning procedure is an alternative to learning all parameters at once.
- Find centroids and variances
- Determine output weights
Find centroids and variances
Use k-means clustering to find centroids
Choose the width $\sigma$ so that it is double the distance between the two closest centroids.
Determine output weights
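Output weights $w_j$ are found by minimizing the quadratic error, with $M$ := number of RBFs, $\phi_j$ the $j$-th basis function and $t_i$ the target for input $x_i$:
$$E = \frac{1}{2} \sum_i \left( \sum_{j=1}^{M} w_j \phi_j(x_i) - t_i \right)^2$$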
TODO: Do we now use Gradient Descent or invertible matrix?
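On the question above: because the output is linear in the output weights, minimizing the quadratic error is a linear least-squares problem, so a closed-form solution (pseudo-inverse) is one option and gradient descent would also work. The sketch below uses the closed-form route on a toy 1D regression problem. The tiny k-means implementation, the data, and `M = 8` are assumptions for illustration only.

```python
import numpy as np

def kmeans(X, k, rng, iterations=100):
    """Very small k-means (Lloyd's algorithm), enough for this illustration."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old centroid if a cluster is empty
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def design_matrix(X, centroids, sigma):
    """Phi[i, j] = Gaussian RBF j evaluated at data point i."""
    sq_dist = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 2.0 * np.pi, 100).reshape(-1, 1)
t = np.sin(X[:, 0])

# Step 1: find centroids and width.
M = 8
centroids = kmeans(X, M, rng)
pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
np.fill_diagonal(pairwise, np.inf)
sigma = 2.0 * pairwise.min()     # double the distance of the two closest centroids

# Step 2: determine output weights by linear least squares.
Phi = design_matrix(X, centroids, sigma)
weights, *_ = np.linalg.lstsq(Phi, t, rcond=None)
mse = np.mean((Phi @ weights - t) ** 2)
```

With the centroids and width fixed in step 1, step 2 has a unique least-squares optimum and needs no iterative training.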
MLP vs RBF
RBFs converge fast: only a few parameters need to be changed per training point, because each basis function has negligible influence on far-away points.
RBFs suffer from the curse of dimensionality: they need $n^d$ basis functions ($n$: number of data points along one dimension, $d$: number of dimensions).
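To get a feeling for the growth, here is a tiny sketch assuming the count is $n^d$, with `n = 10` points per dimension chosen arbitrarily.

```python
def basis_functions_needed(n, d):
    """Number of basis functions on a grid with n points along each of d dimensions."""
    return n ** d

counts = {d: basis_functions_needed(10, d) for d in (1, 2, 3, 6)}
print(counts)
```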
RBFs are kernel functions: they map non-linear data into a space where it can be treated linearly, so that linear regression can be done on the mapped data.
RBFs are useful for low-dimensional data.