Following the principle “You have only understood something thoroughly if you can explain it” - here are my prep notes for Machine Intelligence II. If no source is indicated, the content comes from the lecture slides.

Note: This was written primarily for my own understanding, so it might contain incomplete explanations.

Chapters

General Terms and Tools

Many of the methods rely on some general tools that are reused throughout. Need a refresher on matrix multiplication? Oh, and the dot product is the same thing as the scalar product.

Centered Data

Centering data means shifting its center of mass to 0: for each dimension, the mean is computed and then subtracted from every data point.

\[X \leftarrow X - \frac{1}{p}\sum_{\alpha=1}^p x^{(\alpha)}\]

The mean that is subtracted here is also called the first moment.

or with numpy:

import numpy as np

# x is our data matrix of shape (p, n): one row per data point
x_centered = x - np.mean(x, axis=0)

Covariance matrix

Assuming \(p\) centered data points \(x^{(\alpha)}\), the covariance matrix is

\[\begin{align*} C_{ij} &= \frac{1}{p}\sum_{\alpha=1}^p x_i^{(\alpha)}x_j^{(\alpha)} \\ \text{or}\quad C &= \frac{1}{p}\underline{x}^T\underline{x} \end{align*}\]

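In numpy, assuming the centered data matrix x_centered from the snippet above (shape (p, n)), the same matrix can be computed as:

# covariance matrix of the centered data, matching the formula above
C = x_centered.T @ x_centered / x_centered.shape[0]
# equivalent via numpy's built-in (bias=True gives the 1/p normalization)
C_np = np.cov(x_centered, rowvar=False, bias=True)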
Whitened Data

Whitening transforms the data so that its covariance matrix becomes the identity matrix. The data is then uncorrelated (but may still be statistically dependent). This is useful, e.g., for finding outliers.
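A minimal sketch of one way to do this (ZCA whitening via the eigendecomposition of the covariance matrix); the function name and the eps regularizer are my own choices:

import numpy as np

def whiten(x, eps=1e-8):
    # x is assumed to be centered already, shape (p, n)
    p = x.shape[0]
    C = x.T @ x / p                       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigendecomposition (C is symmetric)
    # rescale each principal direction by 1/sqrt(eigenvalue)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return x @ W                          # whitened data: covariance ~ identity

x_white = whiten(x_centered)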

Kullback-Leibler Divergence

The Kullback-Leibler divergence measures the dissimilarity between two probability distributions - here \(P\) and \(\hat P\). Note that it is not a true distance, since it is not symmetric.

\[D_{KL}[P(\underline x), \hat{P}(\underline x; \underline w)] = \int d\underline x\, P(\underline x) \ln \frac{P(\underline x)}{\hat{P}(\underline x; \underline w)} \overset{!}{=} \min_{\underline w}\]
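For discrete distributions the integral becomes a sum. A small numpy sketch (the function name is my own), which also shows the asymmetry:

import numpy as np

def kl_divergence(p, q):
    # discrete KL divergence D_KL[p || q] for two probability vectors
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0              # 0 * log 0 is taken as 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))  # different values: not symmetric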

Jacobian Matrix

For a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) the Jacobian matrix contains all first-order partial derivatives \(\frac{\partial f_i}{\partial x_j}\):

\[\begin{bmatrix} \frac{\partial f_1}{\partial x_1}& \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}& \cdots & \frac{\partial f_m}{\partial x_n}\\ \end{bmatrix}\]
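A quick way to sanity-check a Jacobian is a finite-difference approximation. A minimal sketch (function name and step size are my own choices):

import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    # approximate the Jacobian of f: R^n -> R^m at x via forward differences
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - f0) / eps
    return J

# example: f(x, y) = (x*y, x + y) has Jacobian [[y, x], [1, 1]]
print(jacobian_fd(lambda v: np.array([v[0] * v[1], v[0] + v[1]]), [2.0, 3.0]))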

Mercer's theorem

From the slides:

Every positive semidefinite kernel k corresponds to a scalar product in some metric feature space
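As an illustration, a Gaussian (RBF) kernel matrix is positive semidefinite, which can be checked numerically via its eigenvalues (the kernel choice and gamma are just my example):

import numpy as np

def rbf_kernel_matrix(x, gamma=1.0):
    # pairwise squared euclidean distances between the rows of x
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

x = np.random.randn(5, 3)
K = rbf_kernel_matrix(x)
print(np.linalg.eigvalsh(K))  # all eigenvalues are (numerically) non-negative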

Markov Process

A Markov process depends only on the most recent state: the probabilities of which state it moves to next are independent of all earlier states.
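A tiny simulation sketch of a two-state Markov chain (the transition probabilities are made up for illustration):

import numpy as np

P = np.array([[0.9, 0.1],   # transition probabilities from state 0
              [0.5, 0.5]])  # transition probabilities from state 1

rng = np.random.default_rng(0)
state = 0
states = [state]
for _ in range(10):
    state = rng.choice(2, p=P[state])  # next state depends only on the current state
    states.append(int(state))
print(states)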

Variance

\[\sigma ^2 = E[(x-\mu)^2]\]

Discrete

\[\sigma ^2 = \frac{1}{p}\sum_{\alpha=1}^p(x_\alpha-\mu)^2\]
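In numpy, np.var uses exactly this \(\frac{1}{p}\) normalization by default (ddof=0):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.var(x), np.mean((x - np.mean(x)) ** 2))  # both give the same value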
