# Unsupervised Learning Methods Exam Preparation

Following the principle “You only understand something thoroughly if you can explain it”, here are my prep notes for Machine Intelligence II. If no sources are indicated, the content comes from the lecture slides.

**Note** This was primarily written for my own understanding, so it might contain incomplete explanations.

## Chapters

- General Terms and tools
- PCA
  - PCA
  - Hebbian Learning
  - Kernel-PCA
- Source Separation
  - ICA
  - Infomax ICA
  - Second Order Source Separation
  - FastICA
- Stochastic Optimization
- Clustering
  - k-means Clustering
  - Pairwise Clustering
  - Self-Organising Maps
  - Locally Linear Embedding
- Estimation Theory
  - Density Estimation
  - Kernel Density Estimation
  - Parametric Density Estimation
  - Mixture Models

## General Terms and tools

A lot of the different methods rely on some general methodology that will be reused. Need a refresher on matrix multiplication? Oh, and the dot product is the same as the scalar product.

### Centered Data

Centering data means shifting its center of mass to 0: for each dimension, the average over all data points is computed and then subtracted from every data point.

\[X = X - \frac{1}{p}\sum_{\alpha=1}^p x^{(\alpha)}\]

The subtracted mean is also called the first moment.

or with numpy (a minimal sketch, assuming a data matrix `X` of shape `(p, n)`, i.e. points × dimensions):
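```python
import numpy as np

X = np.random.randn(100, 3)   # hypothetical data: p = 100 points, n = 3 dimensions
X = X - X.mean(axis=0)        # subtract the per-dimension mean (the first moment)

assert np.allclose(X.mean(axis=0), 0)  # the center of mass is now 0
```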

### Covariance matrix

Assuming \(p\) centered data points \(x^{(\alpha)}\):

\[\begin{align*} C_{ij} &= \frac{1}{p}\sum_{\alpha=1}^p x_i^{(\alpha)}x_j^{(\alpha)} \\ \text{or} \quad C &= \frac{1}{p}\underline{x}^T\underline{x} \end{align*}\]
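As a minimal numpy sketch (my own, assuming a centered data matrix `X` of shape `(p, n)`):

```python
import numpy as np

X = np.random.randn(100, 3)   # hypothetical data matrix
X = X - X.mean(axis=0)        # center it first

C = X.T @ X / len(X)          # covariance matrix, shape (n, n)
# equivalent: np.cov(X, rowvar=False, bias=True)
```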

### Whitened Data

Whitening transforms the data matrix so that its covariance matrix becomes the identity matrix. The whitened data is then uncorrelated (but might still be statistically dependent). This is useful, e.g., for finding outliers.
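A minimal sketch of one way to whiten (my own, via eigendecomposition of the covariance matrix; `X` is a hypothetical centered data matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) * [3.0, 1.0, 0.2]   # hypothetical data
X = X - X.mean(axis=0)                                # center it first

C = X.T @ X / len(X)            # covariance matrix
d, E = np.linalg.eigh(C)        # eigenvalues d, eigenvectors E (columns)
X_white = (X @ E) / np.sqrt(d)  # rotate onto eigenvectors, rescale each direction

# the covariance of the whitened data is (close to) the identity matrix
print(np.round(X_white.T @ X_white / len(X_white), 2))
```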

### Kullback-Leibler Divergence

The Kullback-Leibler divergence measures the difference between two probability distributions, in this example \(P\) and \(\hat P\). It is not symmetric, so it is not a distance in the strict sense. Fitting a parametric model \(\hat{P}(\underline x; \underline w)\) to \(P\) means minimizing it with respect to the parameters \(\underline w\):

\[D_{KL}\left[P(\underline x), \hat{P}(\underline x; \underline w)\right] = \int d\underline x \, P(\underline x)\ln \frac{P(\underline x)}{\hat{P}(\underline x; \underline w)} \overset{!}{=} \underset{\underline w}{\min}\]
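As an illustration (my own, not from the slides), the discrete analogue \(\sum_i p_i \ln(p_i / q_i)\) for two hypothetical probability vectors:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[p || q] for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                  # terms with p_i = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.4, 0.1]
q = [0.6, 0.2, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))  # not symmetric, hence no true distance
```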

### Jacobian Matrix

For a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\), the Jacobian matrix contains all first-order partial derivatives:

\[\begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}\\ \end{bmatrix}\]
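As a quick numerical sanity check (my own sketch, not from the slides), the Jacobian can be approximated by central differences:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the (m, n) Jacobian of f: R^n -> R^m at x via central differences."""
    x = np.asarray(x, float)
    m = np.asarray(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2 * eps)
    return J

# f: R^2 -> R^2 with exact Jacobian [[2x, 2y], [y, x]]
f = lambda v: np.array([v[0]**2 + v[1]**2, v[0] * v[1]])
print(numerical_jacobian(f, [1.0, 2.0]))  # ~[[2, 4], [2, 1]]
```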

### Mercer's theorem

From the slides:

> Every positive semidefinite kernel \(k\) corresponds to a scalar product in some metric feature space.
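A small sketch (my own) illustrating the positive-semidefinite part: the Gram matrix of a Gaussian (RBF) kernel has no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))   # hypothetical data points

# Gram matrix K_ab = k(x_a, x_b) for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 2.0)

print(np.linalg.eigvalsh(K).min())  # >= 0 up to numerical noise
```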

### Markov Process

A Markov process depends only on the most recent state, i.e. the probabilities of which state it will enter next are independent of any older states.
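A minimal simulation sketch (my own, with a hypothetical transition matrix `T`):

```python
import numpy as np

# T[i, j] = probability of moving from state i to state j
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state = 0
for _ in range(10):
    # the next state depends only on the current state (Markov property)
    state = rng.choice(2, p=T[state])
print(state)
```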

### Variance

\[\sigma^2 = E\left[(x-\mu)^2\right]\]

**Discrete**