Clustering - k-means & SOM

K-means Clustering

K-means Clustering is good at finding equally sized clusters of data points.

Parameters

Distance Function (Usually Euclidean)
Number of clusters

Drawbacks

Cannot cope well with clusters of very different sizes
Cannot automatically select the number of clusters (resolution)

Two algorithms

Batch K-Means
Online K-Means

Online K-Means is often preferable because it is less likely to get stuck in a local minimum and can be used for streaming data (that's why it's 'online').

Batch K-Means

$q$ is the index of the prototype.

${\underset{―}{w}}_{q}$ is the prototype (cluster center) with index $q$.

$m_{q}^{(α)}$ is the assignment variable: it is $1$ if datapoint $α$ is assigned to prototype $q$ and $0$ otherwise. Every datapoint can only be assigned to one prototype.

initialize prototypes randomly around the center of mass
loop until the assignments no longer change (or for a fixed number of iterations)
- Assign every data point to its closest prototype, i.e. set $m_{q}^{(α)} = 1$ for the closest ${\underset{―}{w}}_{q}$ and $0$ for all others
- choose a ${\underset{―}{w}}_{q}$ that minimizes the error function. For the Euclidean distance this is the center of mass of the cluster:
  ${\underset{―}{w}}_{q} = \frac{\sum_{α = 1}^{p} m_{q}^{(α)} {\underset{―}{x}}^{(α)}}{\sum_{α = 1}^{p} m_{q}^{(α)}}$
  The sum of all datapoints in the cluster divided by the number of datapoints in the cluster.
- Set new ${\underset{―}{w}}_{q}$
end loop
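
The loop above can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the names `batch_kmeans` and `squared_distance`, the noise scale `0.1`, `max_iter` and `seed` are assumptions made for illustration, and the loop stops as soon as the assignments no longer change.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def batch_kmeans(points, n_clusters, max_iter=100, seed=0):
    """Alternate between the assignment step and the center-of-mass update."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialize the prototypes randomly around the center of mass
    # (noise scale 0.1 is an arbitrary choice for this sketch).
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    prototypes = [[c + rng.gauss(0.0, 0.1) for c in center]
                  for _ in range(n_clusters)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: every point goes to its closest prototype.
        new_assignments = [
            min(range(n_clusters),
                key=lambda q: squared_distance(p, prototypes[q]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the prototypes would not move either
        assignments = new_assignments
        # Update step: move each prototype to the mean of its cluster.
        for q in range(n_clusters):
            members = [p for p, a in zip(points, assignments) if a == q]
            if members:  # an empty cluster keeps its old prototype
                prototypes[q] = [sum(m[d] for m in members) / len(members)
                                 for d in range(dim)]
    return prototypes, assignments


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
prototypes, assignments = batch_kmeans(points, n_clusters=2)
```

Which of the possible fixed points is reached depends on the random initialization, so different seeds can give different partitions.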

Problem

Batch K-Means is not a convex optimization problem. A convex optimization problem would guarantee that every local minimum is a global minimum. This means that we are not certain to converge to the global optimum.

Online K-Means

Initialize all prototypes ${\underset{―}{w}}_{q}$ randomly around the first data point
Select learning rate $ε$
for each data point ${\underset{―}{x}}^{(α)}$
- find the closest prototype ${\underset{―}{w}}_{q}$
- nudge that prototype a bit in the direction of the data point
  $Δ {\underset{―}{w}}_{q} = ε ({\underset{―}{x}}^{(α)} - {\underset{―}{w}}_{q})$
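
A minimal Python sketch of this update rule follows. The function name `online_kmeans`, the noise scale `0.1`, the default learning rate and the idea of making several shuffled passes (`n_epochs`) over a finite data set are assumptions for illustration; with a real stream one would simply apply the update to each incoming point once.

```python
import random


def online_kmeans(points, n_clusters, learning_rate=0.1, n_epochs=20, seed=0):
    """Online k-means: move only the closest prototype towards each point."""
    rng = random.Random(seed)
    # Initialize all prototypes randomly around the first data point.
    prototypes = [[xi + rng.gauss(0.0, 0.1) for xi in points[0]]
                  for _ in range(n_clusters)]
    for _ in range(n_epochs):
        # Visit the data points in random order (assumed, not in the notes).
        for x in rng.sample(points, len(points)):
            # Find the closest prototype.
            q = min(range(n_clusters),
                    key=lambda r: sum((xi - wi) ** 2
                                      for xi, wi in zip(x, prototypes[r])))
            # Nudge it towards the data point: w_q += eps * (x - w_q).
            prototypes[q] = [wi + learning_rate * (xi - wi)
                             for xi, wi in zip(x, prototypes[q])]
    return prototypes


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
prototypes = online_kmeans(points, n_clusters=2)
```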

Online K-Means is less likely to be stuck in a local minimum.

Robustness

Labels are not ordered

A good choice of the number of clusters $M$ should be stable/repeatable. K-Means can therefore be run several times for each candidate $M$ (with different initializations, of course) to see for which $M$ the resulting clusters are stable.

Validation

TODO: page 19

The following is likely wrong, better read it up in the original paper¹:

Algorithm:

for m in range(m_min, m_max):
  Y = {}
  for i, x_split in enumerate(split_in_pieces(x, r)):
    Y[i] = find_clusters(x_split, m)
  # TODO Compute dissimilarities between the clusterings of different splits
  dissimilarity(Y[1], Y[0])

Pairwise Clustering

Pairwise Clustering is the clustering we are used to (e.g. $m$ partitions), but instead of the original data points, only the pairwise distances are available to us. The distances are collected in a $p \times p$ matrix. This can happen e.g. when a kernel trick was used or when the measurements taken were already dissimilarity measurements.
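
To make the input format concrete, the following sketch builds such a $p \times p$ distance matrix from three example points. The helper name `distance_matrix` and the example points are made up for illustration; a pairwise clustering method would only ever see `D`, never `points`.

```python
import math


def distance_matrix(points):
    """Return the symmetric p x p matrix of pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in points] for a in points]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(points)  # D[a][b] is the distance between points a and b
```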

Mean-Field Approximation (not very relevant)

simulated annealing (slow)
mean-field approximation (good and fast, robust against local optima)

Soft-K-means clustering (Euclidean distances) (Very Relevant)

Algorithm

Choose number of partitions $M$
Choose parameters:
- initial noise parameter ( $β_{0}$ )
- final noise parameter ( $β_{f}$ )
- annealing factor $η$ (with $η > 1$, since $β$ grows from $β_{0}$ to $β_{f}$)
- convergence criterion $θ$
Initialize prototypes

${\underset{―}{w}}_{q} = \frac{1}{p} \sum_{α = 1}^{p} {\underset{―}{x}}^{(α)} + \text{random noise}$

while $β < β_{f}$
- repeat EM until $| {\underset{―}{w}}_{q}^{new} - {\underset{―}{w}}_{q}^{old} | < θ \; \forall q$
  - compute assignment probabilities $(m_{q}^{(α)})_{Q}$ (look up the formula)
  - compute new prototypes ${\underset{―}{w}}_{q}^{new} = \frac{\sum_{α = 1}^{p} (m_{q}^{(α)})_{Q} {\underset{―}{x}}^{(α)}}{\sum_{α = 1}^{p} (m_{q}^{(α)})_{Q}} \; \forall q$
- $β \leftarrow η β$
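
The annealing loop can be sketched as follows. The notes leave the formula for the assignment probabilities open, so this sketch assumes the common softmax form, proportional to $\exp(- \frac{β}{2} | x^{(α)} - w_{q} |^{2})$ and normalized over all prototypes; check it against the lecture before relying on it. All names and default values (`soft_kmeans`, `soft_assignments`, `max_em_steps`, the noise scale) are likewise assumptions.

```python
import math
import random


def soft_assignments(x, prototypes, beta):
    """Assumed form: softmax over -beta/2 * squared Euclidean distance."""
    energies = [-0.5 * beta * sum((xi - wi) ** 2 for xi, wi in zip(x, w))
                for w in prototypes]
    top = max(energies)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in energies]
    total = sum(weights)
    return [wt / total for wt in weights]


def soft_kmeans(points, n_clusters, beta_0=0.1, beta_f=20.0, eta=1.5,
                theta=1e-6, seed=0, max_em_steps=1000):
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialize the prototypes at the center of mass plus random noise.
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    prototypes = [[c + rng.gauss(0.0, 0.1) for c in center]
                  for _ in range(n_clusters)]
    beta = beta_0
    while beta < beta_f:
        for _ in range(max_em_steps):  # repeat EM until converged
            # E-step: assignment probabilities of every point.
            probs = [soft_assignments(x, prototypes, beta) for x in points]
            # M-step: probability-weighted mean for every prototype.
            new_prototypes = []
            for q in range(n_clusters):
                norm = sum(pr[q] for pr in probs)
                if norm > 0.0:
                    new_prototypes.append(
                        [sum(pr[q] * x[d] for pr, x in zip(probs, points))
                         / norm for d in range(dim)])
                else:  # no probability mass left: keep the old prototype
                    new_prototypes.append(prototypes[q])
            shift = max(math.dist(new, old)
                        for new, old in zip(new_prototypes, prototypes))
            prototypes = new_prototypes
            if shift < theta:
                break
        beta *= eta  # annealing step
    return prototypes


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
prototypes = soft_kmeans(points, n_clusters=2)
```

For small $β$ the assignments are nearly uniform and all prototypes sit close to the center of mass; as $β$ grows the assignments harden and the procedure approaches ordinary k-means.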

Self-Organising Maps (SOM)

Self-Organizing Maps are useful for dimensionality reduction that preserves neighborhood relations. At the same time, a SOM is a clustering method, as opposed to LLE (see below), which projects every data point into the lower-dimensional space.

Parameters

$M$ partitions/neurons
learning rate $ε$ and an annealing schedule for the neighborhood width $σ$
prototypes ${\underset{―}{w}}_{\underset{―}{q}}$, initialized at the center of mass $\pm$ random noise

Algorithm

for x_a in random(x)
- get the closest prototype
  $\underset{―}{p} = \underset{\underset{―}{r}}{\mathrm{argmin}} \, | {\underset{―}{x}}^{(α)} - {\underset{―}{w}}_{\underset{―}{r}} |$
  Note: $\underset{―}{p}$ is in the map space while ${\underset{―}{x}}^{(α)}$ is in data space
- change all prototypes using:
  $Δ {\underset{―}{w}}_{\underset{―}{q}} = ε \, h (\underset{―}{q}, \underset{―}{p}) ({\underset{―}{x}}^{(α)} - {\underset{―}{w}}_{\underset{―}{q}})$ for all $\underset{―}{q}$

Example choice of $h (\underset{―}{q}, \underset{―}{p})$

$h (\underset{―}{q}, \underset{―}{p}) = e^{- \frac{| \underset{―}{q} - \underset{―}{p} |^{2}}{2 σ^{2}}}$

Annealing $σ$

Decrease $σ$ slowly over time

This algorithm reduces to online k-means if only the winning prototype is updated, i.e. if $h (\underset{―}{q}, \underset{―}{p}) = 1$ for $\underset{―}{q} = \underset{―}{p}$ and $0$ otherwise, which is the limit $σ \to 0$ of the Gaussian neighborhood. Only the closest prototype is then moved towards the data point, and the exponential neighborhood term, which k-means does not contain, disappears.
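
A minimal sketch of the SOM update for a one-dimensional map, where the map coordinate of a neuron is simply its index. The function names, the geometric schedule from `sigma_0` to `sigma_f`, the fixed learning rate and the number of epochs are assumptions chosen for illustration, not values from the lecture.

```python
import math
import random


def neighborhood(q, p, sigma):
    """Gaussian neighborhood on the map grid (q and p are map coordinates)."""
    return math.exp(-((q - p) ** 2) / (2.0 * sigma ** 2))


def train_som(points, n_neurons, epsilon=0.5, sigma_0=2.0, sigma_f=0.1,
              n_epochs=50, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialize the prototypes at the center of mass plus random noise.
    center = [sum(x[d] for x in points) / len(points) for d in range(dim)]
    prototypes = [[c + rng.gauss(0.0, 0.1) for c in center]
                  for _ in range(n_neurons)]
    for epoch in range(n_epochs):
        # Decrease sigma slowly: geometric schedule (an assumed choice).
        sigma = sigma_0 * (sigma_f / sigma_0) ** (epoch / max(n_epochs - 1, 1))
        for x in rng.sample(points, len(points)):  # random presentation order
            # Winner p: map index of the closest prototype in data space.
            p = min(range(n_neurons),
                    key=lambda r: math.dist(x, prototypes[r]))
            # Move every prototype, weighted by its map distance to p.
            for q in range(n_neurons):
                h = neighborhood(q, p, sigma)
                prototypes[q] = [w + epsilon * h * (xi - w)
                                 for xi, w in zip(x, prototypes[q])]
    return prototypes


points = [(float(i),) for i in range(10)]  # ten points on a line
prototypes = train_som(points, n_neurons=5)
```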

Locally Linear Embedding

“locally linear embedding (LLE), an unsupervised learning algorithm that computes low dimensional, neighborhood preserving embeddings of high dimensional data.”² In Locally Linear Embedding the idea is to split the data into small patches. For each data point, the weights $\underset{―}{W}$ allow an optimal reconstruction of the point from its neighbours in the patch, and the embedding coordinates $\underset{―}{U}$ are chosen so that the same weights also reconstruct the points in the low-dimensional space.

Algorithm

Parameters $K, M$

find the $K$ nearest neighbours of every data point: $KNN ({\underset{―}{x}}^{(α)}) = \{ β_{1}^{(α)}, \ldots, β_{K}^{(α)} \}$ for all $α = 1, \ldots, p$
calculate reconstruction weights $\underset{―}{W}$
calculate embedding coordinates $\underset{―}{U}$

Cost function (step 2: reconstruction weights)

$E$ is our error function. $p$ is the number of data points. $W_{α β}$ are the weights

$E (\underset{―}{W}) = \sum_{α = 1}^{p} {| {\underset{―}{x}}^{(α)} - \sum_{β = 1}^{p} W_{α β} {\underset{―}{x}}^{(β)} |}^{2} \overset{!}{=} \min$

subject to $W_{α β} = 0$ if $β \notin KNN ({\underset{―}{x}}^{(α)})$ and $\sum_{β = 1}^{p} W_{α β} = 1$ for every $α$.

LLE is very good at preserving neighborhoods in high-dimensional data. The optimization problem for the weights is convex.
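
For a single data point the constrained minimization has a closed-form solution: build the local Gram matrix $C_{j k} = ({\underset{―}{x}}^{(α)} - {\underset{―}{x}}^{(β_{j})}) \cdot ({\underset{―}{x}}^{(α)} - {\underset{―}{x}}^{(β_{k})})$, solve $C z = 1$, and rescale $z$ so that it sums to one. The sketch below does this in plain Python. The function names are made up for illustration, and the small regularization term `reg` is an assumption: it is a common remedy when $C$ is singular, for example when $K$ exceeds the data dimension. The sketch also assumes the point has no exact duplicates among its neighbours.

```python
import math


def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def reconstruction_weights(points, alpha, K, reg=1e-3):
    """Row alpha of W: weights that reconstruct points[alpha] from its KNN."""
    x = points[alpha]
    # K nearest neighbours of x, excluding x itself.
    neighbours = sorted((b for b in range(len(points)) if b != alpha),
                        key=lambda b: math.dist(x, points[b]))[:K]
    # Local Gram matrix of the difference vectors x - x_neighbour.
    diffs = [[xi - ni for xi, ni in zip(x, points[b])] for b in neighbours]
    C = [[sum(u * v for u, v in zip(dj, dk)) for dk in diffs] for dj in diffs]
    trace = sum(C[j][j] for j in range(K))
    for j in range(K):
        C[j][j] += reg * trace  # assumed regularization, see lead-in
    z = solve_linear_system(C, [1.0] * K)
    total = sum(z)
    weights = [0.0] * len(points)  # W is zero outside the neighbourhood
    for b, z_b in zip(neighbours, z):
        weights[b] = z_b / total  # enforce the sum-to-one constraint
    return weights


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 0.0)]
w = reconstruction_weights(points, alpha=1, K=2)
```

In this example the middle point lies exactly halfway between its two neighbours, so both receive the same weight and the far-away fourth point gets weight zero.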

Footnotes

¹ Lange, Roth, Braun, Buhmann: Stability-Based Validation of Clustering Solutions
² Roweis, Saul: An Introduction to Locally Linear Embedding