Clustering: k-means & SOM
Chapters
 General Terms and Tools
 PCA
 PCA
 Hebbian Learning
 Kernel PCA
 Source Separation
 ICA
 Infomax ICA
 Second-Order Source Separation
 FastICA
 Stochastic Optimization
 Clustering
 k-means Clustering
 Pairwise Clustering
 Self-Organising Maps
 Locally Linear Embedding
 Estimation Theory
 Density Estimation
 Kernel Density Estimation
 Parametric Density Estimation
 Mixture Models & Estimation Models
 Density Estimation
k-means Clustering
k-means clustering is good at finding compact, roughly equally sized clusters of data points.
Parameters
 Distance Function (Usually Euclidean)
 Number of clusters
Drawbacks
 Cannot cope well with clusters of different sizes (large clusters tend to be split, small ones merged)
 Cannot automatically select the number of clusters (resolution)
Two algorithms
 Batch KMeans
 Online KMeans
Online k-means is generally preferable: it is less likely to converge to a poor local minimum, and it can be used for streaming data (that is why it is called 'online').
Batch k-means
\(q\) is the index of the prototype.
\(\underline w_q\) is the prototype (weight) vector of cluster \(q\).
\(m_q^{(\alpha)}\) is the binary assignment variable: \(m_q^{(\alpha)} = 1\) if data point \(\alpha\) is assigned to prototype \(q\), and \(0\) otherwise. Every data point can only be assigned to one prototype.
 initialize prototypes randomly around the center of mass
 loop until convergence
 Assign each data point to its closest prototype (set \(m_q^{(\alpha)}\) accordingly)

choose a \(\underline w_q\) that minimizes the error function. In the case of the Euclidean distance this means the center of mass:
\[\underline w_q = \frac{\sum^p_\alpha m_q^{(\alpha)}\underline x^{(\alpha)}}{\sum^p_\alpha m_q^{(\alpha)}}\]
That is: the sum of all data points in the cluster divided by the number of data points in the cluster.
 Set new \(\underline w_q\)
 end loop
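The batch loop above can be sketched in NumPy as follows (a minimal sketch: the function name, the iteration count, and the small initialization noise scale are my own choices, not from the notes):

```python
import numpy as np

def batch_kmeans(X, M, n_iter=100, seed=0):
    """Batch k-means sketch. X: (p, n) data matrix, M: number of prototypes."""
    rng = np.random.default_rng(seed)
    # initialize prototypes randomly around the center of mass
    w = X.mean(axis=0) + 0.01 * rng.standard_normal((M, X.shape[1]))
    for _ in range(n_iter):
        # assignment step: index of the closest prototype for every data point
        d = ((X[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
        m = d.argmin(axis=1)
        # update step: each prototype moves to the center of mass of its cluster
        for q in range(M):
            if np.any(m == q):
                w[q] = X[m == q].mean(axis=0)
    return w, m
```

For the Euclidean distance the update step is exactly the center-of-mass formula above.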
Problem
Batch k-means is not a convex optimization problem (in a convex optimization problem every local minimum is also a global minimum). This means that we are not certain to converge to the global optimum.
Online k-means
 Initialize all \(\underline w_q\) randomly around the first data point
 Select a learning rate \(\varepsilon\)
 for each \(\underline x^{(\alpha)}\) in the (streamed) data
 find the closest prototype \(\underline w_q\)

nudge the prototype a bit in the direction of the data point:
\[\Delta\underline w_q = \varepsilon (\underline x^{(\alpha)} - \underline w_q)\]
Online k-means is less likely to get stuck in a local minimum.
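A sketch of the online variant (again, the function name and the learning-rate value are assumptions):

```python
import numpy as np

def online_kmeans(X, M, eps=0.05, seed=0):
    """Online k-means sketch: prototypes are nudged one data point at a time."""
    rng = np.random.default_rng(seed)
    # initialize all prototypes randomly around the first data point
    w = X[0] + 0.01 * rng.standard_normal((M, X.shape[1]))
    for x in X:
        q = ((w - x) ** 2).sum(axis=1).argmin()  # closest prototype
        w[q] += eps * (x - w[q])                 # nudge it toward the data point
    return w
```

Because each point is processed once and then discarded, the same loop works for streaming data; in practice \(\varepsilon\) is often annealed toward zero.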
Robustness
Labels are not ordered: two runs can find the same clustering with permuted cluster indices, so runs must be compared up to relabeling.
A good choice of the number of clusters \(M\) should be stable/repeatable. Therefore k-means can be run with different numbers of clusters (and, for each \(M\), different initializations) to see which number \(M\) of clusters is stable.
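One way to make this stability check concrete (a sketch; for brevity the prototypes are initialized at random data points rather than around the center of mass, and the pair-counting agreement score is my choice of a permutation-invariant comparison):

```python
import numpy as np

def kmeans_labels(X, M, seed):
    """Minimal batch k-means returning hard labels (helper for the check)."""
    rng = np.random.default_rng(seed)
    w = X[rng.choice(len(X), M, replace=False)]
    for _ in range(50):
        m = ((X[:, None, :] - w[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for q in range(M):
            if np.any(m == q):
                w[q] = X[m == q].mean(axis=0)
    return m

def label_agreement(a, b):
    """Fraction of point pairs on which two clusterings agree
    (same / different cluster) -- invariant to label permutation."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    return (same_a == same_b).mean()

def stability(X, M, n_runs=5):
    """Average pairwise agreement over runs with different initializations."""
    labels = [kmeans_labels(X, M, seed) for seed in range(n_runs)]
    scores = [label_agreement(labels[i], labels[j])
              for i in range(n_runs) for j in range(i + 1, n_runs)]
    return float(np.mean(scores))
```

An \(M\) whose score stays near 1 across initializations is a stable choice.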
Validation
TODO: page 19
The following is likely wrong; better to read it up in the original paper^{1}:
Algorithm:
Pairwise Clustering
Pairwise clustering is the kind of clustering we are used to (e.g. \(M\) partitions), but instead of the original data points only the pairwise distances are available to us, collected in a \(p \times p\) matrix. This can happen e.g. when a kernel trick was used or when the measurements taken were already dissimilarity measurements.
Mean-Field Approximation (not very relevant)
 simulated annealing (slow)
 mean-field approximation (good and fast, robust against local optima)
Soft k-means clustering (Euclidean distances) (very relevant)
Algorithm
 choose number of partitions \(M\)
 Choose parameters:
 initial noise (\(\beta_0\))
 final noise (\(\beta_f\))
 annealing factor \(\eta\)
 convergence criterion \(\theta\)
 initialize prototypes
 while \(\beta < \beta_f\)
 repeat EM until the change falls below \(\theta\):
 compute assignment probabilities (look up the formula)
 Compute new prototypes \(\underline w_q^{new} = \frac{\sum_\alpha^p (m_q^{(\alpha)})_Q \underline x^{(\alpha)}}{\sum_\alpha^p (m_q^{(\alpha)})_Q}\ \forall q\)
 \[\beta \leftarrow \beta + \eta\beta\]
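For the assignment probabilities the notes say to look up the formula; the standard choice is a softmax over the scaled squared distances, \(\left(m_q^{(\alpha)}\right)_Q = \exp(-\tfrac{\beta}{2}\|\underline x^{(\alpha)} - \underline w_q\|^2) \big/ \sum_r \exp(-\tfrac{\beta}{2}\|\underline x^{(\alpha)} - \underline w_r\|^2)\). With that assumption the annealing loop can be sketched as (parameter values are illustrative, not from the notes):

```python
import numpy as np

def soft_kmeans(X, M, beta0=0.1, beta_f=100.0, eta=0.1, theta=1e-5, seed=0):
    """Soft k-means with annealing (sketch). beta plays the role of an
    inverse noise parameter and grows by the factor (1 + eta) per step."""
    rng = np.random.default_rng(seed)
    w = X.mean(axis=0) + 0.01 * rng.standard_normal((M, X.shape[1]))
    beta = beta0
    while beta < beta_f:
        for _ in range(200):  # EM at fixed beta until the change is below theta
            d = ((X[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
            # E-step: softmax assignment probabilities (shifted for stability)
            a = np.exp(-0.5 * beta * (d - d.min(axis=1, keepdims=True)))
            a /= a.sum(axis=1, keepdims=True)
            # M-step: prototypes = assignment-weighted centers of mass
            w_new = (a.T @ X) / a.sum(axis=0)[:, None]
            converged = np.abs(w_new - w).max() < theta
            w = w_new
            if converged:
                break
        beta += eta * beta  # annealing step
    return w
```

At small \(\beta\) all prototypes sit near the global center of mass; as \(\beta\) grows they split, which is what makes the annealing robust against local optima.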
Self-Organising Maps (SOM)
Self-organising maps are useful for dimensionality reduction while preserving neighborhood relations. At the same time SOM is a clustering method, as opposed to LLE, which projects all data points into the lower-dimensional space (see below).
Parameters
 \(M\) partitions/neurons
 annealing schedule for the learning rate \(\varepsilon\) and the neighborhood width \(\sigma\)
 prototypes \(\underline w_{\underline q}\): center of mass \(\pm\) random noise
Algorithm
 for \(\underline x^{(\alpha)}\) in the data, in random order
 get the closest prototype; call its position in map space \(\underline p\)
Note: \(\underline p\) lives in map space while \(\underline x^{(\alpha)}\) lives in data space.
 Change all prototypes using \(\Delta\underline w_{\underline q} = \varepsilon\, h(\underline q, \underline p)\,(\underline x^{(\alpha)} - \underline w_{\underline q})\)
Example choice of \(h(\underline q, \underline p)\)
\[h(\underline q, \underline p) = e^{-\frac{(\underline q - \underline p)^2}{2\sigma^2}}\]
Annealing \(\sigma\)
 Decrease \(\sigma\) slowly over time
This algorithm would be similar to online k-means if one were to update only the closest prototype (\(\underline q = \underline p\)), which is the limit \(\sigma \to 0\): then only the closest prototype is moved toward the data point, and the exponential neighborhood function (which k-means does not contain) plays no role.
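A minimal 1-D SOM sketch of the steps above (map positions \(q = 0, \dots, M-1\); the learning rate, annealing factor, and epoch count are my own illustrative values):

```python
import numpy as np

def som_1d(X, M, eps=0.1, sigma0=2.0, decay=0.99, n_epochs=20, seed=0):
    """Self-organising map with a 1-D chain of M neurons (sketch)."""
    rng = np.random.default_rng(seed)
    q = np.arange(M)  # neuron positions in map space
    # prototypes: center of mass plus small random noise
    w = X.mean(axis=0) + 0.01 * rng.standard_normal((M, X.shape[1]))
    sigma = sigma0
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:  # data in random order
            p = ((w - x) ** 2).sum(axis=1).argmin()       # best-matching unit
            h = np.exp(-(q - p) ** 2 / (2 * sigma ** 2))  # neighborhood fn
            w += eps * h[:, None] * (x - w)  # pull prototypes toward x
        sigma *= decay  # anneal: decrease sigma slowly over time
    return w
```

With \(\sigma \to 0\) the neighborhood \(h\) collapses onto the best-matching unit and the inner update becomes the online k-means rule.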
Locally Linear Embedding
“locally linear embedding (LLE), an unsupervised learning algorithm that computes low dimensional, neighborhood preserving embeddings of high dimensional data.”^{2} The idea of locally linear embedding is to split the data up into small patches; for each patch and data point there are weights \(\underline W\) that allow an optimal reconstruction of the data points, as well as embedding coordinates \(\underline U\).
Algorithm
Parameters \(K,M\)
 find the \(K\) nearest neighbours: \(KNN(\underline x^{(\alpha)}) = \{\beta_1^{(\alpha)}, \dots, \beta_K^{(\alpha)}\}\ \ \forall\, \alpha = 1, \dots, p\)
 calculate reconstruction weights \(\underline W\)
 calculate embedding coordinates \(\underline U\)
2. Cost function
\(E\) is our error function, \(p\) is the number of data points, and \(W_{\alpha\beta}\) are the reconstruction weights.
\[E(\underline W) = \sum^p_{\alpha=1}\left\|\underline x^{(\alpha)} - \sum^p_{\beta=1} W_{\alpha\beta} \underline x ^{(\beta)}\right\|^2 \overset{!}{=} \min \\ \text{s.t. } W_{\alpha\beta} = 0 \text{ if } \beta \notin KNN(\underline x^{(\alpha)}), \quad \sum^p_{\beta=1} W_{\alpha\beta} = 1\]
LLE is very good at preserving neighbourhoods in high-dimensional data. The minimization over \(\underline W\) is convex.
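The constrained least-squares problem above has a closed-form solution per data point: center the neighbours on \(\underline x^{(\alpha)}\), solve the local Gram system, and normalise the weights so they sum to one. A sketch (the regularisation constant is an assumption, needed when \(K\) exceeds the data dimension):

```python
import numpy as np

def lle_weights(X, K):
    """Step 2 of LLE (sketch): reconstruction weights W, row alpha holding the
    weights of the K nearest neighbours of x^(alpha), summing to one."""
    p = len(X)
    W = np.zeros((p, p))
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    for a in range(p):
        nn = np.argsort(d[a])[1:K + 1]       # K nearest neighbours (skip self)
        Z = X[nn] - X[a]                     # neighbours centred on x^(alpha)
        G = Z @ Z.T                          # local K x K Gram matrix
        G += 1e-3 * np.trace(G) * np.eye(K)  # regularise a possibly singular G
        u = np.linalg.solve(G, np.ones(K))
        W[a, nn] = u / u.sum()               # enforce the sum-to-one constraint
    return W
```

In the full algorithm the embedding coordinates \(\underline U\) are then read off the bottom eigenvectors of \((\mathbb 1 - \underline W)^\top(\mathbb 1 - \underline W)\), discarding the constant eigenvector.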