Source Separation (ICA)
Independent Component Analysis (ICA)
ICA allows the reconstruction of the original source signals from observed mixtures, for example separating multiple speakers recorded on one audio track.
Requirements
 Needs some prior knowledge
 The number of sources to be recovered is a parameter to the algorithm. It is possible to choose a higher number of sources and afterwards remove sources that are only noise.
Limitations
 sources must have non-Gaussian distributions (at most one source may be Gaussian)
 source amplitudes (and the order of the sources) cannot be recovered
\(p\) number of observations (data points)
\(N\) number of dimensions
\(p \cdot N + N^2\) unknowns (the \(p \cdot N\) source values plus the \(N^2\) entries of the mixing matrix), but only \(p \cdot N\) observed values, hence the need for prior knowledge
Notation
\(\underline x\) observations
\(\underline A\) Mixing Matrix (must be invertible for ICA to work)
\(\underline s\) Sources
\(\hat{\underline s}\) Recovered/Estimated Sources
\(\underline W\) Unmixing matrix
General principle
\[\underline x = \underline A \cdot \underline s \implies \hat{\underline s} = \underline W \cdot \underline x\]
Different methods leverage different prior knowledge.
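A tiny numerical illustration of this relation (a made-up 2×2 example; with the true \(\underline A\) known, the ideal unmixing matrix is simply its inverse, which ICA has to find without knowing \(\underline A\)):

```python
import numpy as np

# Toy check of x = A s and s_hat = W x with W = A^{-1}.
# ICA has to estimate W from x alone, using prior knowledge about s.
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5))              # two non-Gaussian sources, 5 samples
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                # invertible mixing matrix
x = A @ s                                 # observed mixtures
W = np.linalg.inv(A)                      # ideal unmixing matrix
s_hat = W @ x
print(np.allclose(s, s_hat))              # True
```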
Cost functions
 Vanishing cross-correlation functions (QDIAG, FFDIAG) (more or less invented by the institute)
 non-Gaussianity (FastICA)
 Statistical independence: \(D_{KL}\) between \(\hat P_{\hat{\underline s}}(\hat{\underline s})\) and the product of its marginals \(\prod_i \hat P_{\hat s_i}(\hat s_i)\) (the mutual information)
 Infomax \(H(\hat{\underline u})\)
Infomax ICA
Variables
 See above
 \(P\) Probability distribution of the target space
 \(H\) our cost function
 \[\hat{u}_i = \hat f_i(\sum_j w_{ij}x_j)\]
 \(\hat u\) is the vector with the results of our approximation function
 \(\hat f_i\) is a freely chosen sigmoidal cdf that depends on \(\hat s_i\): \(\hat f'_i(\hat s_i) = \hat P_{s_i}(\hat s_i) \implies \hat f_i(\hat s_i) = \int_{-\infty}^{\hat s_i} \hat P_{s_i}(y)\, dy\)
 \(e^{(\alpha)}\) training error of a specific datapoint.
Infomax ICA uses the empirical risk minimization (ERM) framework and optimizes the cost function by gradient ascent.
Example for \(\hat f_i\)
\[\begin{align*} \hat f(y) & = \frac{1}{1+\exp(-y)} \\ \frac{\hat f''(y)}{\hat f'(y)} & = 1 - 2\hat f(y) \end{align*}\]
We want to maximize \(E^G\), which relies on the true distribution of the data. Since this distribution is unknown, we use the proxy error function \(E^T\), which replaces the integral over the true distribution with a sum over the data points. Otherwise the two are identical.
\[\begin{align*} E^G & = \ln|\det \underline W| + \int d\underline x\, P_{\underline x}(\underline x) \left\{\sum^N_{l=1}\ln \hat f'_l\!\left(\sum^N_{k=1} w_{lk}x_k\right)\right\} \\ E^T & = \ln|\det \underline W| + \frac{1}{p}\sum^p_{\alpha=1} \left\{\sum^N_{l=1}\ln \hat f'_l\!\left(\sum^N_{k=1} w_{lk}x^{(\alpha)}_k\right)\right\} \end{align*}\]
The Infomax cost function: mutual information measures how much information random variables share; it can equivalently be written as a KL divergence (between the joint distribution and the product of the marginals). We want to minimize the mutual information between the estimated sources because we assume the true sources are independent, and minimizing it corresponds to maximizing the joint output entropy. This is why we maximize the cost function
\[H = -\int d\hat{\underline u}\, P_{\underline u}(\hat{\underline u}) \ln P_{\underline u}(\hat{\underline u}) \overset{!}{=} \max\]
Possible exam task: independent source signals are separated better if their distributions match \(\hat f_i\).
Find a suitable \(\hat f_i\)
Gradient ascent for w
 Batch Learning \(\Delta w_{ij}=\varepsilon\frac{\partial E^T}{\partial w_{ij}} = \frac{\varepsilon}{p}\sum_{\alpha=1}^p \frac{\partial e^{(\alpha)}}{\partial w_{ij}}\) (sketched in code below)
 Online Learning \(\Delta w_{ij}=\varepsilon\frac{\partial e^{(\alpha)}}{\partial w_{ij}}\). A Taylor approximation is used because the matrix inverse appearing in the gradient is hard to compute.
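A minimal batch-gradient sketch of this update, assuming the logistic \(\hat f\) from above (so \(\hat f''/\hat f' = 1-2\hat f\)); the function name, step size, and iteration count are illustrative choices, not from the lecture:

```python
import numpy as np

# Batch gradient ascent on E^T (Infomax ICA with logistic nonlinearity).
# X: (N, p) array of centered observations (N dimensions, p data points).
def infomax_ica(X, eps=0.01, n_steps=1000):
    N, p = X.shape
    W = np.eye(N)                        # start from the identity unmixing matrix
    for _ in range(n_steps):
        U = W @ X                        # current source estimates
        F = 1.0 / (1.0 + np.exp(-U))     # logistic f, applied elementwise
        # dE^T/dW = (W^T)^{-1} + <(1 - 2 f(u)) x^T>
        grad = np.linalg.inv(W.T) + (1.0 - 2.0 * F) @ X.T / p
        W += eps * grad                  # gradient ascent step
    return W

# Usage: mix two Laplacian (super-Gaussian) sources and try to unmix them.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 2000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S
X -= X.mean(axis=1, keepdims=True)
W = infomax_ica(X)
S_hat = W @ X                            # approximate estimates, up to scaling/permutation
```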
Natural Gradient
The natural gradient is faster than normal gradient descent.
While normal gradient changes the parameters by a fixed rate, the natural gradient changes the outcome distribution by a constant distance. This distance in distributions is measured by the KL Divergence.^{1}
Step size is normalized at each step: \(d\underline Z = d\underline W \cdot \underline W^{-1}\)
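Applied to the Infomax update above, this gives the well-known natural-gradient rule (multiplying the ordinary gradient by \(\underline W^T \underline W\) from the right), which also avoids the explicit matrix inverse; a sketch reusing the variables from the previous code block:

```python
import numpy as np

# One natural-gradient Infomax step: (I + <(1 - 2 f(u)) u^T>) W
# instead of (W^T)^{-1} + <(1 - 2 f(u)) x^T>, so no matrix inverse is needed.
def infomax_natural_step(W, X, eps=0.01):
    N, p = X.shape
    U = W @ X
    F = 1.0 / (1.0 + np.exp(-U))
    delta = (np.eye(N) + (1.0 - 2.0 * F) @ U.T / p) @ W
    return W + eps * delta
```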
Second-Order (Blind) Source Separation
SOBSS allows the separation of sources using temporal shifts (or some other form of noise structure). The sources are not i.i.d. but are assumed to be time-correlated / sequential.
Principle (As ambiguous as necessary because I just transcribed this in the lecture) Try different \(w\) so that \(\hat s\) matches in height.
Parameters
 \(\tau\) time shift
How the cost function is minimized depends on the algorithm used on top.
For additional robustness, remove \(\tau = 0\) from the set of possible time shifts; this avoids some trivial solutions.
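As one concrete instance (my choice of algorithm for illustration; methods like QDIAG/FFDIAG mentioned above jointly diagonalize several lagged correlation matrices instead), a minimal AMUSE-style sketch that uses a single time lag \(\tau\):

```python
import numpy as np

# AMUSE-style second-order separation: whiten the data, then diagonalize one
# symmetrized time-lagged covariance matrix of the whitened data.
def sobss_amuse(X, tau=1):
    # X: (N, p) observations with temporal structure along axis 1
    X = X - X.mean(axis=1, keepdims=True)
    C0 = X @ X.T / X.shape[1]                  # zero-lag covariance
    d, E = np.linalg.eigh(C0)
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T    # whitening matrix
    Z = V @ X                                  # whitened data
    C_tau = Z[:, :-tau] @ Z[:, tau:].T / (Z.shape[1] - tau)
    C_sym = 0.5 * (C_tau + C_tau.T)            # symmetrize the lagged covariance
    _, U = np.linalg.eigh(C_sym)               # rotation separating the sources
    return U.T @ V                             # full unmixing matrix W
```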
FastICA
Prerequisites
 Whitened data
 PCA
 The kurtosis of the sources is known (prior knowledge)
Limitations
 sensitive to outliers
Additional Parameters
 \(\underline u\) the whitened observations
\(\tilde{\underline A}\) is the mixing matrix after whitening (\(\underline u = \tilde{\underline A}\,\underline s\)); it is an orthogonal matrix.
Find the maximum of the negentropy, approximated via a contrast function (here the kurtosis).
\[\begin{align*} \mathrm{kurt}(\hat s) & \overset{!}{=} \max_{\underline z} \quad \text{s.t.} \\ \mathrm{var}(\hat s) = z^2_1 + z_2^2 & \overset{!}{=} 1 \\ \mathrm{kurt}(\hat s) & = z^4_1 + z^4_2 \end{align*}\]
We can show that
\[\underline z_{opt} = \begin{pmatrix} 0\\ \pm 1 \end{pmatrix} \text{ or } \underline z_{opt} = \begin{pmatrix} \pm 1 \\ 0 \end{pmatrix}\]
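A quick way to verify this optimum (my addition, via a Lagrange multiplier \(\lambda\)):

\[\begin{align*} L(\underline z, \lambda) & = z_1^4 + z_2^4 - \lambda\,(z_1^2 + z_2^2 - 1) \\ \frac{\partial L}{\partial z_i} = 4 z_i^3 - 2\lambda z_i & = 0 \implies z_i = 0 \ \text{ or } \ z_i^2 = \tfrac{\lambda}{2} \end{align*}\]
The admissible stationary points are \((\pm 1, 0)\) and \((0, \pm 1)\) with value \(1\) (the maxima), and \((\pm\tfrac{1}{\sqrt 2}, \pm\tfrac{1}{\sqrt 2})\) with value \(\tfrac{1}{2}\) (the minimum).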
Optimize for kurtosis:
 Batch learning
 Online Learning
Batch Learning
 Initialize \(\underline b\) with a random vector of unit length
 loop (numerical sketch below)
 \[\Delta \underline b = \varepsilon\, \mathrm{sgn}\!\left[\mathrm{kurt}(\underline b^T\underline u)\right] \left\langle \underline u\, (\underline b^T \underline u)^3\right\rangle\]
 normalize \(\underline b\)
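A minimal numerical version of this loop (one extracted component; \(\underline u\) is assumed to hold whitened data as columns, and the step size / iteration count are illustrative):

```python
import numpy as np

# Kurtosis-based batch update for a single FastICA-style component.
# U: (N, p) array of whitened observations.
def extract_component(U, eps=0.1, n_steps=200):
    N, p = U.shape
    rng = np.random.default_rng(0)
    b = rng.normal(size=N)
    b /= np.linalg.norm(b)                  # random unit-length initialization
    for _ in range(n_steps):
        s_hat = b @ U                       # projected signal b^T u, shape (p,)
        kurt = np.mean(s_hat**4) - 3.0 * np.mean(s_hat**2)**2
        grad = (U * s_hat**3).mean(axis=1)  # <u (b^T u)^3>
        b += eps * np.sign(kurt) * grad     # step toward larger |kurtosis|
        b /= np.linalg.norm(b)              # renormalize to unit length
    return b
```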
Kurtosis
 <0: sub-Gaussian (flatter than a Gaussian; looks like a rectangle), e.g. the uniform distribution
 0: Gaussian, i.e. the normal distribution
 >0: super-Gaussian (more peaked than a Gaussian; looks like a triangle), e.g. the Laplace distribution (checked numerically below)
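A quick numerical check of these signs, using SciPy's excess kurtosis (which is 0 for a Gaussian):

```python
import numpy as np
from scipy.stats import kurtosis   # excess (Fisher) kurtosis

rng = np.random.default_rng(0)
n = 100_000
print(kurtosis(rng.uniform(-1, 1, size=n)))  # < 0: sub-Gaussian (~ -1.2)
print(kurtosis(rng.normal(size=n)))          # ~ 0: Gaussian
print(kurtosis(rng.laplace(size=n)))         # > 0: super-Gaussian (~ 3)
```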
Footnotes

Kevin Frans 2016, A[sic!] intuitive explanation of natural gradient descent ↩