# Estimation theory - Kernel Density Estimation

## Density Estimation

The goal of density estimation is to estimate, from a finite set of data points, the probability density at every point of the vector space.

There are two approaches:

- parametric (model based)
  - Gaussian densities
- nonparametric (data driven)
  - kernel density estimate

### Kernel Density Estimation (illustrated with the Gliding Histogram)

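**Parameter**

- $h$: width of the rectangle

Histogram Kernel

$$H(\underline u) = \begin{cases} 1, & |u_j| < \tfrac{1}{2} \text{ for all } j = 1, \dots, d \\ 0, & \text{otherwise} \end{cases} \qquad \text{with} \qquad \underline u = \frac{\underline x - \underline x^{(\alpha)}}{h}$$

- $\underline x$ are the coordinates at which we want to measure the density
- $\underline u$ is the distance between $\underline x$ and the data point $\underline x^{(\alpha)}$, normalized by $h$. The threshold is $1/2$ because a rectangle of width $h$ centered at $\underline x$ extends only $h/2$ to each side.

The kernel answers one question: does the vector $\underline u$ end outside our rectangle with width $h$? It returns 0 if it does and 1 if the data point lies inside.

The estimation of density

$$\hat{P}(\underline x) = \frac{1}{N h^d} \sum_{\alpha=1}^{N} H\left(\frac{\underline x - \underline x^{(\alpha)}}{h}\right)$$

- $h$: width of the rectangle
- $d$: number of dimensions
- $N$: number of data points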
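A minimal sketch of the gliding histogram may make the counting explicit. It assumes the usual form of the estimator: count the data points that fall into a hypercube of width `h` centered at the query point, then divide by the number of points times the volume `h ** d`. The function names `histogram_kernel` and `gliding_histogram` and the toy data are chosen here for illustration only.

```python
def histogram_kernel(u):
    """Return 1.0 if every component of u lies strictly inside (-1/2, 1/2), else 0.0."""
    return 1.0 if all(abs(u_j) < 0.5 for u_j in u) else 0.0


def gliding_histogram(x, data, h):
    """Estimate the density at point x from the sample `data` with window width h."""
    d = len(x)              # number of dimensions
    n_points = len(data)    # number of data points
    inside = sum(
        histogram_kernel([(x_j - p_j) / h for x_j, p_j in zip(x, point)])
        for point in data
    )
    # Normalize by the number of points and the volume of the hypercube.
    return inside / (n_points * h ** d)


# Five 1-D data points; the window of width 1.0 around x = 0.0 covers (-0.5, 0.5).
data = [[-0.4], [-0.1], [0.2], [0.9], [1.5]]
print(gliding_histogram([0.0], data, h=1.0))  # 3 of 5 points fall inside -> 0.6
```

Moving the query point slightly changes nothing until a data point crosses the edge of the window, at which point the estimate jumps by `1 / (n_points * h ** d)`.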

**Drawbacks of Gliding Histograms**

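- “Bumpy”: the estimate jumps whenever a new data point falls into the rectangle (especially with few data points or high dimensionality)
- The rectangle is not really a good choice of kernel: every point inside counts equally, no matter how far it is from the center
- The optimal size of $h$ is non-trivial and needs model selection. A lower $h$ leads to overfitting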
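One possible way to select the width is to score each candidate on data that was not used to build the estimate. The sketch below is an assumption, not a method prescribed by these notes: it uses the average held-out log-likelihood as the score, a one-dimensional sample drawn from a standard Gaussian, and a hypothetical list of candidate widths.

```python
import math
import random


def gliding_histogram_1d(x, data, h):
    """Fraction of data points within h/2 of x, divided by the window width h."""
    inside = sum(1 for point in data if abs(x - point) / h < 0.5)
    return inside / (len(data) * h)


def heldout_log_likelihood(train, valid, h):
    """Average log-density of the validation points under the estimate built from train."""
    total = 0.0
    for x in valid:
        p = gliding_histogram_1d(x, train, h)
        if p == 0.0:
            return float("-inf")  # an empty window makes the held-out point impossible
        total += math.log(p)
    return total / len(valid)


rng = random.Random(0)  # fixed seed so that the run is reproducible
sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
train, valid = sample[:300], sample[300:]

candidates = [0.01, 0.05, 0.2, 0.5, 1.0, 2.0, 5.0]  # hypothetical candidate widths
scores = {h: heldout_log_likelihood(train, valid, h) for h in candidates}
best_h = max(scores, key=scores.get)

for h in candidates:
    print(f"h = {h:<5} held-out log-likelihood = {scores[h]:.3f}")
print("selected h:", best_h)
```

A very small width tends to leave some held-out points in empty windows, which is the overfitting case, while a very large width smears the density out and also scores poorly.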

**Alternatively Gaussian**

A Gaussian kernel can be used instead of the rectangle, which reduces most of these side effects.
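As a sketch of the Gaussian variant, the rectangle can be swapped for a standard isotropic Gaussian, with `h` now acting as the standard deviation of each bump instead of a hard window width. This parameterization, the function names, and the toy data are assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard isotropic Gaussian density evaluated at the vector u."""
    d = len(u)
    squared_norm = sum(u_j * u_j for u_j in u)
    return math.exp(-0.5 * squared_norm) / (2.0 * math.pi) ** (d / 2.0)


def gaussian_kde(x, data, h):
    """Estimate the density at point x as an average of Gaussian bumps of scale h."""
    d = len(x)
    total = sum(
        gaussian_kernel([(x_j - p_j) / h for x_j, p_j in zip(x, point)])
        for point in data
    )
    return total / (len(data) * h ** d)


# The estimate now changes smoothly as the query point moves.
data = [[-0.4], [-0.1], [0.2], [0.9], [1.5]]
for x in (0.0, 0.25, 0.5, 0.75):
    print(x, round(gaussian_kde([x], data, h=0.5), 4))
```

Every data point contributes to every query point, with a weight that decays with distance, so there are no jumps. The choice of `h` still needs model selection.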

### Parametric Density Estimation

TODO: Figure out what and mean (they compose )

Parametric density estimation finds a good value for the parameter vector $\underline w$.

Family of parametric density functions: $$\hat{P}(\underline x;\underline w)$$

**Cost function for model selection**

Problem: Minimizing the training costs leads to overfitting

==> We need the generalization costs, but they rely on knowledge of the true density $P(\underline x)$, which is unknown ==> Use a proxy function

Alternative approach: Select the model that gives the highest probability for the already known data points (maximum likelihood).

The optimization can probably be done with simple gradient descent on the negative log-likelihood.
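To illustrate the maximum-likelihood idea, here is a minimal sketch that fits a one-dimensional Gaussian by gradient descent on the average negative log-likelihood. The model family, the parameterization through `log_sigma` (which keeps the standard deviation positive), the learning rate, and the step count are all assumptions made for this example. For a Gaussian the optimum is also known in closed form (sample mean and biased sample standard deviation), which gives a convenient check.

```python
import math
import random


def negative_log_likelihood(data, mu, log_sigma):
    """Average negative log-likelihood of the data under N(mu, exp(log_sigma)^2)."""
    sigma2 = math.exp(2.0 * log_sigma)
    return sum(
        0.5 * math.log(2.0 * math.pi) + log_sigma + (x - mu) ** 2 / (2.0 * sigma2)
        for x in data
    ) / len(data)


def fit_gaussian(data, learning_rate=0.1, steps=2000):
    """Minimize the average negative log-likelihood with plain gradient descent."""
    mu, log_sigma = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        sigma2 = math.exp(2.0 * log_sigma)
        mean_residual = sum(x - mu for x in data) / n
        mean_sq_residual = sum((x - mu) ** 2 for x in data) / n
        grad_mu = -mean_residual / sigma2               # derivative w.r.t. mu
        grad_log_sigma = 1.0 - mean_sq_residual / sigma2  # derivative w.r.t. log_sigma
        mu -= learning_rate * grad_mu
        log_sigma -= learning_rate * grad_log_sigma
    return mu, math.exp(log_sigma)


rng = random.Random(1)  # fixed seed so that the run is reproducible
data = [rng.gauss(2.0, 1.5) for _ in range(500)]

mu_hat, sigma_hat = fit_gaussian(data)
sample_mean = sum(data) / len(data)
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in data) / len(data))

print("gradient descent:", mu_hat, sigma_hat)
print("closed form:     ", sample_mean, sample_std)
```

For model families without a closed-form solution, the same loop applies once the gradient of the log-likelihood with respect to the parameters is available.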

**Conditions for multivariate cases**