## Density Estimation

The goal of density estimation is to estimate the probability density at every point of the vector space.

There are two approaches:

• parametric (model based)
  • Gaussian densities
• nonparametric (data driven)
  • Kernel Density Estimate

### Kernel Density Estimation (example: Gliding Histogram)

Parameter

• $$h$$ width of rectangle

Histogram Kernel $$H$$

• $$\underline x$$ is the point at which we want to estimate the density
• $$\underline u = \frac{\underline x - \underline x^{(\alpha)}}{h}$$ is the distance between two points, scaled by the width $$h$$; the bound $$|u_j|<\frac{1}{2}$$ then means the data point lies within $$h/2$$ of $$\underline x$$ in every coordinate, i.e. inside the cube of width $$h$$ centred at $$\underline x$$

The kernel asks: does $$\underline u$$ lie inside the unit hypercube, i.e. does the data point fall inside our rectangle of width $$h$$?

\begin{align*} H(\underline u) = \begin{cases} 1,&|u_j|<\frac{1}{2},\forall j\in 1,...,n\\ 0,&else \end{cases} \end{align*}

The estimation of density

• $$h$$ width of the rectangle
• $$n$$ number of dimensions
$\hat{P}(\underline x) = \underbrace{\frac{1}{h^n}}_{\text{divide by hypercube volume}} \cdot \frac{1}{p}\sum^p_{\alpha=1}H\left(\frac{\underline x- \underline x^{(\alpha)}}{h}\right)$
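A minimal numpy sketch of this estimator (function and variable names are my own):

```python
import numpy as np

def histogram_kernel(u):
    """H(u): 1 if u lies inside the unit hypercube centred at the origin, else 0."""
    return np.all(np.abs(u) < 0.5, axis=-1).astype(float)

def gliding_histogram(x, data, h):
    """P_hat(x): fraction of data points inside the cube of width h around x,
    divided by the cube's volume h^n."""
    p, n = data.shape
    u = (x - data) / h                      # scaled distances to every data point
    return histogram_kernel(u).sum() / (p * h**n)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(1000, 1))   # 1-D standard normal data
density_at_0 = gliding_histogram(np.array([0.0]), samples, h=0.5)
```

Note that the estimate is a simple count inside the window, which is why it jumps whenever a point crosses the window boundary.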

Drawbacks of Gliding Histograms

• “Bumpy”: the estimate jumps whenever a data point enters or leaves the rectangle (especially with few data points or high dimensionality)
• Rectangle not really a good choice
• Optimal size of $$h$$ is non-trivial and needs model selection: a smaller $$h$$ leads to overfitting, a larger one oversmooths

**Alternative: Gaussian kernel**

A Gaussian kernel can be used instead of the rectangle, which removes the discontinuities and yields a smooth density estimate.
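The same estimator with a Gaussian kernel, as a sketch (the normalization is that of the standard multivariate Gaussian; names are my own):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard multivariate Gaussian: smooth in u, so the estimate has no jumps."""
    n = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2 * np.pi) ** (n / 2)

def gaussian_kde(x, data, h):
    """Same formula as the gliding histogram, only with H replaced by a Gaussian."""
    p, n = data.shape
    u = (x - data) / h
    return gaussian_kernel(u).sum() / (p * h**n)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(1000, 1))
density_at_0 = gaussian_kde(np.array([0.0]), samples, h=0.3)
```

Every data point now contributes a smooth bump rather than a hard in/out count, so moving $$\underline x$$ changes the estimate continuously.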

### Parametric Density Estimation

For a Gaussian model the parameter vector $$\underline w$$ collects the mean $$\underline \mu$$ and the covariance matrix $$\underline \Sigma$$; the starred values $$\underline\mu^*, \underline\Sigma^*$$ below are their optimal estimates.

Parametric density estimation fits the parameters $$\underline w$$ of an assumed family of densities to the data; the same model-selection machinery can also be used to find a good value for $$h$$ in KDE.

Family of parametric density functions: $$\hat{P}(\underline x;\underline w)$$

Cost function for model selection

$E^T = -\frac{1}{p}\sum^p_{\alpha=1}\ln\hat{P}(\underline x^{(\alpha)};\underline w) \overset{!}{=}\underset{\underline w}{\min}$

Problem: Minimizing the training cost leads to overfitting

==> We need $$E^G$$, the generalization cost, but it relies on knowledge of the true $$P$$ ==> use a proxy function

$\hat{E}^G=\frac{1}{p}\sum^p_{\beta=1}e^{(\beta)}$

where $$e^{(\beta)}$$ is the cost on a held-out data point $$\underline x^{(\beta)}$$ (cross-validation).
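As a concrete sketch of this proxy, assuming $$e^{(\beta)}$$ is the negative log-likelihood of a held-out point (leave-one-out cross-validation, here applied to selecting the KDE bandwidth $$h$$ with a Gaussian kernel; the function name is my own):

```python
import numpy as np

def loo_cost(data, h):
    """Proxy E^G: average of e^(beta) = -ln P_hat(x^(beta)), where x^(beta) is
    left out of its own kernel density estimate (Gaussian kernel, bandwidth h)."""
    p, n = data.shape
    cost = 0.0
    for beta in range(p):
        rest = np.delete(data, beta, axis=0)            # hold out point beta
        u = (data[beta] - rest) / h
        k = np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2 * np.pi) ** (n / 2)
        density = k.sum() / ((p - 1) * h**n)
        cost += -np.log(density + 1e-300)               # guard against log(0)
    return cost / p

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 1))
costs = {h: loo_cost(data, h) for h in (0.05, 0.5, 3.0)}
```

Both a tiny $$h$$ (overfitting: held-out points receive almost no density) and a huge $$h$$ (oversmoothing) give a larger proxy cost than a moderate one, which is exactly the model-selection behaviour we wanted.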

Alternative approach (maximum likelihood): select the parameters that assign the highest probability to the already observed data points.

Assuming i.i.d. data the joint probability factorizes; taking the negative logarithm turns the product into a sum, so maximum likelihood is exactly the minimization of $$E^T$$:

$\hat{P}(\{\underline x^{(\alpha)}\};\underline w )=\prod^p_{\alpha=1}\hat{P}(\underline x^{(\alpha)};\underline w)\overset{!}{=}\underset{\underline w}{\max} \\ \iff -\sum^p_{\alpha=1}\ln\hat{P}(\underline x^{(\alpha)};\underline w)\overset{!}{=}\underset{\underline w}{\min}$

For a Gaussian density family, setting the derivatives of $$E^T$$ with respect to the parameters to zero gives closed-form solutions:

\begin{align*} \frac{\partial E^T}{\partial \underline\mu} & =\underline 0 & \implies \underline \mu ^* & = \frac{1}{p}\sum^p_{\alpha=1}\underline x ^{(\alpha)} \\ \frac{\partial E^T}{\partial \underline\Sigma} & =\underline 0 & \implies \underline \Sigma ^* & = \frac{1}{p}\sum^p_{\alpha=1}(\underline x ^{(\alpha)}-\underline \mu ^*)(\underline x^{(\alpha)}-\underline \mu^*)^T \\ \end{align*}
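These closed-form solutions are just the sample mean and the (biased, divide-by-$$p$$) sample covariance, which is easy to check numerically (the example data below is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=5000)

p = data.shape[0]
mu_star = data.mean(axis=0)                 # mu* = sample mean
centered = data - mu_star
sigma_star = centered.T @ centered / p      # Sigma* = biased sample covariance

# agrees with numpy's ML (biased) covariance estimator
assert np.allclose(sigma_star, np.cov(data, rowvar=False, bias=True))
```

With enough samples both estimates land close to the true parameters of the generating Gaussian.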