Density Estimation

The goal of density estimation is to estimate, from a finite set of data points, the probability density at every point \(\underline x\) of the vector space.

There are two approaches:

  • parametric (model based)
    • Gaussian Densities
  • nonparametric (data driven)
    • Kernel Density Estimate

Kernel Density Estimation (illustrated with the Gliding Histogram)


  • \(h\) width of the rectangle

Histogram Kernel \(H\)

  • \(\underline x\) are the coordinates at which we want to measure the density
  • \(\underline u\) is the distance between two points, normalized by \(h\); the bound \(|u_j|<\frac{1}{2}\) means the kernel covers a unit hypercube centered at the origin, which corresponds to a cube of width \(h\) in the original coordinates

The kernel \(H\) checks whether the vector \(\underline u\) lands inside our rectangle of width \(h\) (after rescaling by \(h\)): 1 if inside, 0 if outside.

\[H(\underline u) = \begin{cases} 1,&|u_j|<\frac{1}{2}\ \forall j\in\{1,\dots,n\}\\ 0,&\text{otherwise} \end{cases}\]

The density estimate

  • \(h\) width of the rectangle
  • \(n\) number of dimensions
  • \(p\) number of data points
\[\hat{P}(\underline x) = \underset{\text{divide by the volume of the hypercube with width }h}{\frac{1}{h^n}} \cdot \frac{1}{p}\sum^p_{\alpha=1}H\left(\frac{\underline x- \underline x^{(\alpha)}}{h}\right)\]
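
A minimal sketch of this estimator (not from the notes; the function names and the `(p, n)` data layout are my assumptions):

```python
import numpy as np

def histogram_kernel(u):
    # H(u) = 1 iff every component of u lies within (-1/2, 1/2)
    return np.all(np.abs(u) < 0.5, axis=-1).astype(float)

def gliding_histogram_kde(x, data, h):
    # data: (p, n) array of samples, x: (n,) query point, h: rectangle width
    p, n = data.shape
    u = (x - data) / h                         # one normalized distance per sample
    return histogram_kernel(u).sum() / (p * h**n)

# 1-D sanity check: the standard normal density at 0 is about 0.399
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 1))
print(gliding_histogram_kde(np.array([0.0]), data, h=0.5))
```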

Drawbacks of Gliding Histograms

  • “Bumpy”: the estimate jumps whenever a data point enters or leaves the rectangle (especially with few data points or high dimensionality)
  • The rectangle is not really a good choice: every point inside counts equally and everything outside is ignored
  • The optimal size of \(h\) is non-trivial and needs model selection: a lower \(h\) leads to overfitting (a spiky estimate), a higher \(h\) oversmooths

**Alternative: Gaussian kernel**

A Gaussian kernel can be used instead of the rectangle, which smooths the estimate and removes most of the side effects above, as the sketch below shows.
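
A minimal sketch of the Gaussian variant, reusing the layout above (my naming; the \((2\pi)^{n/2}\) normalization makes the kernel itself a density, so \(\hat P\) still integrates to 1):

```python
import numpy as np

def gaussian_kernel(u):
    # standard normal density in n dimensions: smooth and never exactly zero
    n = u.shape[-1]
    return np.exp(-0.5 * (u**2).sum(axis=-1)) / (2 * np.pi) ** (n / 2)

def gaussian_kde(x, data, h):
    # same estimator as before, only the hard rectangle H is swapped for a Gaussian
    p, n = data.shape
    u = (x - data) / h
    return gaussian_kernel(u).sum() / (p * h**n)
```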

Parametric Density Estimation

\(\underline w\) collects the model parameters; for a Gaussian model these are the mean vector \(\underline\mu\) and the covariance matrix \(\underline\Sigma\), whose optimal values \(\underline\mu^*\) and \(\underline\Sigma^*\) are derived below.

Instead of tuning a kernel width \(h\), parametric density estimation fits the parameters \(\underline w\) of a chosen model family to the data.

Family of parametric density functions: \(\hat{P}(\underline x;\underline w)\)

Cost function for fitting the parameters (the training error):

\[E^T = -\frac{1}{p}\sum^p_{\alpha=1}\ln\hat{P}(\underline x^{(\alpha)};\underline w) \overset{!}{=}\min_{\underline w}\]
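
As a concrete instance (my sketch, assuming a 1-D Gaussian model with \(\underline w = (\mu, \sigma)\)):

```python
import numpy as np

def training_cost(data, mu, sigma):
    # E^T = -(1/p) * sum_alpha ln P_hat(x^(alpha); w)
    # for a 1-D Gaussian with parameters w = (mu, sigma)
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2)
    return -log_p.mean()
```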

Problem: Minimizing the training costs leads to overfitting

==> We need \(E^G\), the generalization cost, but computing it relies on knowledge of the true density \(P\) ==> use a proxy (e.g. evaluate the cost on held-out data)


Alternative approach (maximum likelihood): select the parameters that give the highest probability to the already observed data points.

\[\hat{P}(\{\underline x^{(\alpha)}\};\underline w ) = \prod^p_{\alpha=1}\hat{P}(\underline x^{(\alpha)};\underline w)\overset{!}{=}\max_{\underline w} \\ \Leftrightarrow -\sum^p_{\alpha=1}\ln\hat{P}(\underline x^{(\alpha)};\underline w)\overset{!}{=}\min_{\underline w}\]

Assuming i.i.d. data the joint likelihood factorizes into a product; taking the negative logarithm turns it into a sum, which (up to the factor \(\frac{1}{p}\)) is exactly \(E^T\).

The minimization can be done with simple gradient descent (see the sketch below); for the Gaussian family a closed-form solution exists as well.
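
A hypothetical 1-D demo of that gradient descent (parametrizing \(\underline w = (\mu, \ln\sigma)\) so that \(\sigma\) stays positive; learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=500)        # synthetic training data

mu, log_sigma = 0.0, 0.0                            # initial guess for w
lr = 0.1
for _ in range(2000):
    sigma2 = np.exp(2 * log_sigma)
    grad_mu = (mu - x.mean()) / sigma2              # dE^T / d mu
    grad_ls = 1.0 - ((x - mu) ** 2).mean() / sigma2 # dE^T / d(ln sigma)
    mu -= lr * grad_mu
    log_sigma -= lr * grad_ls

print(mu, np.exp(log_sigma))   # converges to the sample mean and std
```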

Stationarity conditions for the multivariate Gaussian case

\[\begin{align*} \frac{\partial E^T}{\partial \underline\mu} & =\underline 0 & \implies \underline \mu ^* & = \frac{1}{p}\sum^p_{\alpha=1}\underline x ^{(\alpha)} \\ \frac{\partial E^T}{\partial \underline\Sigma} & =\underline 0 & \implies \underline \Sigma ^* & = \frac{1}{p}\sum^p_{\alpha=1}(\underline x ^{(\alpha)}-\underline \mu ^*)(\underline x^{(\alpha)}-\underline \mu^*)^T \\ \end{align*}\]
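
The closed-form result in code (my sketch; note the \(\frac{1}{p}\) normalization from the formula above, unlike `np.cov`, which defaults to \(\frac{1}{p-1}\)):

```python
import numpy as np

def gaussian_ml_estimates(data):
    # data: (p, n) array; returns the closed-form minimizers of E^T
    mu_star = data.mean(axis=0)                     # mu* = (1/p) sum_alpha x^(alpha)
    centered = data - mu_star
    sigma_star = centered.T @ centered / len(data)  # Sigma* = (1/p) sum (x - mu*)(x - mu*)^T
    return mu_star, sigma_star
```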

Mixture Models - EM