Support Vector Machines

These are exam preparation notes, subpar in quality and certainly not of divine quality.

See the index with all articles in this series

If you are stuck, read Wikipedia in parallel.

The goal of SVMs is to divide two groups with a line that separates the data points as clearly as possible.

There are two cases:

Data points can be cleanly split into their classes
At least some data points overlap making it impossible to lay a line between datapoints.

Clean split

With an SVM you want to make a binary classification of points and predict class membership.

two point clouds separated by multiple candidate lines

From ZackWeinberg Wikipedia CC BY_SA 3.0

The label is assigned by calculating on which side of a hyperplane a point lies.

\[y = sign(\underline w^T \underline x + b)\]

Where \(\underline w\) and \(b\) are your set of parameters for a linear connectionist neuron.

Our goal is to find a line that cuts as “clean” as possible throught the two groups. Finding lines that separate the two groups

\[\underline w^T \underline x + b = 0\]

is ambiguous as can be seen in the graphic above (H2 and H3). In this graphic H3 is the optimal solution. How do we maximize the distance to the two point clouds?

\[\underset{\alpha = 1,...,p}{min} | \underline w ^T \underline x^{(\alpha)}+b| \overset{!}{=}1\]

What we thereby do is maximize the distance of the split line to the closest points. This is very prone to outliers, as the closest values determine the parameters of the line.

Optimization problems

Primal
\[d_w = \frac {1}{||\underline w || } \overset {!}{=} max\]
Dual: Provides lower bound to the solution of the primal problem
\[TODO\]

Unclean split - C-SVM

When we cannot put a straight line through the data points, we cannot only optimize for the above, but need to add a regularization term. (Kind of adding prior knowledge)

\[\frac{1}{2}||\underline w || ^2 + \frac {C}{p}\sum^p_{\alpha=1}\varphi_\alpha \overset{!}{=}min\]

with

C regularization parameter
\(\varphi\) (squared???) error
Kernels

Kernels are often used to identify more complex shapes. Typical kernel functions are:

polynomial of degree d
RBF kernel with range \(\sigma\)
neural network (excluded from exam)
plummer kernel (excluded from exam)

Multiclass classification

As SVMs were designed as binary classifiers, they cannot be directly used for multiclass classification.

What is usually done is combining multiple binary classifiers via a couple of perceptrons.