Kernel Trick
Kernel Trick - Intuition
The kernel trick is a key innovation based on the observation that many machine learning algorithms can be written in terms of dot products between training examples. For example, it can be shown that the linear function θ^T x used by a linear SVM or binomial logistic regression can be rewritten as:
- f(x) = θ^T x
- f(x) = Σ_{i=1}^{n} α_i x^T x_i
where:
- n - is the number of training examples
- x_i - is the i-th training example
- α_i - is a weight coefficient (the rewrite relies on the optimal θ being expressible as a linear combination of the training examples, θ = Σ_{i=1}^{n} α_i x_i)
We could replace:
- x with the output of a feature function φ(x)
- x^T x_i with a kernel function defined as a dot product, i.e. k(x, x_i) = φ(x)^T φ(x_i)
After replacing dot products with kernel evaluations, we can make predictions using the function:
- f(x) = Σ_{i=1}^{n} α_i k(x, x_i)
The kernel-based function is exactly equivalent to preprocessing the data by applying φ(x) to all inputs and then learning a linear model in the new, transformed space.
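As a minimal sketch of this kernelized prediction function (assuming the coefficients α_i have already been learned; the data and function names below are only illustrative):

```python
import numpy as np

def kernel_predict(x, X_train, alphas, kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x, x_i)."""
    return sum(a * kernel(x, x_i) for a, x_i in zip(alphas, X_train))

def linear_kernel(a, b):
    # With a linear kernel this reduces to the ordinary dot-product model theta^T x.
    return float(np.dot(a, b))

X_train = np.array([[1.0, 2.0], [3.0, 0.5]])   # toy "training examples"
alphas = np.array([0.5, -0.2])                 # toy "learned" coefficients
print(kernel_predict(np.array([1.0, 1.0]), X_train, alphas, linear_kernel))
```

Swapping `linear_kernel` for any other kernel function changes the model without touching the prediction code.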
Defining a new kernel function k allows us:
- to learn models that are nonlinear as a function of x using convex optimization techniques (the optimization algorithm can view the decision function as being linear in a different space)
- to use a kernel k that is computationally more efficient than naively constructing two φ(x) vectors and explicitly taking their dot product (see below)
φ(x) can even be infinite-dimensional, which would make the naive explicit approach infinitely expensive. In many cases, however, the kernel function k(x, x') is a tractable nonlinear function of x even when φ(x) is intractable.
Computational Efficiency of Kernel Functions
based on: https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
Kernels are the secret sauce that makes SVMs work
So far:
- for linearly separable data SVMs work amazingly well
- for data that’s almost linearly separable, SVMs can still be made to work pretty well by using the right value of 𝐶
- for data that's not linearly separable, we can project it to a space where it is perfectly/almost linearly separable, which reduces the problem to one of the first two cases and we are back in business
It looks like a big part of what makes SVMs universally applicable is projecting the data to higher dimensions. And this is where kernels come in.
A very surprising aspect of SVMs is that in all of the mathematical machinery they use, the exact projection, or even the number of dimensions, doesn't show up. You could write all of it in terms of the dot products between pairs of data points (represented as vectors). For p-dimensional vectors x_i and x_j, where the first subscript on a dimension identifies the point and the second indicates the dimension number:
- x_i = (x_{i1}, x_{i2}, ..., x_{ip})
- x_j = (x_{j1}, x_{j2}, ..., x_{jp})
The dot product is defined as:
- x_i · x_j = x_{i1}x_{j1} + x_{i2}x_{j2} + ... + x_{ip}x_{jp}
If we have n points in our dataset, the SVM needs only the dot product of each pair of points to find a classifier. Just that. This is also true when we want to project data to higher dimensions. We don’t need to provide the SVM with exact projections; we need to give it the dot product between all pairs of points in the projected space.
This is relevant because this is exactly what kernels do. A kernel, short for kernel function, takes as input two points in the original space, and directly gives us the dot product in the projected space.
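To make "the SVM only needs pairwise dot products" concrete, here is a small NumPy sketch (toy data) that builds the matrix of all pairwise dot products, usually called the Gram or kernel matrix:

```python
import numpy as np

# Toy dataset: n points in p dimensions (values are arbitrary).
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])

# Gram matrix: entry (i, j) is the dot product x_i . x_j.
# An n x n table like this is all the pairwise information the SVM needs.
G = X @ X.T
print(G)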
Let's revisit the projection we did before and see if we can come up with a corresponding kernel. We will also track the number of computations needed to perform the projection and then find the dot products, to see how using a kernel compares.
For a point x_i:
- x_i = (x_{i1}, x_{i2})
Our corresponding projected point was:
- X_i = (x_{i1}^2, x_{i2}^2, √2·x_{i1}x_{i2})
To compute this projection we need to perform the following operations:
- To get the new first dimension: 1 multiplication
- Second dimension: 1 multiplication
- Third dimension: 2 multiplications
In all, 1+1+2 = 4 multiplications.
The dot product in the new space is:
- X_i · X_j = X_{i1}X_{j1} + X_{i2}X_{j2} + X_{i3}X_{j3}
To compute this dot product for two points i and j, we need to compute their projections first. So that’s 4+4 = 8 multiplications, and then the dot product itself requires 3 multiplications and 2 additions.
In all, that’s:
- Multiplications: 8 (for the projections) + 3 (in the dot product) = 11 multiplications
- Additions: 2 (in the dot product)
Which is total of 11 + 2 = 13 operations.
I claim this kernel function gives me the same result:
- K(x_i, x_j) = (x_i · x_j)^2
We take the dot product of the vectors in the original space first, and then square the result.
Let's expand it out and check whether my claim is indeed true:
- K(x_i, x_j) = (x_i · x_j)^2
- K(x_i, x_j) = (x_{i1}x_{j1} + x_{i2}x_{j2})^2
- K(x_i, x_j) = x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 + 2 x_{i1}x_{j1}x_{i2}x_{j2}
- K(x_i, x_j) = (x_{i1}^2, x_{i2}^2, √2·x_{i1}x_{i2}) · (x_{j1}^2, x_{j2}^2, √2·x_{j1}x_{j2})
It is. How many operations does this need? Look at the second line above: to compute the dot product in the original two-dimensional space we need 2 multiplications and 1 addition. Squaring it is one more multiplication.
So, in all:
- Multiplications: 2 (for the dot product in the original space) + 1 (for squaring the result) = 3 multiplications
- Additions: 1 (for the dot product in the original space)
A total of 3 + 1 = 4 operations. This is only 31% of the operations we needed before.
It looks like it is faster to use a kernel function to compute the dot products we need. It might not seem like a big deal here: we're looking at 4 vs 13 operations. But with higher-dimensional input points, and a projected space with an even higher number of dimensions, the computational savings on a large dataset add up incredibly fast. So that's one huge advantage of using kernels.
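Here is a quick numerical check of that equivalence, as a NumPy sketch with arbitrary sample points:

```python
import numpy as np

def phi(x):
    """Explicit projection (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(xi, xj):
    """Degree-2 polynomial kernel: the squared dot product in the original space."""
    return np.dot(xi, xj) ** 2

xi, xj = np.array([3.0, 1.0]), np.array([2.0, -1.0])
print(np.dot(phi(xi), phi(xj)))  # project explicitly, then take the dot product
print(k(xi, xj))                 # kernel evaluation: same value (up to rounding), fewer operations
```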
Most SVM libraries already come pre-packaged with some popular kernels like the Polynomial, Radial Basis Function (RBF), and Sigmoid kernels. When we don't use a projection and simply compute the dot products in the original space, this is referred to as using the linear kernel.
Many of these kernels give you additional levers to further tune them for your data. For example, the polynomial kernel:
- K(x_i, x_j) = (x_i · x_j + c)^d
allows you to pick the values of c and d (the degree of the polynomial). For the 3D projection above, I had used a polynomial kernel with c = 0 and d = 2.
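As a sketch of how this maps onto a concrete library: scikit-learn's polynomial kernel is (gamma · x_i·x_j + coef0)^degree, so degree plays the role of d and coef0 the role of c. The toy data and parameter values below are only illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy XOR-style data: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# With gamma=1, coef0 corresponds to c and degree to d in the formula above.
clf = SVC(kernel="poly", degree=2, coef0=0.0, gamma=1.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))
```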
But we are not done with the awesomeness of kernels yet!
Remember I mentioned projecting to infinite dimensions a while back? If you haven’t already guessed, the way to make it work is to have the right kernel function. That way, we really don’t have to project the input data, or worry about storing infinite dimensions.
A kernel function computes what the dot product would be if you had actually projected the data.
The RBF kernel is commonly used for a specific infinite-dimensional projection. We won't go into the math of it here, but see the source article linked above for details.
How can we have infinite dimensions but still compute the dot product? If you find this question confusing, think about how we compute sums of infinite series. This is similar: there are infinite terms in the dot product, but there happens to exist a formula to calculate their sum.
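As a concrete illustration, the RBF kernel value exp(-γ‖x_i - x_j‖²) plays the role of that infinite sum and takes only a handful of operations to compute (a sketch; the γ value below is arbitrary):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    """RBF kernel: exp(-gamma * ||xi - xj||^2).

    The corresponding feature map is infinite-dimensional, yet the dot
    product in that space collapses to this closed-form expression."""
    diff = xi - xj
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.0])))
```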
This answers the questions we asked earlier. Let's summarize:
- We typically don’t define a specific projection for our data. Instead, we pick from available kernels, tweaking them in some cases, to find one best suited to the data.
- Of course, nothing stops us from defining our own kernels, or performing the projection ourselves, but in many cases we don’t need to. Or we at least start by trying out what’s already available.
- If there is a kernel available for the projection we want, we prefer to use the kernel, because that’s often faster.
- RBF kernels can project points to infinite dimensions.