Kernel Trick
Kernel Trick - Intuition
The kernel trick is a key innovation based on the observation that many machine learning algorithms can be written in terms of dot products between training examples. For example, it can be shown that the linear function θ^T x used by a linear SVM or binomial logistic regression can be rewritten as:
- f(x) = θ^T x
- f(x) = Σ_{i=1}^{n} α_i x^T x_i
where:
- n - is the number of training examples
- x_i - is the i-th training example
- α_i - is a weight coefficient (the rewrite relies on the optimal θ being expressible as a linear combination of the training examples, θ = Σ_{i=1}^{n} α_i x_i)
We could replace:
- x with the output of a feature function φ(x)
- x^T x_i with a kernel function defined as a dot product, i.e. k(x, x_i) = φ(x)^T φ(x_i)
After replacing dot products with kernel evaluations, we can make predictions using the function:
- f(x) = Σ_{i=1}^{n} α_i k(x, x_i)
The kernel-based function is exactly equivalent to preprocessing the data by applying φ(x) to all inputs and then learning a linear model in the new, transformed space.
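As a minimal sketch of this kernelized prediction function (assuming the coefficients α_i have already been learned; the data and function names below are only illustrative):

```python
import numpy as np

def kernel_predict(x, X_train, alphas, kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x, x_i)."""
    return sum(a * kernel(x, x_i) for a, x_i in zip(alphas, X_train))

def linear_kernel(a, b):
    # With a linear kernel this reduces to the ordinary dot-product model theta^T x.
    return float(np.dot(a, b))

X_train = np.array([[1.0, 2.0], [3.0, 0.5]])   # toy "training examples"
alphas = np.array([0.5, -0.2])                 # toy "learned" coefficients
print(kernel_predict(np.array([1.0, 1.0]), X_train, alphas, linear_kernel))
```

Swapping `linear_kernel` for any other kernel function changes the model without touching the prediction code.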
Defining a new kernel function k allows us:
- to learn models that are nonlinear as a function of x using convex optimization techniques (the optimization algorithm can view the decision function as being linear in a different space)
- to use a kernel k that is computationally more efficient than naively constructing two φ(x) vectors and explicitly taking their dot product (see below)
φ(x) can even be infinite-dimensional, which would make the naive explicit approach infinitely expensive. In many cases, however, the kernel function k(x, x') is a tractable nonlinear function of x even when φ(x) is intractable.
Computational Efficiency of Kernel Functions
based on: https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
Kernels are the secret sauce that makes SVMs work
So far:
- for linearly separable data SVMs work amazingly well
- for data that’s almost linearly separable, SVMs can still be made to work pretty well by using the right value of 𝐶
- for data that's not linearly separable, we can project it to a space where it is perfectly/almost linearly separable, which reduces the problem to one of the first two cases and we are back in business
It looks like a big part of what makes SVMs universally applicable is projecting the data to higher dimensions. And this is where kernels come in.
A very surprising aspect of SVMs is that in all of the mathematical machinery they use, the exact projection, or even the number of dimensions, doesn't show up. You could write all of it in terms of the dot products between pairs of data points (represented as vectors). For p-dimensional vectors x_i and x_j, where the first subscript on a dimension identifies the point and the second indicates the dimension number:
- x_i = (x_{i1}, x_{i2}, ..., x_{ip})
- x_j = (x_{j1}, x_{j2}, ..., x_{jp})
The dot product is defined as:
- x_i · x_j = x_{i1}x_{j1} + x_{i2}x_{j2} + ... + x_{ip}x_{jp}
If we have n points in our dataset, the SVM needs only the dot product of each pair of points to find a classifier. Just that. This is also true when we want to project data to higher dimensions. We don’t need to provide the SVM with exact projections; we need to give it the dot product between all pairs of points in the projected space.
This is relevant because this is exactly what kernels do. A kernel, short for kernel function, takes as input two points in the original space, and directly gives us the dot product in the projected space.
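To make "the SVM only needs pairwise dot products" concrete, here is a small NumPy sketch (toy data) that builds the matrix of all pairwise dot products, usually called the Gram or kernel matrix:

```python
import numpy as np

# Toy dataset: n points in p dimensions (values are arbitrary).
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])

# Gram matrix: entry (i, j) is the dot product x_i . x_j.
# An n x n table like this is all the pairwise information the SVM needs.
G = X @ X.T
print(G)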
Let's revisit the projection we did before and see if we can come up with a corresponding kernel. We will also track the number of computations needed to perform the projection and then find the dot products, to see how using a kernel compares.
For a point x_i:
- x_i = (x_{i1}, x_{i2})
Our corresponding projected point was:
- X_i = (x_{i1}^2, x_{i2}^2, √2·x_{i1}x_{i2})
To compute this projection we need to perform the following operations:
- To get the new first dimension: 1 multiplication
- Second dimension: 1 multiplication
- Third dimension: 2 multiplications
In all, 1+1+2 = 4 multiplications.
The dot product in the new space is:
- X_i · X_j = X_{i1}X_{j1} + X_{i2}X_{j2} + X_{i3}X_{j3}
To compute this dot product for two points i and j, we need to compute their projections first. So that’s 4+4 = 8 multiplications, and then the dot product itself requires 3 multiplications and 2 additions.
In all, that’s:
- Multiplications: 8 (for the projections) + 3 (in the dot product) = 11 multiplications
- Additions: 2 (in the dot product)
Which is total of 11 + 2 = 13 operations.
I claim this kernel function gives me the same result:
- K(x_i, x_j) = (x_i · x_j)^2
We take the dot product of the vectors in the original space first, and then square the result.
Let's expand it out and check whether my claim is indeed true:
- K(x_i, x_j) = (x_i · x_j)^2
- K(x_i, x_j) = (x_{i1}x_{j1} + x_{i2}x_{j2})^2
- K(x_i, x_j) = x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 + 2 x_{i1}x_{j1}x_{i2}x_{j2}
- K(x_i, x_j) = (x_{i1}^2, x_{i2}^2, √2·x_{i1}x_{i2}) · (x_{j1}^2, x_{j2}^2, √2·x_{j1}x_{j2})
It is. How many operations does this need? Look at the second line above: to compute the dot product in the original two-dimensional space we need 2 multiplications and 1 addition. Squaring it is one more multiplication.
So, in all:
- Multiplications: 2 (for the dot product in the original space) + 1 (for squaring the result) = 3 multiplications
- Additions: 1 (for the dot product in the original space)
A total of 3 + 1 = 4 operations. This is only 31% of the operations we needed before.
It looks like it is faster to use a kernel function to compute the dot products we need. It might not seem like a big deal here: we're looking at 4 vs 13 operations. But with higher-dimensional input points, and a projected space with an even higher number of dimensions, the computational savings on a large dataset add up incredibly fast. So that's one huge advantage of using kernels.
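Here is a quick numerical check of that equivalence, as a NumPy sketch with arbitrary sample points:

```python
import numpy as np

def phi(x):
    """Explicit projection (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(xi, xj):
    """Degree-2 polynomial kernel: the squared dot product in the original space."""
    return np.dot(xi, xj) ** 2

xi, xj = np.array([3.0, 1.0]), np.array([2.0, -1.0])
print(np.dot(phi(xi), phi(xj)))  # project explicitly, then take the dot product
print(k(xi, xj))                 # kernel evaluation: same value (up to rounding), fewer operations
```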
Most SVM libraries already come pre-packaged with some popular kernels like the Polynomial, Radial Basis Function (RBF), and Sigmoid kernels. When we don't use a projection and simply compute the dot products in the original space, this is referred to as using the linear kernel.
Many of these kernels give you additional levers to further tune them for your data. For example, the polynomial kernel:
- K(x_i, x_j) = (x_i · x_j + c)^d
allows you to pick the values of c and d (the degree of the polynomial). For the 3D projection above, I had used a polynomial kernel with c = 0 and d = 2.
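As a sketch of how this maps onto a concrete library: scikit-learn's polynomial kernel is (gamma · x_i·x_j + coef0)^degree, so degree plays the role of d and coef0 the role of c. The toy data and parameter values below are only illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy XOR-style data: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# With gamma=1, coef0 corresponds to c and degree to d in the formula above.
clf = SVC(kernel="poly", degree=2, coef0=0.0, gamma=1.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))
```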
But we are not done with the awesomeness of kernels yet!
Remember I mentioned projecting to infinite dimensions a while back? If you haven’t already guessed, the way to make it work is to have the right kernel function. That way, we really don’t have to project the input data, or worry about storing infinite dimensions.
A kernel function computes what the dot product would be if you had actually projected the data.
The RBF kernel is commonly used for a specific infinite-dimensional projection. We won't go into the math of it here, but see the source article linked above for details.
How can we have infinite dimensions but still compute the dot product? If you find this question confusing, think about how we compute sums of infinite series. This is similar: there are infinite terms in the dot product, but there happens to exist a formula to calculate their sum.
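As a concrete illustration, the RBF kernel value exp(-γ‖x_i - x_j‖²) plays the role of that infinite sum and takes only a handful of operations to compute (a sketch; the γ value below is arbitrary):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    """RBF kernel: exp(-gamma * ||xi - xj||^2).

    The corresponding feature map is infinite-dimensional, yet the dot
    product in that space collapses to this closed-form expression."""
    diff = xi - xj
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.0])))
```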
This answers the questions we asked earlier. Let's summarize:
- We typically don’t define a specific projection for our data. Instead, we pick from available kernels, tweaking them in some cases, to find one best suited to the data.
- Of course, nothing stops us from defining our own kernels, or performing the projection ourselves, but in many cases we don’t need to. Or we at least start by trying out what’s already available.
- If there is a kernel available for the projection we want, we prefer to use the kernel, because that’s often faster.
- RBF kernels can project points to infinite dimensions.