Kernel Trick

Kernel Trick - Intuition

The kernel trick is a key innovation based on the observation that many machine learning algorithms can be written in terms of dot products between training examples. For example, it can be shown that the linear function 𝜃𝑇𝒙 used by Linear SVM or Binomial Logistic Regression can be re-written as follows (a small numeric check of this equivalence follows the definitions below):

  • 𝑓(𝒙) = 𝜃𝑇𝒙
  • 𝑓(𝒙) = 𝛴1≤𝑖≤𝑛 [ 𝛼𝑖 𝒙𝑇𝒙𝑖 ]

where:

  • 𝑛 - is the number of training examples
  • 𝒙𝑖 - is the 𝑖th training example
  • 𝛼𝑖 - is a weight coefficient
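
As a minimal numeric sketch of this rewriting (NumPy, with made-up data and 𝛼 values rather than coefficients learned by an actual SVM or logistic regression): if 𝜃 is a linear combination of the training examples, 𝜃 = 𝛴 𝛼𝑖 𝒙𝑖, then both forms give the same prediction.

    import numpy as np

    rng = np.random.default_rng(0)

    X = rng.normal(size=(5, 3))   # five training examples x_i, three features each
    alpha = rng.normal(size=5)    # weight coefficients alpha_i (made up, not learned)
    x_new = rng.normal(size=3)    # a new input x

    theta = X.T @ alpha           # theta = sum_i alpha_i * x_i

    primal = theta @ x_new                 # f(x) = theta^T x
    dual = np.sum(alpha * (X @ x_new))     # f(x) = sum_i alpha_i * (x_i^T x)

    print(np.isclose(primal, dual))        # True: both forms give the same value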

We can then replace:

  • 𝒙 with the output of a feature function 𝜙(𝒙)
  • 𝒙𝑇𝒙𝑖 with a kernel function defined as a dot product in the transformed space (i.e. 𝑘(𝒙,𝒙𝑖) = 𝜙(𝒙)𝑇𝜙(𝒙𝑖))

After replacing dot products with kernel evaluations, we can make predictions using the function:

  • 𝑓(𝒙) = 𝛴1≤𝑖≤𝑛 [ 𝛼𝑖 𝑘(𝒙,𝒙𝑖) ]

The kernel-based function is exactly equivalent to preprocessing the data by applying 𝜙(𝒙) to all inputs and then learning a linear model in the newly transformed space.
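
A small sketch of this equivalence (prediction only, with fixed 𝛼 coefficients and a hypothetical feature map 𝜙 invented purely for illustration): evaluating 𝛴 𝛼𝑖 𝑘(𝒙,𝒙𝑖) gives the same value as mapping every input through 𝜙 and evaluating a linear function in the transformed space.

    import numpy as np

    def phi(x):
        # hypothetical feature map for 2-D inputs (chosen only for illustration):
        # original features, their product, and their squares
        x1, x2 = x
        return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

    def k(a, b):
        # kernel defined as a dot product in the transformed space
        return phi(a) @ phi(b)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(4, 2))    # training examples
    alpha = rng.normal(size=4)     # weight coefficients (made up, not learned)
    x_new = rng.normal(size=2)

    # kernel-based prediction: f(x) = sum_i alpha_i * k(x, x_i)
    f_kernel = sum(a_i * k(x_new, x_i) for a_i, x_i in zip(alpha, X))

    # same prediction as a linear model in the transformed space:
    # preprocess every input with phi and use theta = sum_i alpha_i * phi(x_i)
    theta = sum(a_i * phi(x_i) for a_i, x_i in zip(alpha, X))
    f_linear = theta @ phi(x_new)

    print(np.isclose(f_kernel, f_linear))   # True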

Working with a kernel function has two benefits:

  • we can learn models that are nonlinear as a function of 𝒙 using convex optimization techniques (the optimization algorithm can view the decision function as being linear in a different space)
  • the kernel function 𝑘 can be computationally more efficient than naively constructing two 𝜙(𝒙) vectors and explicitly taking their dot product (see below)

𝜙(𝒙) can even be infinite-dimensional, which would result in an infinite computational cost for the naive explicit approach. In many cases, the kernel function 𝑘(𝒙,𝒙') is a nonlinear, tractable function of 𝒙 even when 𝜙(𝒙) itself is intractable to compute.
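
The Gaussian (RBF) kernel is the standard example of this: its feature map is infinite-dimensional, yet the kernel value itself is cheap to compute. A minimal sketch (the gamma parameter and data are illustrative):

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        # Gaussian (RBF) kernel: k(x, x') = exp(-gamma * ||x - x'||^2)
        # the corresponding phi is infinite-dimensional, yet evaluating k only
        # needs one pass over the d input features
        diff = a - b
        return np.exp(-gamma * (diff @ diff))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([0.5, 1.5, 2.5])
    print(rbf_kernel(a, b))   # a single cheap scalar; phi(a) is never constructed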

Computational Efficiency of Kernel Functions

Based on: https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93

Kernels are the secret sauce that makes SVMs work.

A kernel function computes what the dot product would be if you had actually projected the data.
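
To make that concrete, here is a sketch of the standard degree-2 polynomial example (the exact projection used in the linked post may differ slightly): explicitly projecting two 2-D points and taking the dot product of the results gives the same number as evaluating the kernel 𝑘(𝒂,𝒃) = (𝒂𝑇𝒃)² directly on the original points.

    import numpy as np

    def phi(x):
        # explicit degree-2 projection of a 2-D point into 3-D:
        # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
        x1, x2 = x
        return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    def poly2_kernel(a, b):
        # the matching kernel: k(a, b) = (a^T b)^2, computed directly from the
        # original 2-D points, without ever building phi(a) or phi(b)
        return (a @ b) ** 2

    a = np.array([3.0, 4.0])
    b = np.array([1.0, 2.0])

    print(phi(a) @ phi(b))      # dot product after the explicit projection: 121.0
    print(poly2_kernel(a, b))   # same value, straight from the 2-D inputs: 121.0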