The Kernel Trick Explained
Analogy
Think of it like this:
A kernel is a shortcut formula:
- The “long way”: Transform to high-dimensional space, then compute dot product → expensive
- The “shortcut”: Use kernel function directly in original space → cheap
- Result is the same!
Example:
Long way:
1. Transform x=[1,2] → φ(x)=[1, 4, √2·2] (3D: [x₁², x₂², √2·x₁x₂])
2. Transform y=[3,4] → φ(y)=[9, 16, √2·12] (3D)
3. Compute φ(x)·φ(y) = 1×9 + 4×16 + (√2·2)×(√2·12) = 9 + 64 + 48 = 121
Shortcut:
k(x,y) = (x·y)² = (1×3 + 2×4)² = (11)² = 121 ✓ Same answer!
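The worked example can be checked numerically. A minimal sketch using NumPy, where `phi` is the explicit degree-2 feature map for the kernel (x·y)² on 2-D inputs:

```python
import numpy as np

def phi(v):
    # Explicit feature map for k(x, y) = (x . y)^2 on 2-D inputs:
    # [x1^2, x2^2, sqrt(2) * x1 * x2]
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

long_way = phi(x) @ phi(y)   # transform to feature space, then dot product
shortcut = (x @ y) ** 2      # kernel evaluated directly in input space

print(long_way, shortcut)    # both 121.0
```

Both routes agree exactly, which is the whole point: the kernel computes the feature-space dot product without ever building φ(x).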
Why This Matters
The kernel trick allows us to:
- Work with infinite-dimensional feature spaces without computing them explicitly
- Make otherwise intractable algorithms computationally feasible
- Get the benefits of non-linear transformations at the cost of computing a simple function
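To make the cost gap concrete: for a degree-\(d\) polynomial kernel \((x \cdot y + c)^d\) on \(n\)-dimensional inputs, the explicit feature space has \(\binom{n+d}{d}\) monomial coordinates, while the kernel itself is a single \(O(n)\) dot product plus one exponentiation. A small illustration:

```python
from math import comb

# Number of monomial features of degree <= d in n input dimensions:
# C(n + d, d). The kernel (x . y + c)^d touches all of them implicitly,
# but evaluating it costs only one n-dimensional dot product.
for n, d in [(2, 2), (100, 3), (1000, 5)]:
    dims = comb(n + d, d)
    print(f"n={n}, d={d}: explicit feature space has {dims:,} dimensions")
```

For n=1000 inputs and d=5, the explicit space has over a trillion coordinates; the kernel evaluation still touches only 1000 numbers.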
Common Kernels
Polynomial Kernel: \(k(x,y) = (x \cdot y + c)^d\)
- Computes dot product in polynomial feature space
- Degree \(d\) controls complexity
RBF (Gaussian) Kernel: \(k(x,y) = \exp(-\gamma \|x-y\|^2)\)
- Infinite-dimensional feature space
- \(\gamma\) controls the width of each point's influence
Linear Kernel: \(k(x,y) = x \cdot y\)
- Simplest case, no transformation
- Equivalent to original feature space
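Each of the three kernels above is a one-liner. A sketch (the parameter defaults `c=1.0`, `d=2`, `gamma=0.5` are illustrative choices, not canonical values):

```python
import numpy as np

def linear(x, y):
    # No transformation: the ordinary dot product
    return x @ y

def polynomial(x, y, c=1.0, d=2):
    # Dot product in the (degree <= d) polynomial feature space
    return (x @ y + c) ** d

def rbf(x, y, gamma=0.5):
    # Gaussian similarity; corresponds to an infinite-dimensional feature space
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
print(linear(x, y))       # 11.0
print(polynomial(x, y))   # (11 + 1)^2 = 144.0
print(rbf(x, x))          # identical points -> 1.0
```

Note the RBF kernel of a point with itself is always 1, and it decays toward 0 as points move apart; γ sets how fast.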
Key Insight
You never need to compute \(\phi(x)\) explicitly. The kernel function \(k(x,y)\) directly gives you \(\langle\phi(x), \phi(y)\rangle\) without ever constructing the high-dimensional vectors.
This is why SVMs and other kernel methods can work with extremely high (or even infinite) dimensional spaces efficiently!
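One way to see the "never compute φ(x)" point in practice: kernel methods only ever touch the Gram matrix \(K_{ij} = k(x_i, x_j)\). A sketch that builds an RBF Gram matrix on random data and checks it is symmetric positive semi-definite, the property that guarantees each entry really is some \(\langle\phi(x_i), \phi(x_j)\rangle\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 sample points in 3-D

gamma = 0.5
# Pairwise squared distances via broadcasting, then K[i, j] = k(x_i, x_j)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq)

# A valid kernel's Gram matrix is symmetric PSD: all eigenvalues >= 0
# (up to floating-point noise). This is what lets an algorithm treat
# K[i, j] as a feature-space inner product it never computed explicitly.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True
```

An SVM trained on this data would only ever read entries of `K`; the infinite-dimensional φ(x) for the RBF kernel never materializes anywhere.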