# A Kernel of Truth

Let’s be honest. Machine learning appears to be producing astounding results at an absurd rate; reinforcement learning is being used to achieve super-human game-play, generative adversarial networks can create hyper-realistic images and sequence-to-sequence models are able to synthesise voices from (metaphorical) thin air. It is thus perhaps unsurprising that kernel methods were not high in priority on my list of machine learning techniques to explore. Upon reflection, however, their theoretical soundness render them a perfect counter-foil to what is a seemingly daunting and scarcely penetrable facade of deep learning.

Over the years, kernel methods have been very well researched. This copious amount of attention has naturally led to a considerable number of articles extolling its virtues. However, in amongst this plethora of material, it is easy to lose sight of the sheer simplicity of the idea. This post aims to highlight the elegance of the underlying principles and to present just why it is that they enable modellers to do something really quite exciting!

Having (hopefully) enticed the reader with that teaser, we take a leaf out of the book of many a good story-teller and step back to the beginning; to one of the most fundamental of machine learning problems - linear regression.

## The Setup

The problem statement of linear regression is one that is simple enough: Given a training set $S = \{(\mathbf{x}_{n}, y_{n})\}_{n=1}^{N}$ consisting of points $\mathbf{x}_{n} \in \mathbb{R}^{k}$ and labels $y_{n} \in \mathbb{R}$, find a hyper-plane in $\mathbb{R}^{k}$ that minimises some objective function such as mean squared error.

Formally, this can be stated as the problem of finding a vector $\mathbf{w}$ such that a linear function $f: \mathbb{R}^{k} \to \mathbb{R}$ defined as $f(\mathbf{x}) = \sum_{i=1}^{k}w_{i}x_{i}$ minimises the mean squared objective $\sum_{n=1}^{N} \bigg( y_{n} - f(\mathbf{x}_{n})\bigg)^{2}$.1

For notational convenience (even though it may not look like it now, it does simplify things down the road!), we can reformulate the problem as follows: if we construct a matrix $\mathbf{X}$ consisting of a different data point $\mathbf{x}_{n}$s in each row, the above problem may be rephrased as the search for a vector $\mathbf{w}$ that minimises the loss $\mathcal{L}(\mathbf{w}, \mathbf{X}, \mathbf{y}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^{2}$ where $\|.\|$ denotes the $L_{2}$ norm of a vector.

Notice our choice of article in referring to ‘a’ vector $\mathbf{w}$ satisfying certain conditions, rather than ‘the’ such vector. The reasoning for this is simple: even if we are guaranteed the existence of such a $\mathbf{w}$, there is no guarantee of its uniqueness. To see why, consider the case when $n < k$. Observe then, that the minima of $\mathcal{L}(\mathbf{w}, \mathbf{X}, \mathbf{y}) = 0$ can be achieved by simply solving a linear system of $n$ equations for $k$ unknowns. Linear algebra then tells us that there are infact infinitely many such solutions! This challenge motivates what is known as ridge regression. In ridge regression, we strive to find ‘the’ vector $\mathbf{w}$ with the smallest norm which minimises the mean squared error. Formally, we have modified our loss to be $\mathcal{L}_{\lambda}(\mathbf{w}, \mathbf{X}, \mathbf{y}) = \lambda \|\mathbf{w}\|^{2} + \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^{2}, \quad \lambda > 0$

To understand why this has a unique solution, we differentiate $\mathcal{L}_{\lambda}(\mathbf{w}, \mathbf{X}, \mathbf{y})$ with respect to $\mathbf{w}$ and equate it to $0$ (a necessary condition to satisfy the minimisation problem) to obtain the equation $\bigg( \mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}_{k} \bigg) \mathbf{w} = \mathbf{X}^{T} \mathbf{y}$ Thus, we can see that this equation has a unique solution if and only if $\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}_{k}$ is invertible. As it turns out, this is indeed the case for $\lambda > 0$!

Computing such a $\mathbf{w} = \big( \mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}_{k} \big)^{-1} \mathbf{X}^{T} \mathbf{y}$ involves the inversion of a $k \times k$ matrix whose computational complexity is $\mathcal{O}(k^{3})$. Surely, that seems like a reasonable cost. At least it is not cubic in the number of data points, $N$, which is typically much larger than $k$!

## The Rise

At this juncture, one might be inclined to rest on our laurels and bask in the uniqueness of this magical vector $\mathbf{w}$. Your reverie should soon be broken by the dawning realisation that, while it is commendable that we managed to address so many limitations of linear regression, in reality, how often does a plane fit our data even remotely well? Rarely, if ever! So if we required an entire section (plus footnotes) to introduce linear regression, the thought of generalising this to the non-linear setting might be inducing that sinking feeling in the pit of your stomach.

However, as the title for this section suggests, now is not the time for despair! Consider the following argument: Suppose you have a mapping $\phi: \mathbb{R}^{k} \to \mathbb{R}^{K}$ which transforms the original $k$ features into $K$ new features such that the non-linear relations in your data are converted into linear ones. Now, if you proceed to perform a linear regression on this transformed dataset $\tilde{S} = \{(\phi(\mathbf{x}_{n}), y_{n})\}_{n=1}^{N}$, you would end up finding a hyper-plane in $\phi(\mathbb{R}^{k})$, which in turn corresponds to a non-linear curve fit to our data! Effectively, what this means is that given a new test point $\mathbf{x}^{*} \in \mathbb{R}^{k}$, we can simply map it (via $\phi$) to $\mathbb{R}^{K}$ and then use our linear regression prediction. So basically we ended up getting non-linear regression for free! Right? RIGHT?

## The Fall

Well, not quite. While it is true that we could simply “perform linear regression on $\tilde{S}$”, what was brushed under the carpet was the fact that the linear regression is now happening in $K$-dimensional space and as such, could have a computational complexity of $\mathcal{O}(K^{3})$. For all but the most trivial of mappings, $K >> k$ which would pose a challenge.

Consider, for example, the case of $k = 2$ and suppose we wanted to extend our list of potential regressors to all degree-$2$ polynomials. One corresponding mapping for a data point $(x, y) \in \mathbb{R}^{2}$ would be $\phi(x, y) = (1, x, y, x^{2}, y^{2}, xy) \in \mathbb{R}^{6}$ which already represents a jump by a factor of 3!

Furthermore, it may not be at all apparent what the type of underlying non-linearity is in any given dataset which then prompts questions like “How does one go about choosing a mapping $\phi$?” and existential dilemmas like “Why must everything be so infernally hard?“.

## The Rally

It is at this point in the story that we must pick ourselves up and return to the drawing board, doubly determined to find something we previously missed. To this endeavour, we rewrite our expression for $\mathbf{w}$ as follows:

\begin{aligned} \mathbf{w} & = \lambda^{-1} \mathbf{X}^{T}\big( \mathbf{y} - \mathbf{X}\mathbf{w}\big) \\ & = \mathbf{X}^{T} \mathbf{\alpha} \end{aligned}

where $\alpha = \lambda^{-1} \big( \mathbf{y} - \mathbf{X}\mathbf{w} \big)$.

Now admittedly, this does not seem to be helpful: after all, we have an expression for $\mathbf{w}$ which in turn depends upon $\mathbf{w}$! What this does show, however, is that $\mathbf{w}$ can be written as a linear combination of our data points (recall that we simply stacked them in rows to form $\mathbf{X}$) with coefficients given by $\mathbf{\alpha}$. So if we could somehow compute $\mathbf{\alpha}$, we would be in a better position. Let us try and do precisely that!

\begin{aligned} \lambda \mathbf{\alpha} & = \big( \mathbf{y} - \mathbf{X}\mathbf{w} \big) \\ & \implies \lambda \mathbf{\alpha} = (\mathbf{y} - \mathbf{X}\mathbf{X}^{T}\mathbf{\alpha}) \\ & \implies \big( \mathbf{X}\mathbf{X}^{T} + \lambda \mathbf{I}_{N}\big) \mathbf{\alpha} = \mathbf{y} \\ & \implies \mathbf{\alpha} = \big( \mathbf{X}\mathbf{X}^{T} + \lambda \mathbf{I}_{N}\big)^{-1} \mathbf{y} \end{aligned}

This suggests that the computational complexity of computing $\mathbf{\alpha}$ is now simply $\mathcal{O}(N^{3})$! In fact, as before, precisely the same argument shows that if we apply a feature map $\phi$ on each row of our data (and denote the resulting matrix by $\phi(\mathbf{X})$), we have $\mathbf{\alpha} = \big( \phi(\mathbf{X})\phi(\mathbf{X})^{T} + \lambda \mathbf{I}_{N}\big)^{-1} \mathbf{y}$ Thus if our feature maps are in fact such that $K > N$, then this would end up being significantly more efficient!

The above reformulation (into what is known as the dual problem) might seem odd at the outset. After all, what sort of convoluted feature maps have dimensionality greater than the number of data points?! Those of you well versed with measuring computational cost may have even noticed that while our focus has been primarily on the computational cost of inverting the resultant $N \times N$ matrix, computing this matrix itself entails evaluating the matrix $\mathbf{\phi(X)} \mathbf{\phi(X)}^{T}$. The computational cost of this actually turns out to be $\mathcal{O}(N^{2}K)$, so if indeed $K >> N$, then, for any reasonably sized dataset, is this really that significant an improvement as we had hoped?

It would seem odd-er still that in a post ostensibly about kernel methods, the word has not been mentioned once since the introduction. Well, your wait has ended!

A function $\kappa$ is called a kernel if there exists a feature mapping $\phi$, such that for all pairs of data points $\mathbf{x}_{n}, \mathbf{x}_{m}$, $\kappa(\mathbf{x}_{n}, \mathbf{x}_{m}) = \langle \phi(\mathbf{x}_{n}), \phi(\mathbf{x}_{m})\rangle$ where $\langle.,.\rangle$ denotes an inner product.

In our reformulation above, notice the structure of the matrix $\phi(\mathbf{X}) \phi(\mathbf{X})^{T}$, often called the Gram Matrix. This is nothing but the matrix of inner products of all pairs $\phi(\mathbf{x}_{n})$ and $\phi(\mathbf{x}_{m})$, i.e., the entry in the second row, third column of the Gram matrix is simply $\langle \phi(\mathbf{x}_{2}), \phi(\mathbf{x}_{3})\rangle$. “Alright, so we can consider each entry of the Gram matrix as the output of a function that we call a kernel - great. Now what?” I hear you ask. Well, what if we could evaluate the kernel without having to explicitly construct the feature vectors $\phi(\mathbf{x}_{n})$ and $\phi(\mathbf{x}_{m})$? Seems strange? Let me elaborate.

Consider the feature map $\phi: \mathbb{R}^{2} \to \mathbb{R}^{3}$ defined as $\phi(a, b) = (a^{2}, b^{2}, \sqrt{2}ab)$

Now, the corresponding kernel is given by

\begin{aligned} \kappa(\mathbf{w}, \mathbf{z}) & = \kappa((w_{1}, w_{2}), (z_{1}, z_{2})) \\ &= \langle \phi(\mathbf{w}), \phi(\mathbf{z}) \rangle\\ & = w_{1}^{2}z_{1}^{2} + w_{2}^{2}z_{2}^{2} + 2 w_{1} w_{2} z_{1} z_{2}\\ & = \big( w_{1}z_{1} + w_{2}z_{2}\big)^{2}\\ & = \big(\langle \mathbf{w}, \mathbf{z} \rangle \big)^{2} \end{aligned}

The above computation shows that we no longer need to even construct the $K$ (in this case = 3) dimensional vector, let alone compute dot products between them! Our computational cost then turns out to be $\mathcal{O}(N^{2}k)$ which is much better! This seemingly magical feat of being able to compute the dot product without ever even constructing the vector is often referred to as the kernel trick.

## The Glory

Our discussion above would be a reasonable place to conclude matters: We have been able to, in a reasonably efficient manner, execute (kernelised) non-linear regression. However, where would we be without the saccharin-sweet Hollywood ending? Kernel methods do not (or do, depending on your taste in movies) disappoint.

In the preceding section, we approached the kernel trick assuming knowledge of a given $\phi$ and highlighted how computational savings can be made via a kernel. We could also invert this process to obtain the following alternative characterisation of kernels: Any (continuous) function $\kappa: X \times X \to \mathbb{R}$ is a kernel (i.e., has a corresponding feature map, $\phi$) if and only if it is positive semi-definite.^2 loosely speaking, this can be interpreted as saying, any function satisfying a certain (fairly generous) condition is a kernel and intrinsically corresponds to a feature map. This further reinforces the idea that we need not explicitly construct the feature map and compute the inner product. In fact, this theorem goes one step further to imply that we do not even need to know the underlying feature map! Simply the fact that one such exists is sufficient for all the above theory to hold!

We have time enough for one final, riding-into-the-sunset moment: Since we no longer need to construct the feature map or perform operations on it, could the feature map send our data points into… an infinite dimensional space? YES! There are a number of kernels which correspond to infinite-dimensional feature maps, which thereby provide the algorithm with immense modelling power! One popular such kernel, the Gaussian kernel, defined as $\kappa(\mathbf{x}, \mathbf{y}) = e^{-\gamma \| \mathbf{x} - \mathbf{y}\|^{2}}$ is immensely powerful and cannot be represented by a finite dimensional feature map! And so it is, that with this simple kernel trick, we can now effectively carry out non-linear regression with a highly convoluted non-linearity without ever having to depart from the realm of linear regression! Neat, right? It is precisely this kernel trick that enabled Support Vector Machines (SVMs) to achieve stellar results in myriad fields, ranging from image classification to natural language processing.

## The Spin-off

There is one thread in our story that remains unresolved: the question of how to go about choosing a suitable non-linearity given a problem setting. The notion of a kernel abstracts the idea of a feature map away from us to some extent, but the theory suggests that there does remain a feature map underlying it all.

One approach is to choose a class of feature maps and then, in some sense, average results over all the feature maps in that class. These approaches elegantly extend the above ideas and give rise to what is known as Gaussian Process Regression. And with that tantalising glimpse into what lies ahead, we bid adieu and hope to see you again for our future posts!

## The Credits

There are innumerable excellent resources that delve into the minutiae of kernel methods. However, my personal favourite, and one that a lot of this post is based on, is Kernel Methods for Pattern Analysis by John Shawe Taylor. The entire book is very well structured and easy to read, but in particular chapters 1 and 2 motivate these approaches seamlessly and are more rigorous with some of the technical nuances that we have ommited for the sake of brevity.

1. Observe here that we have discarded the bias/intercept term. This is simply because we can augment the vector as $\tilde{\mathbf{x}}_{n} = (\mathbf{x_{n}}, 1) \in \mathbb{R}^{k+1}$ and recover the bias term using our above formulation.

2. A function $\kappa: \mathbf{X} \times \mathbf{X} \to \mathbb{R}$ is called positive semi definite if, for any finite set of points $\mathbf{x}_{1}, ..., \mathbf{x}_{n}$, the matrix $K$ whose $(i, j)$th entry is given by $\kappa(\mathbf{x}_{i}, \mathbf{x}_{j})$, is positive semi-definite.