This is a live blog of Lecture 6 (part 2 of 3) of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents for this series is here.
Yesterday, we described the model for generalization. Today, we’ll look at three paths to prove machine learning algorithms generalize. Let’s first remind ourselves of the setup.
We assume that data points are sampled independently from the same probability distribution. We’d like to estimate the probability that an algorithm trained on n such samples makes an error when tested on sample n+1. We call the error measured on the n samples the empirical error. We call the error on the unseen sample n+1 the population error. Can we prove the empirical error and population error are close?
Path 1 - Uniform Convergence
The law of large numbers says that sample averages are good approximations of expected values, and these approximations can be made quantitative. For example, Hoeffding’s Inequality says that the probability that a random sample of n 1s and 0s has a mean more than t away from the distribution’s expected value is less than

$$2e^{-2nt^2}.$$
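Hoeffding’s bound here is $2e^{-2nt^2}$. As a quick sanity check (a sketch with hypothetical parameters, not from the lecture), we can compare the bound against a Monte Carlo estimate of the deviation probability:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, p = 1000, 0.05, 0.5     # sample size, deviation, coin bias (all assumed)
trials = 20_000

# trials independent samples of n coin flips each
samples = rng.random((trials, n)) < p
empirical_prob = (np.abs(samples.mean(axis=1) - p) > t).mean()

hoeffding_bound = 2 * np.exp(-2 * n * t**2)   # = 2e^{-2nt^2}
print(empirical_prob, hoeffding_bound)        # the estimate sits below the bound
```

The empirical deviation probability comes out well under the bound; Hoeffding is loose but, crucially, dimension-free.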
Let’s apply this to generalization. Last time, we said that the empirical risk of a prediction function, the average error on the training set, is a sample average. It is approximating the expected error on the unseen data.
Let’s say we have a set of K candidate prediction functions. The probability that any one of these K functions has an empirical error more than t away from its population error is at most

$$2Ke^{-2nt^2}.$$
This formula is a consequence of the union bound in probability: The probability event A happening or event B happening is at most the probability of event A happening plus the probability of event B happening. In this case, the bad events we are bounding are “drawing an unlucky sample so that the empirical error on one function differs significantly from the population error.” Since there are K functions, we just multiply Hoeffding’s bound by K.
Now, let’s stare at that probability expression. The probability decreases exponentially with the number of samples. But it increases only linearly with the number of functions. This imbalance works in favor of throwing spaghetti at the wall. If we have a million examples, a small data set by today’s standards, how many functions can we evaluate and be sure that our error estimates are precise to 1%? The answer is over $10^{80}$, more than the number of atoms in the universe.
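To back out that count, solve the union bound for K: we need $2Ke^{-2nt^2} \le \delta$, so $K \le \frac{\delta}{2}e^{2nt^2}$. A few lines of arithmetic, with a hypothetical failure probability $\delta = 0.01$:

```python
import math

n, t, delta = 1_000_000, 0.01, 0.01   # samples, precision, failure prob (delta assumed)

# 2*K*exp(-2*n*t^2) <= delta  =>  log10(K) <= log10(delta/2) + 2*n*t^2*log10(e)
log10_K = math.log10(delta / 2) + 2 * n * t**2 * math.log10(math.e)
print(log10_K)   # roughly 84, i.e., K can be about 10^84 functions
```

So with a million samples you can union-bound over roughly $10^{84}$ functions, comfortably past the $10^{80}$ atoms in the universe.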
This even suggests an algorithm we can run! We can try every one of our functions and take the one that has smallest error on the training set. This is more or less what we do when we minimize some loss function.
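A minimal sketch of that algorithm on synthetic data: the candidate functions (random hyperplanes here, my choice for illustration) are fixed before any data is drawn, and we simply pick the one with the smallest training error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: the label is the sign of the first coordinate
n, d, K = 500, 5, 200
X = rng.standard_normal((n, d))
y = (X[:, 0] > 0).astype(int)

# K candidate classifiers fixed *before* seeing data: random hyperplanes
W = rng.standard_normal((K, d))
preds = (X @ W.T > 0).astype(int)            # shape (n, K): each column is one candidate
train_errors = (preds != y[:, None]).mean(axis=0)

best = np.argmin(train_errors)               # empirical risk minimization over the K candidates
print(best, train_errors[best])
```

The union bound then guarantees that the winner’s training error is within t of its population error, provided K isn’t astronomically large relative to n.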
So what’s wrong with this? The problem is that you have to pick these functions in a data-independent manner. You have to write down all the possible functions in advance, before you ever see any data. But this makes life hard even for simple models. Think about the number of linear classifiers you could try on some data. The number of 100-dimensional linear classifiers with weights equal to 0 or 1 is $2^{100}$, about $10^{30}$; by 300 dimensions, the count exceeds the $10^{80}$ I mentioned to impress you a paragraph ago. This union bound suffers from a curse of dimensionality, and, as I’ll describe tomorrow, the predictions of uniform convergence almost never match what we see in practice.
Path 2 - Sequential Prediction
We discussed how the Perceptron and stochastic gradient descent, when searching for linear classifiers, satisfy sequential prediction bounds. We algorithmically generated a sequence of prediction functions, where $f_t$ was built using the first $t$ data points. The sequential bound controlled the average error on the next data point and showed it converged to the sample error of the best possible decision rule:

$$\frac{1}{T}\sum_{t=1}^{T} \mathrm{loss}(f_{t-1}(x_t), y_t) \leq \min_{f} \frac{1}{T}\sum_{t=1}^{T} \mathrm{loss}(f(x_t), y_t) + B_T\,,$$

where $B_T$ is a sequence that goes to zero as $T$ grows.
If you can prove one of these bounds, you immediately get a bound on generalization. The sequential prediction bound holds for any data sequence. In particular, it must hold when each data point is sampled independently from the same distribution. In this case, the expected loss of $f_t$ on the next data point is exactly the population error of a prediction function built from $t$ samples. The inequality then says the expected error after $t$ samples converges to the error of the best rule you could have come up with on the population.
The analysis I describe here is called online-to-batch conversion. Any guarantee about an online algorithm becomes a generalization bound. The downside of this approach is that it requires you to prove a sequential prediction bound, and even for linear models, that can be a pain.
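To make the sequential picture concrete, here is a sketch of the Perceptron run as a sequential predictor on a synthetic stream. The dimensions, the margin filter, and the data model are my assumptions for illustration, not from the lecture; the point is that the running average of online mistakes is exactly the quantity the sequential bound controls.

```python
import numpy as np

rng = np.random.default_rng(2)

# A linearly separable stream with margin: y = sign(w* . x)
d, T = 10, 2000
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((T, d))
y = np.sign(X @ w_star)
keep = np.abs(X @ w_star) > 1.0        # keep only points with margin > 1
X, y = X[keep], y[keep]

# Run the Perceptron as a sequential predictor: f_t is evaluated on point t+1
w = np.zeros(d)
mistakes = 0
for x_t, y_t in zip(X, y):
    if y_t * (w @ x_t) <= 0:           # online mistake on the next point
        mistakes += 1
        w += y_t * x_t                 # Perceptron update
avg_online_error = mistakes / len(X)
print(avg_online_error)                # small: this is what online-to-batch averages
```

The mistake bound caps the total number of errors regardless of the order of the stream, so the average online error shrinks as the stream grows; online-to-batch converts that average into a generalization guarantee when the stream is i.i.d.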
Path 3 - Leave-One-Out Approximations
Suppose we have any algorithm that takes a data set and gives us a prediction function. If the data is i.i.d., then the population error equals the expected leave-one-out error. That is, the population error is the chance you make a mistake on data point n+1 after training on n data points. For i.i.d. data, it is also the chance you make a mistake on, say, data point 3 after fitting a prediction function on all of the first n+1 samples except data point 3. The same is true for every other point: i.i.d. means the order doesn’t matter. The population error is therefore equal to the expected average empirical error on each point when it is left out. This identity is the idea behind why we do cross-validation.
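A Monte Carlo sketch of that identity, using a hypothetical nearest-centroid classifier on synthetic two-cluster data (my choice of algorithm and distribution, purely for illustration): the expected leave-one-out error on n+1 samples and the population error after training on n samples should agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(3)

def nearest_centroid_error(train_X, train_y, test_X, test_y):
    # Classify each test point by the closer class mean
    mu0 = train_X[train_y == 0].mean(axis=0)
    mu1 = train_X[train_y == 1].mean(axis=0)
    pred = (np.linalg.norm(test_X - mu1, axis=1)
            < np.linalg.norm(test_X - mu0, axis=1)).astype(int)
    return (pred != test_y).mean()

def sample(n):
    y = rng.integers(0, 2, n)
    X = rng.standard_normal((n, 2)) + 2.0 * y[:, None]   # two shifted clusters
    return X, y

n, trials = 30, 400

# Expected leave-one-out error on n+1 samples ...
loo = 0.0
for _ in range(trials):
    X, y = sample(n + 1)
    for i in range(n + 1):
        mask = np.arange(n + 1) != i
        loo += nearest_centroid_error(X[mask], y[mask], X[i:i+1], y[i:i+1])
loo /= trials * (n + 1)

# ... matches the population error of training on n samples
pop = 0.0
for _ in range(trials):
    X, y = sample(n + 1)
    pop += nearest_centroid_error(X[:n], y[:n], X[n:], y[n:])
pop /= trials
print(loo, pop)   # the two estimates agree up to Monte Carlo noise
```

Both estimates hover around the same value, which is exactly what the i.i.d. exchangeability argument predicts.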
But you can use the leave-one-out error to generate generalization bounds too. Roughly speaking, if you can show that the model you get after training with all of your data but one point is pretty close to the model you’d get training with all of your data, then a leave-one-out argument will tell you that you generalize. Algorithms that are robust to dropping or resampling a data point are called stable. In Chapter 2 of PPA, we present a very elegant leave-one-out argument by Vapnik and Chervonenkis that proves the Perceptron generalizes. And in Chapter 6, we expand this idea to more general learning algorithms.
I find stability arguments hardest to argue against. Our machine learning methods should be robust to throwing away one data point! Of course, whether you can quantify how robust they are and use this to guide practice is a different story. Tomorrow, I’ll wrap up this series with commentary about how even the sum of the three paths does not suffice to tell the whole story of why machine learning works.