Muchang Bahng | Duke Math

In any machine learning problem, we are looking for a best function to fit some data. This function may be a regressor, a classifier (supervised learning), a projection operator, a random variable (unsupervised learning), or anything in between. But when we talk about "best," we must define two things. First, what set of functions it is best over (since otherwise we could just choose an abtrarily complex function that will interpolate the data)? Second, what do we even mean by "best"?

Let's deal with the first question. In many theoretical statistics, the three most natural function spaces to work with are Holder spaces, Soblov spaces, and RKHS. Many of these properties will be covered in a functional analysis course, but we go into more detail specifically for learning theory. Another essential tool that we will use constantly is concentration of measure, which will give us theoretical guarantees of how models learn.

The next step is to define what "best" means. This is done through a loss function, and there are generally two schools of thought on how we can construct a loss function. The first is through statistical inference, which attempts to take data and use it to infer what the underlying distribution is. This is usually an extremely difficult problem, and there are again two schools of thought on it. In frequentist inference, rather than trying to describe the entire distribution directly, we try to talk about certain parameters about the distribution (e.g. what is the mean, variance?). In Bayesian inference, we take a prior distribution and update it using Bayes rule to get our posterior. Even when the models become moderately complex, these problems are not analytically solvable, which is why we need numerical optimization or sampling methods.

The second school of thought on constructing loss functions is statistical decision theory. Rather than use inference to estimate our true distribution, we attempt to create a loss function from scratch by quantifying the cost-benefits of each term. This is useful for adjusting canonical standards (e.g. why p=0.05?) that may sometimes be arbitrary in practice.

A very popular interpretation of the loss function is the bias-variance tradeoff. Note that the tradeoff is a property of a loss function, not a model or a function space! Now that we have a loss function, we can define our risk as the expected loss over the true data generating distribution. Unfortunately, we don't know this value since we don't know the true distribution, so we can take the empirical risk and minimize that. This is the core idea of empirical risk minimization in learning theory. Ideally, we would want the true risk and the empirical risk to be pretty close together most of the time, and in statistical learning theory, we use concentration of measure to bound these. Usually, you would see that the bound is decaying with $n$ (which means that we get better approximates with more samples) and increasing with $d$ (curse of dimensionality). The difference between our estimator and the true best function (or the bounds themselves) can often be decomposed into three terms which are analogues of the bias, variance, and noise. This gives a slightly more general interpretation of the tradeoff.

Once the foundations of these two theories are established, the main applications lie in machine learning, which uses computer science to design algorithmic approaches to learning. This field can be seen as an integration of statistics and computer science, since we heavily use the theory of algorithms to optimize objective functions that are determined by statistics (e.g. greedy algorithms to fit decision trees, L1 regularization as a convex approximation of best subset regression). I've thought for a while on how to best organize these notes, since you can divide them according to different properties. I've realized that there's no right answer, but generally when I do research, there is usually a specific model or a set of similar models I have in mind (e.g. trees). Therefore, I like to divide them by the architecture of the model, and keep all the theory, properties, and algorithms in one place for reference.

With the universal approximation theorem, better engineering, and exponentially-increasing computatonal power, deep neural networks have become extremely powerful models for complex and high-dimensional data. They start out with simple multilayer perceptrons but recent research has pushed the architectures to CNNs, RNNs, LSTMs, energy models, encoder-decoders, flow models, attention layers, and most recently diffusion models. While the models are inherent black-box in nature, several heuristics and architectures have been developed to push their applications to the field of computer vision (CV) and natural language processing (NLP). The most noticeable success of these applications come in autonomous driving and large language models (LLMs).

Finally, another subfield of machine learning is called reinforcement learning, which teaches agents to make decisions through simulations involving trial and error. These models are widely used in robotics and simulations, and this field of optimizing rewards and penalties heavily relies on decision theory.

All of my personal notes are free to download, use, and distrbute under the Creative Commons "Attribution- NonCommercial-ShareAlike 4.0 International" license. Please contact me if you find any errors in my notes or have any further questions.