Muchang Bahng

Statistics

In any machine learning problem, we are looking for a best function to fit some data. This function may be a regressor, a classifier (supervised learning), a projection operator, a random variable (unsupervised learning), or anything in between. But when we talk about "best," we must define two things. First, what set of functions it is best over (since otherwise we could just choose an abtrarily complex function that will interpolate the data)? Second, what do we even mean by "best"?

Let's deal with the first question. In many theoretical statistics, the three most natural function spaces to work with are Holder spaces, Soblov spaces, and RKHS. Many of these properties will be covered in a functional analysis course, but we go into more detail specifically for learning theory. Another essential tool that we will use constantly is concentration of measure, which will give us theoretical guarantees of how models learn.

The next step is to define what "best" means. This is done through a loss function, and there are generally two schools of thought on how we can construct a loss function. The first is through statistical inference, which attempts to take data and use it to infer what the underlying distribution is. This is usually an extremely difficult problem, and there are again two schools of thought on it. In frequentist inference, rather than trying to describe the entire distribution directly, we try to talk about certain parameters about the distribution (e.g. what is the mean, variance?). In Bayesian inference, we take a prior distribution and update it using Bayes rule to get our posterior. Even when the models become moderately complex, these problems are not analytically solvable, which is why we need numerical optimization or sampling methods.

The second school of thought on constructing loss functions is statistical decision theory. Rather than use inference to estimate our true distribution, we attempt to create a loss function from scratch by quantifying the cost-benefits of each term. This is useful for adjusting canonical standards (e.g. why p=0.05?) that may sometimes be arbitrary in practice.

A very popular interpretation of the loss function is the bias-variance tradeoff. Note that the tradeoff is a property of a loss function, not a model or a function space! Now that we have a loss function, we can define our risk as the expected loss over the true data generating distribution. Unfortunately, we don't know this value since we don't know the true distribution, so we can take the empirical risk and minimize that. This is the core idea of empirical risk minimization in learning theory. Ideally, we would want the true risk and the empirical risk to be pretty close together most of the time, and in statistical learning theory, we use concentration of measure to bound these. Usually, you would see that the bound is decaying with $n$ (which means that we get better approximates with more samples) and increasing with $d$ (curse of dimensionality). The difference between our estimator and the true best function (or the bounds themselves) can often be decomposed into three terms which are analogues of the bias, variance, and noise. This gives a slightly more general interpretation of the tradeoff.

Once the foundations of these two theories are established, the main applications lie in machine learning, which uses computer science to design algorithmic approaches to learning. This field can be seen as an integration of statistics and computer science, since we heavily use the theory of algorithms to optimize objective functions that are determined by statistics (e.g. greedy algorithms to fit decision trees, L1 regularization as a convex approximation of best subset regression). I've thought for a while on how to best organize these notes, since you can divide them according to different properties. I've realized that there's no right answer, but generally when I do research, there is usually a specific model or a set of similar models I have in mind (e.g. trees). Therefore, I like to divide them by the architecture of the model, and keep all the theory, properties, and algorithms in one place for reference.

With the universal approximation theorem, better engineering, and exponentially-increasing computatonal power, deep neural networks have become extremely powerful models for complex and high-dimensional data. They start out with simple multilayer perceptrons but recent research has pushed the architectures to CNNs, RNNs, LSTMs, energy models, encoder-decoders, flow models, attention layers, and most recently diffusion models. While the models are inherent black-box in nature, several heuristics and architectures have been developed to push their applications to the field of computer vision (CV) and natural language processing (NLP). The most noticeable success of these applications come in autonomous driving and large language models (LLMs).

Finally, another subfield of machine learning is called reinforcement learning, which teaches agents to make decisions through simulations involving trial and error. These models are widely used in robotics and simulations, and this field of optimizing rewards and penalties heavily relies on decision theory.

All of my personal notes are free to download, use, and distrbute under the Creative Commons "Attribution- NonCommercial-ShareAlike 4.0 International" license. Please contact me if you find any errors in my notes or have any further questions.

Information Theory

  • Entropy: KL-Divergence
  • Compression: Symbol Codes, Huffman

Function Spaces

  • Sobelov Spaces:
  • Holder Spaces:
  • Reproducing Kernel Hilbert Spaces:

Concentration of Measure

  • Hoeffding: Chernoff Bounds, Subgaussian variables
  • Talagrand, Bernstein

Frequentist Inference

  • Goals: Point Estimation, Confidence Sets, Hypothesis Tests
  • Point Estimators: Method of Moments, Maximum Likelihood (MLE)
  • Bias Variance Tradeoff: MSE, MAE, Regularization
  • Asymptotic Analysis

Bayesian Inference

  • Goals: Point Estimation, Credible Intervals, Posterior Inference
  • Notation: Priors, Posteriors, Conjugate Distributions
  • Point Estimators: Posterior Means, Maximum-a-Posteriori (MAP)

Sampling and Integration

  • MCMC: Metropolis-Hastings, Hamiltonian Monte Carlo, Gibbs
  • Langevin: SGLD, MALA
  • Practical: Adaptive Methods, Preconditioning

Learning Theory

  • Empirical Risk Minimization:

Linear Regression

  • Ordinary Least Squares: MSE Loss, Bias-Variance Decomposition, Least-Squares Solution, Gauss-Markov Theorem, Likelihood Estimation with Gaussian residuals
  • Significance Tests, Confidence Sets
  • Regularization: Ridge, Best Subset, Greedy Stepwise, Lasso
  • Robust: Mean Sb
  • Bayesian: Regularization with Priors
  • GLMs: Link Functions, Canonical Link Functions (To be moved)

Linear Classification

  • Perceptron:
  • Logistic/Softmax:
  • Support Vector Machines:
  • LDA:

Time Series

  • Stationarity:
  • Autoregressive:

Trees

  • Decision Trees: Classification/Regression Trees, Model Space
  • Greedy Optimization: ID3 with Information Gain, CART with Gini Reduction, c4.5
  • Improved Optimization: GOSDT
  • Soft Decision Trees: Soft Splitting, Neural Trees

Ensembles

  • Resampling: Jackknife, Bootstrap
  • Boosting: Adaboost, Gradient Boosting
  • Bagging: Random Forests

Factors and Components

  • Factor Analysis:
  • Principal Componenet Analysis: Static, Dynamic, Functional, Robust, Sparse
  • Independent Component Analysis:

Clustering

  • K Means:
  • Isomap:
  • Multidimensional Scaling (MDS)
  • tSNE:
  • Uniform Manifold Approximation and Projection (UMAP):

Latent Variable Models

  • Gaussian Mixture Models
  • Variational Bayes, EM Algorithm

Graphical Models

  • Graphical Distributions: Bayesian Networks (Directed), Markov Random Fields (Undirected)

Deep Learning

  • Multilayer Perceptron: Activation Functions, Preprocessing, Weight Initialization, Weight Space Symmetries
  • Network Training: Automatic Differentiation, Forward/Back Propagation, Numpy Implementation, PyTorch,
  • Regularization and Stability: Early Stopping, Dropout, L1/L2 Penalty Terms, Max Norm Regularization, Normalization Layers, Data Augmentation, Sharpness Awareness Maximization, Network Pruning, Guidelines
  • Convolutional Neural Nets: Kernels, Convolutional Layers, Pooling Layers, Architectures
  • Recurrent Neural Nets: Uni/Bi-directional RNNs, Stacked RNNs, Loss Functions, LSTMs, GRUs
  • Autoencoders:
  • Transformers:

Computer Vision

  • Image Processing: OpenCV Functionality, Transforming, Drawing, Masking, Kernels, Color Channels
  • Convolutional Neural Nets: Convolution Layers, Pooling Layers, Architectures
  • Network Training: Backpropagation, Implementation from Scratch

Natural Language Processing

  • Basics: Regular Expressions, Tokenization, Lemmization, Stemming, NLTK
  • Classical Learning: N-Gram Model, Naive Bayes, Logistic Regression, Sentiment Analysis
  • Embeddings: Frequency Semantics, Word2Vec, Doc2Vec, gensim,
  • Recurrent Neural Nets:

Others

  • Topological Data Analysis
  • Geodesic Regression:
  • Frechet Regression: