Statistics
If you were to invent statistics from scratch, how would you do it? Statistics can be seen as a ``converse'' of probability, and it is essentially a field that branches out from mathematics. In probability, one takes a distribution and attempts to describe what samples from it look like. In statistics, we are given the samples first and then try to infer what the distribution is. This is usually an extremely difficult problem, so rather than trying to describe the entire distribution, we settle for describing certain parameters of the distribution (e.g. what are the mean and variance?).
At the heart of statistics is the seemingly unrelated field of information theory, which studies how much information (e.g. bits) can be transmitted through a noisy channel. The concepts of entropy and KL-divergence provide a good transition between probability and statistics. At this point, we can branch off into two paradigms of inference. First is the frequentist approach, which treats the parameter as a fixed unknown and studies estimators of it through their sampling distributions. The second is the Bayesian approach, which places a prior distribution on the parameter; combining this prior with the data through Bayes' rule yields the posterior.
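For concreteness, the two formulas behind this transition are Bayes' rule and the KL-divergence, written here in generic notation for a discrete distribution:

    p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)},
    \qquad
    D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}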
Once the foundations of these two theories are established, the main applications lie in machine learning, which uses computer science to design algorithmic approaches to statistical inference. This field can be seen as an integration of statistics and computer science, since we heavily use the theory of algorithms to optimize objective functions determined by statistics (e.g. greedy algorithms to fit decision trees, L1 regularization as a convex approximation of best subset regression). In fact, optimization is such an important part of machine learning that it deserves its own set of notes. Here, I go through the main concepts in convex optimization, along with an index of other non-convex methods. It turns out that the fields of optimization and sampling (which is also used for numerical integration) are heavily related, as slight modifications of optimizers lead to samplers (e.g. SGD vs SGLD, as sketched below).
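To make the optimizer-to-sampler connection concrete, here is a minimal toy sketch (not taken from the notes) contrasting a plain gradient step with an SGLD step on a generic log-posterior; the gradient function grad_log_post and the step size eps are placeholders.

    import numpy as np

    def sgd_step(theta, grad_log_post, eps):
        # Plain gradient ascent on the log-posterior: an optimizer.
        return theta + eps * grad_log_post(theta)

    def sgld_step(theta, grad_log_post, eps, rng):
        # Stochastic Gradient Langevin Dynamics: the same update plus
        # Gaussian noise of variance eps, which turns the optimizer
        # into an (approximate) posterior sampler.
        noise = rng.normal(scale=np.sqrt(eps), size=theta.shape)
        return theta + 0.5 * eps * grad_log_post(theta) + noise

    # Toy usage: "sample" from a standard 2D Gaussian, whose log-density
    # has gradient -theta.
    rng = np.random.default_rng(0)
    theta = np.zeros(2)
    for _ in range(1000):
        theta = sgld_step(theta, lambda t: -t, eps=0.1, rng=rng)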
With the universal approximation theorem, better engineering, and exponentially-increasing computational power, deep neural networks have become extremely powerful models for complex and high-dimensional data. They start out as simple multilayer perceptrons, but recent research has pushed the architectures to CNNs, RNNs, LSTMs, energy models, encoder-decoders, flow models, attention layers, and most recently diffusion models. While these models are inherently black-box in nature, several heuristics and architectures have been developed to push their applications to the fields of computer vision (CV) and natural language processing (NLP). The most notable successes of these applications come in autonomous driving and large language models (LLMs).
Finally, another subfield of machine learning is reinforcement learning, which teaches agents to make decisions through trial and error in simulated environments. These models are widely used in robotics and simulations, and this framework of optimizing rewards and penalties relies heavily on game theory.
All of my personal notes are free to download, use, and distribute under the Creative Commons "Attribution-NonCommercial-ShareAlike 4.0 International" license. Please contact me if you find any errors in my notes or have any further questions.
Sampling, Optimization, and Integration
- This is at the heart of all things statistics, so it's worth making a set of notes that outlines the main sampling and optimization algorithms and the theory behind them. I focus on convex optimization.
- Random Walk Metropolis w/ Preconditioning & Adaptation, Automatic Differentiation, Gradient Descent, SGLD, MALA (a minimal random-walk Metropolis sketch follows this list)
- Phase Flows, Hamiltonian Integration, Langevin Integration, Leapfrog Integrator, Splitting Methods
- Hamiltonian Monte Carlo, NUTS
- Newton's Optimization Method, BFGS, Simulated Annealing, Adam
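As referenced above, here is a minimal random-walk Metropolis sketch for a generic (unnormalized) log-density; the proposal scale, iteration count, and Gaussian target below are arbitrary placeholder choices.

    import numpy as np

    def random_walk_metropolis(log_prob, theta0, n_iters=5000, scale=0.5, seed=0):
        # Random-walk Metropolis with an isotropic Gaussian proposal.
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        samples = []
        for _ in range(n_iters):
            proposal = theta + scale * rng.normal(size=theta.shape)
            # Accept with probability min(1, p(proposal) / p(theta)).
            if np.log(rng.uniform()) < log_prob(proposal) - log_prob(theta):
                theta = proposal
            samples.append(theta.copy())
        return np.array(samples)

    # Example: sample from a standard 2D Gaussian.
    draws = random_walk_metropolis(lambda t: -0.5 * t @ t, theta0=np.zeros(2))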
Frequentist Statistics
- Sampling Distributions: Confidence Intervals, Hypothesis Testing, Central Limit Theorem (a toy confidence-interval sketch follows)
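A toy sketch of the ideas in this set of notes, assuming i.i.d. data: by the CLT, the sample mean plus or minus 1.96 standard errors gives an approximate 95% confidence interval. The exponential data below is a placeholder.

    import numpy as np

    def normal_ci(x, z=1.96):
        # Approximate 95% CI for the mean of i.i.d. data via the CLT.
        x = np.asarray(x, dtype=float)
        mean = x.mean()
        se = x.std(ddof=1) / np.sqrt(len(x))
        return mean - z * se, mean + z * se

    rng = np.random.default_rng(0)
    lower, upper = normal_ci(rng.exponential(scale=2.0, size=500))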
Bayesian Statistics
- Bayes Rule: Prior & Posterior Distributions, Likelihood, Marginalization, Bayes Box, Common Distributions, Beta Family, Multivariate Gaussians
- Bayesian Inference: Parameter Estimation, Beta-Binomial Distribution, Conjugate Distributions & Priors, Credible Intervals, Point Estimates, Exponential Family (a Beta-Binomial conjugate-update sketch follows this list)
- Linear Regression: Bayesian & Frequentist Regression, Basis Functions, Hyperparameters & Hierarchical Priors, Parameter Distributions & Predictive Functions, Bayesian Model Selection & Averaging, Gaussian Error/OLS & Laplace Error/LAV, L1 & L2 Regularization w/ Laplace & Gaussian Priors, Sparse Models, Equivalent Kernel
- Markov Chain Monte Carlo: Metropolis-Hastings, Detailed Balance, Monte Carlo Integration, Gibbs Sampling
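As referenced above, a minimal sketch of the Beta-Binomial conjugate update: with a Beta(a, b) prior on the success probability and k successes in n Bernoulli trials, the posterior is Beta(a + k, b + n - k). The prior hyperparameters and data below are placeholders.

    from scipy import stats

    def beta_binomial_update(a, b, k, n):
        # Posterior hyperparameters after observing k successes in n trials.
        return a + k, b + (n - k)

    # Beta(2, 2) prior, 7 successes out of 10 trials.
    a_post, b_post = beta_binomial_update(2, 2, k=7, n=10)
    posterior = stats.beta(a_post, b_post)
    print(posterior.mean())          # posterior point estimate (mean)
    print(posterior.interval(0.95))  # 95% equal-tailed credible interval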
Classical Machine Learning
- Statistical Learning Theory
- Low and High Dimensional Linear Regression and Classification
- Low and High Dimensional Nonparametric Regression and Classification
- Cross Validation (a k-fold sketch follows this list)
- Decision Theory
- Generalized Linear Models
- Boosting and Bagging
- Density Estimation and Clustering
- Graphical Models
- Factor Analysis
- Dimensionality Reduction
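As referenced in the cross-validation item, a bare-bones sketch of k-fold cross-validation; the fit/predict/loss interface is a hypothetical placeholder rather than any particular library's API.

    import numpy as np

    def k_fold_cv(X, y, fit, predict, loss, k=5, seed=0):
        # Average held-out loss over k folds.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, k)
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train], y[train])
            scores.append(loss(y[test], predict(model, X[test])))
        return float(np.mean(scores))

    # Example plug-ins: ordinary least squares with squared-error loss.
    ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
    ols_predict = lambda w, X: X @ w
    mse = lambda y, yhat: float(np.mean((y - yhat) ** 2))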
Deep Learning
- Multilayer Perceptron: Activation Functions, Preprocessing, Weight Initialization, Weight Space Symmetries
- Network Training: Automatic Differentiation, Forward/Back Propagation, Numpy Implementation, PyTorch (a small numpy forward/backward sketch follows this list)
- Regularization and Stability: Early Stopping, Dropout, L1/L2 Penalty Terms, Max Norm Regularization, Normalization Layers, Data Augmentation, Sharpness Awareness Maximization, Network Pruning, Guidelines
- Convolutional Neural Nets: Kernels, Convolutional Layers, Pooling Layers, Architectures
- Recurrent Neural Nets: Uni/Bi-directional RNNs, Stacked RNNs, Loss Functions, LSTMs, GRUs
- Autoencoders:
- Transformers:
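As referenced in the network-training item, a small numpy sketch of the forward and backward pass for a one-hidden-layer network with ReLU activation and squared-error loss; the layer sizes, data, and learning rate are arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 3))                  # toy inputs
    y = rng.normal(size=(64, 1))                  # toy targets
    W1, b1 = 0.1 * rng.normal(size=(3, 16)), np.zeros(16)
    W2, b2 = 0.1 * rng.normal(size=(16, 1)), np.zeros(1)
    lr = 1e-2

    for _ in range(200):
        # Forward pass.
        h_pre = X @ W1 + b1
        h = np.maximum(h_pre, 0.0)                # ReLU
        y_hat = h @ W2 + b2
        loss = np.mean((y_hat - y) ** 2)

        # Backward pass: gradients of the mean squared error.
        g_yhat = 2.0 * (y_hat - y) / len(y)
        g_W2, g_b2 = h.T @ g_yhat, g_yhat.sum(axis=0)
        g_h = g_yhat @ W2.T
        g_pre = g_h * (h_pre > 0)                 # ReLU derivative
        g_W1, g_b1 = X.T @ g_pre, g_pre.sum(axis=0)

        # Gradient descent update.
        W1 -= lr * g_W1; b1 -= lr * g_b1
        W2 -= lr * g_W2; b2 -= lr * g_b2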
Computer Vision
- Image Processing: OpenCV Functionality, Transforming, Drawing, Masking, Kernels, Color Channels
- Convolutional Neural Nets: Convolution Layers, Pooling Layers, Architectures
- Network Training: Backpropagation, Implementation from Scratch
Natural Language Processing
- Basics: Regular Expressions, Tokenization, Lemmatization, Stemming, NLTK
- Classical Learning: N-Gram Model, Naive Bayes, Logistic Regression, Sentiment Analysis (a bigram-count sketch follows this list)
- Embeddings: Frequency Semantics, Word2Vec, Doc2Vec, gensim
- Recurrent Neural Nets:
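As referenced in the classical-learning item, a tiny bigram model with add-one (Laplace) smoothing using only the standard library; the two-sentence corpus is obviously a placeholder.

    from collections import Counter

    corpus = ["the cat sat on the mat", "the dog sat on the rug"]
    tokens = [s.split() for s in corpus]

    unigrams = Counter(w for sent in tokens for w in sent)
    bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)

    def bigram_prob(a, b):
        # P(b | a) with add-one smoothing.
        return (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)

    print(bigram_prob("the", "cat"))   # a seen pair gets higher probability
    print(bigram_prob("cat", "dog"))   # an unseen pair still gets nonzero mass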
Introduction to Machine Learning
- Regression: Least-Squares, Normal Equations, Batch/Stochastic Gradient Descent, Polynomial Regression (a normal-equations sketch follows this list)
- Classification: K-Nearest Neighbors, Perceptron, Logistic Regression
- GLMs: Exponential Family, Link Functions, GLM Construction, Softmax Regression, Poisson Regression
- Generative Learning Algorithms: Gaussian Discriminant Analysis, Naive Bayes, Laplace Smoothing
- Kernel Methods: Feature Maps, Kernel Trick
- SVM: Functional & Geometric Margins, Optimal Margin Classifiers, Lagrange Duality, Primal vs Dual Optimization
- Deep Learning: Nonlinear Regression, Mini-batch SGD, Activation Functions (ReLU), 2-Layer & Multilayered Neural Networks, Vectorization, Backpropagation, Convolutional Neural Networks, Graph Neural Networks
- Decision Trees: Recursive Binary Splitting (Greedy Algorithms), Classification Error, Discrete/Continuous Features, Overfitting, Pruning Trees, Random Forest
- Unsupervised Learning: K-Means, Mixture of Gaussians, EM-Algorithm, Convexity, Evidence Lower Bound
- PCA: Factor Analysis, EM Algorithm, Component Eigenvectors, SVD, Eigenfaces
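As referenced in the regression item, a minimal sketch of least squares via the normal equations, (X^T X) w = X^T y, on synthetic placeholder data; in practice a QR/SVD-based solver such as np.linalg.lstsq is preferred for numerical stability.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept + features
    w_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=n)

    # Normal equations: solve (X^T X) w = X^T y.
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # More numerically stable equivalent.
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)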