Muchang Bahng

Personal Notes: Data Science

The following are the topics I studied (and planned to study) during my time in the Korean military, until November 2022. As these notes are primarily for my personal use, I did not spend as much time writing them in a manner that is clear for all readers, but since it would be a waste not to share them, I have uploaded them to this website. All of my personal notes are free to download, use, and distribute under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. Please contact me if you find any errors in my notes or have any further questions. I typeset my notes with the LaTeX editor Overleaf; diagrams are usually drawn with the TikZ package or iPad Notes.

Frequentist Statistics

  • Sampling Distributions: Confidence Intervals, Hypothesis Testing, Central Limit Theorem
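
As a quick illustration of the confidence-interval and CLT material above (a toy example of my own, not taken from the notes), here is a large-sample 95% interval for a mean, using the normal approximation:

```python
import math
import random

random.seed(0)

# Sample from an Exponential(1) population (true mean = 1).
sample = [random.expovariate(1.0) for _ in range(1000)]

n = len(sample)
mean = sum(sample) / n
# Sample standard deviation with Bessel's correction.
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# By the CLT the sample mean is approximately normal, so a 95%
# confidence interval uses the z-quantile 1.96.
half_width = 1.96 * sd / math.sqrt(n)
ci = (mean - half_width, mean + half_width)
print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```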

Bayesian Statistics

  • Bayes Rule: Prior & Posterior Distributions, Likelihood, Marginalization, Bayes Box, Common Distributions, Beta Family, Multivariate Gaussians
  • Bayesian Inference: Parameter Estimation, Beta-Binomial Distribution, Conjugate Distributions & Priors, Credible Intervals, Point Estimates, Exponential Family
  • Linear Regression: Bayesian & Frequentist Regression, Basis Functions, Hyperparameters & Hierarchical Priors, Parameter Distributions & Predictive Functions, Bayesian Model Selection & Averaging, Gaussian Error/OLS & Laplace Error/LAV, L1 & L2 Regularization w/ Laplace & Gaussian Priors, Sparse Models, Equivalent Kernel
  • Markov Chain Monte Carlo: Metropolis-Hastings, Detailed Balance, Monte Carlo Integration, Gibbs Sampling
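
The random-walk Metropolis-Hastings algorithm from the MCMC bullet fits in a few lines; this sketch (my own toy example, targeting a standard normal with a Gaussian proposal) shows why the symmetric-proposal case is so simple:

```python
import math
import random

random.seed(42)

def log_target(x):
    # Unnormalized log-density of a standard normal target.
    return -0.5 * x * x

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    """Random-walk Metropolis: propose x' ~ N(x, step^2) and accept
    with probability min(1, p(x')/p(x)); this satisfies detailed
    balance, so the chain's stationary distribution is the target."""
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)
        # Symmetric proposal, so the Hastings ratio reduces to the
        # ratio of target densities.
        if math.log(random.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings(20000)
burned = samples[2000:]          # discard burn-in
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
```

The empirical mean and variance should be close to the target's 0 and 1.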

Introduction to Machine Learning

  • Regression: Least-Squares, Normal Equations, Batch/Stochastic Gradient Descent, Polynomial Regression
  • Classification: K-Nearest Neighbors, Perceptron, Logistic Regression
  • GLMs: Exponential Family, Link Functions, GLM Construction, Softmax Regression, Poisson Regression
  • Generative Learning Algorithms: Gaussian Discriminant Analysis, Naive Bayes, Laplace Smoothing
  • Kernel Methods: Feature Maps, Kernel Trick
  • SVM: Functional & Geometric Margins, Optimal Margin Classifiers, Lagrange Duality, Primal vs Dual Optimization
  • Deep Learning: Nonlinear Regression, Mini-batch SGD, Activation Functions (ReLU), 2-Layer & Multilayered Neural Networks, Vectorization, Backpropagation, Convolutional Neural Networks, Graph Neural Networks
  • Decision Trees: Recursive Binary Splitting (Greedy Algorithms), Classification Error, Discrete/Continuous Features, Overfitting, Pruning Trees, Random Forest
  • Unsupervised Learning: K-Means, Mixture of Gaussians, EM-Algorithm, Convexity, Evidence Lower Bound
  • PCA: Factor Analysis, EM Algorithm, Component Eigenvectors, SVD, Eigenfaces
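
PCA via the SVD, as in the last bullet, amounts to centering the data and reading off the right singular vectors; a small numpy sketch on synthetic correlated data (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: the leading principal component should lie
# roughly along the direction (1, 1) / sqrt(2).
t = rng.normal(size=(500, 1))
X = np.hstack([t, t]) + 0.1 * rng.normal(size=(500, 2))

# Center, then SVD: X_c = U S V^T; the rows of V^T are the
# principal axes, ordered by singular value.
X_c = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)

pc1 = Vt[0]                      # first principal component
explained = S**2 / np.sum(S**2)  # variance ratio per component

# Project onto the top component (dimensionality reduction to 1-D).
scores = X_c @ pc1
```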

Classical Machine Learning

  • Statistical Learning Theory
  • Low and High Dimensional Linear Regression and Classification
  • Low and High Dimensional Nonparametric Regression and Classification
  • Cross Validation
  • Decision Theory
  • Generalized Linear Models
  • Boosting and Bagging
  • Density Estimation and Clustering
  • Graphical Models
  • Factor Analysis
  • Dimensionality Reduction
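
Of the topics above, cross validation is simple enough to sketch directly; this plain-Python k-fold splitter (illustrative, not from the notes) shows the key invariant that every point is held out exactly once:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index pairs for k-fold cross validation.

    Each of the n samples appears in exactly one test fold, so model
    error can be averaged over the k held-out folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(10, 5))
```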

Sampling and Optimization

  • Sampling and optimization are at the heart of statistics, so it is worth having a set of notes that outlines the main sampling and optimization algorithms and the theory behind them. I focus on convex optimization.
  • Random Walk Metropolis w/ Preconditioning & Adaptation, Automatic Differentiation, Gradient Descent, SGLD, MALA
  • Phase Flows, Hamiltonian Integration, Langevin Integration, Leapfrog Integrator, Splitting Methods
  • Hamiltonian Monte Carlo, NUTS
  • Newton's Optimization Method, BFGS, Simulated Annealing, Adam
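
The leapfrog integrator listed above is the workhorse behind Hamiltonian Monte Carlo; a sketch for a separable Hamiltonian H(q, p) = p²/2 + U(q), using the harmonic potential U(q) = q²/2 as a stand-in target:

```python
def leapfrog(q, p, grad_U, step, n_steps):
    """Leapfrog (Stoermer-Verlet) integration: half momentum step,
    alternating full position/momentum steps, half momentum step.
    The scheme is symplectic, so the energy error stays bounded
    instead of drifting over long trajectories."""
    p = p - 0.5 * step * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + step * p
        p = p - step * grad_U(q)
    q = q + step * p
    p = p - 0.5 * step * grad_U(q)
    return q, p

# Harmonic oscillator: U(q) = q^2 / 2, so grad U(q) = q.
grad_U = lambda q: q
q0, p0 = 1.0, 0.0
# Integrate for time 6.28, roughly one full period of the oscillator.
q1, p1 = leapfrog(q0, p0, grad_U, step=0.01, n_steps=628)

H0 = 0.5 * p0**2 + 0.5 * q0**2
H1 = 0.5 * p1**2 + 0.5 * q1**2
```

After one period the trajectory returns near its start and the Hamiltonian is conserved up to O(step²), which is exactly the property HMC exploits for high acceptance rates.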

Deep Learning

  • Multilayer Perceptron: Activation Functions, Preprocessing, Weight Initialization, Weight Space Symmetries
  • Network Training: Automatic Differentiation, Forward/Back Propagation, Numpy Implementation, PyTorch
  • Regularization and Stability: Early Stopping, Dropout, L1/L2 Penalty Terms, Max Norm Regularization, Normalization Layers, Data Augmentation, Sharpness Awareness Maximization, Network Pruning, Guidelines
  • Convolutional Neural Nets: Kernels, Convolutional Layers, Pooling Layers, Architectures
  • Recurrent Neural Nets: Uni/Bi-directional RNNs, Stacked RNNs, Loss Functions, LSTMs, GRUs
  • Autoencoders
  • Transformers
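
The forward/backprop mechanics from the Network Training bullet can be shown end to end with a tiny numpy network; this 2-layer net on XOR is my own toy example (sigmoid activations, mean-squared error, full-batch gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR is not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
lr = 1.0

for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: deltas include the sigmoid derivative s(1 - s).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent updates, layer by layer.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;  b1 -= lr * d_h.sum(axis=0)

loss = float(np.mean((out - y) ** 2))
preds = (out > 0.5).astype(int)
```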

Computer Vision

  • Image Processing: OpenCV Functionality, Transforming, Drawing, Masking, Kernels, Color Channels
  • Convolutional Neural Nets: Convolution Layers, Pooling Layers, Architectures
  • Network Training: Backpropagation, Implementation from Scratch
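
The core operation behind the "Implementation from Scratch" bullet is a sliding dot product; here is a minimal valid-mode 2-D convolution in numpy (strictly speaking cross-correlation, as in most deep learning libraries), applied to a toy vertical-edge image of my own:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel over the
    image and sum the elementwise products at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Dark left half, bright right half; a horizontal difference kernel
# should fire only at the vertical edge between them.
image = np.zeros((5, 5))
image[:, 2:] = 1.0
kernel = np.array([[-1.0, 1.0]])

out = conv2d(image, kernel)
```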

Natural Language Processing

  • Basics: Regular Expressions, Tokenization, Lemmatization, Stemming, NLTK
  • Classical Learning: N-Gram Model, Naive Bayes, Logistic Regression, Sentiment Analysis
  • Embeddings: Frequency Semantics, Word2Vec, Doc2Vec, gensim
  • Recurrent Neural Nets
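
The N-gram model from the Classical Learning bullet reduces to conditional counts; a tiny bigram model in plain Python (toy corpus of my own, maximum-likelihood estimates with no smoothing):

```python
from collections import Counter

corpus = "the cat sat on the mat and the cat slept".split()

# Count bigrams and the contexts they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(corpus[:-1])

def bigram_prob(w1, w2):
    """MLE estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / context[w1] if context[w1] else 0.0

p = bigram_prob("the", "cat")  # "the" is followed by "cat" 2 of 3 times
```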