Section IV · 1990s–2001

The Golden Age of Statistical Learning

SVM, LSTM, Random Forest — the classics that powered ML before deep learning took over.

1998

CNN / LeNet Paper

LeCun's convolutional neural network for handwritten digit recognition. Convolution filters slide over the image to extract local features; pooling then shrinks the resulting feature maps.

Adds Backpropagation training to Neocognitron's hierarchical design; directly leads to AlexNet and ResNet.
Input → [Conv → ReLU → Pool] × N → Flatten → Dense → Output class
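One Conv → ReLU → Pool stage of the pipeline above can be sketched in plain numpy. This is a minimal illustration, not LeNet itself; the 6×6 "image" and the edge filter are made up for the example.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D cross-correlation: slide the filter over the image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling shrinks the feature map."""
    h, w = fmap.shape
    return fmap[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge = np.array([[-1., 1.], [-1., 1.]])          # vertical-edge filter
fmap = np.maximum(conv2d(img, edge), 0)          # Conv -> ReLU
pooled = max_pool(fmap)                          # Pool
print(pooled.shape)  # (2, 2)
```

Stacking several such stages, then flattening into dense layers, gives the full `[Conv → ReLU → Pool] × N` recipe (LeNet itself used tanh rather than ReLU).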
1997

LSTM (Long Short-Term Memory) Paper

Hochreiter & Schmidhuber's solution to vanishing gradients. Three gates (forget, input, output) control what to remember, add, and output from the cell state.

Solves RNN's vanishing gradient problem with gated memory; enables Seq2Seq translation and ELMo embeddings.
fₜ = σ(forget)   iₜ = σ(input)   oₜ = σ(output)   cₜ = fₜ⊙cₜ₋₁ + iₜ⊙tanh(…)
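The three-gate update above can be sketched as a single numpy cell step. This assumes the common layout where the four gate pre-activations are stacked in one weight matrix; the weights here are random placeholders, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x] to the 4 stacked gate pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n])          # forget gate: what to keep from c_prev
    i = sigmoid(z[n:2*n])        # input gate: how much new content to add
    o = sigmoid(z[2*n:3*n])      # output gate: what to expose as h
    g = np.tanh(z[3*n:4*n])      # candidate cell content
    c = f * c_prev + i * g       # gated cell-state update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # run a length-5 sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (4,)
```

The additive `c = f*c_prev + i*g` path is the point: gradients flow through the cell state without repeated squashing, which is what tames the vanishing gradient.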
1995

SVM (Support Vector Machine) Paper

Vapnik's maximum-margin classifier — find the hyperplane that separates classes with the widest possible margin. Support vectors define the boundary.

Extends k-NN's distance-based idea with kernel tricks for non-linear boundaries; dominated ML before AlexNet proved deep learning superior.
maximize margin = 2/||w||   subject to yᵢ(w·xᵢ + b) ≥ 1
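The objective above is a quadratic program in Vapnik's formulation; as a runnable stand-in, here is a sub-gradient sketch (Pegasos-style) of the same regularized hinge-loss objective on made-up toy data. It is an illustration of the max-margin idea, not the original solver.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimize lam/2 ||w||^2 + mean hinge loss by stochastic sub-gradient
    descent (a stand-in for the original quadratic program)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, epochs * n + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)                  # decaying step size
        margin = y[i] * (X[i] @ w + b)
        w *= (1 - eta * lam)                   # shrinkage keeps ||w|| small
        if margin < 1:                         # hinge-loss violation
            w += eta * y[i] * X[i]
            b += eta * y[i]
    return w, b

# linearly separable toy data: two clusters
X = np.array([[2., 2.], [3., 3.], [2.5, 3.],
              [-2., -2.], [-3., -2.5], [-2.5, -3.]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm(X, y)
acc = (np.sign(X @ w + b) == y).mean()
print(acc)  # 1.0
```

Shrinking `w` while enforcing yᵢ(w·xᵢ + b) ≥ 1 is exactly the "widest margin" trade-off; kernels replace the dot products to get non-linear boundaries.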
1977

GMM + EM Algorithm Paper

Fit a mixture of Gaussians to data using Expectation-Maximization. E-step: soft cluster assignment. M-step: update parameters. Iterate until convergence.

Applies Bayes' Theorem to unsupervised clustering with latent variables; EM's iterative approach later inspires VAE's variational inference.
E: P(k|xᵢ) = πₖN(xᵢ|μₖ,σₖ) / Σⱼ πⱼN(xᵢ|μⱼ,σⱼ)   M: update μ,σ,π
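The E/M alternation above, sketched for a 1-D two-component mixture in numpy. The quantile-based initialization is an assumption for the sketch, not part of the original algorithm.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, k=2, iters=50):
    """EM for a 1-D Gaussian mixture: E-step soft-assigns points,
    M-step re-estimates weights, means, and standard deviations."""
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread-out init
    sigma = np.full(k, x.std())
    for _ in range(iters):
        # E-step: responsibility r[i, j] = P(component j | x_i)
        r = pi * gaussian(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update parameters from soft counts
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
pi, mu, sigma = em_gmm_1d(x)
print(np.sort(mu))  # means recovered near -4 and 4
```

The E-step is precisely the posterior P(k|xᵢ) from the formula above; the M-step is a responsibility-weighted maximum-likelihood update.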
2001

Random Forest Paper

Breiman's ensemble of decision trees — each tree trained on a random subset of data and features. Final prediction by majority vote.

Ensembles many Decision Trees via bagging to reduce overfitting; the ensemble idea is refined by GBDT and XGBoost using boosting instead.
prediction = mode(tree₁(x), tree₂(x), ..., treeₙ(x)) — majority vote of random trees
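The bagging-plus-voting recipe above can be sketched with depth-1 stumps standing in for full decision trees. Real forests grow deep trees and subsample features at every split; here each stump just picks one random feature, which keeps the sketch short while showing both sources of randomness.

```python
import numpy as np

def fit_stump(X, y):
    """Best depth-1 split on one randomly chosen feature (the 'random' part)."""
    j = np.random.randint(X.shape[1])
    best = None
    for t in np.unique(X[:, j]):
        for left, right in [(-1, 1), (1, -1)]:
            pred = np.where(X[:, j] <= t, left, right)
            acc = (pred == y).mean()
            if best is None or acc > best[0]:
                best = (acc, j, t, left, right)
    return best[1:]

def stump_predict(stump, X):
    j, t, left, right = stump
    return np.where(X[:, j] <= t, left, right)

def random_forest(X, y, n_trees=25, seed=0):
    """Bagging: each tree sees a bootstrap resample of the rows."""
    np.random.seed(seed)
    return [fit_stump(X[(idx := np.random.randint(len(X), size=len(X)))],
                      y[idx]) for _ in range(n_trees)]

def forest_predict(forest, X):
    votes = np.stack([stump_predict(s, X) for s in forest])
    return np.sign(votes.sum(axis=0))              # majority vote

X = np.array([[3., 3.], [4., 4.], [5., 5.], [10., 10.], [11., 11.], [12., 12.]])
y = np.array([-1, -1, -1, 1, 1, 1])
forest = random_forest(X, y)
print(forest_predict(forest, np.array([[0., 0.], [20., 20.]])))  # -1 then +1
```

Because each tree sees a different bootstrap sample, their errors decorrelate, and the vote averages them away; that is the overfitting reduction the entry describes.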
1997

AdaBoost Paper

Freund & Schapire's adaptive boosting — train weak classifiers sequentially, each focusing on mistakes of previous ones by upweighting misclassified samples.

Introduces sequential boosting of Decision Tree stumps; directly inspires GBDT (gradient-based boosting) and XGBoost.
H(x) = sign(Σ αₜhₜ(x))   where αₜ = ½ ln((1-εₜ)/εₜ) — weight by accuracy
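The reweighting loop and the αₜ formula above, sketched with decision stumps in numpy. The toy 1-D labels are chosen so no single stump is perfect but a few boosted rounds are.

```python
import numpy as np

def best_stump(X, y, w):
    """Depth-1 split minimizing the weighted error over features/thresholds."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] <= t, -1, 1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, t, sgn))
    return best

def stump_pred(stump, X):
    j, t, sgn = stump
    return sgn * np.where(X[:, j] <= t, -1, 1)

def adaboost(X, y, rounds=5):
    n = len(X)
    w = np.full(n, 1.0 / n)                     # uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, stump = best_stump(X, y, w)
        err = max(err, 1e-10)                   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight by accuracy
        pred = stump_pred(stump, X)
        w *= np.exp(-alpha * y * pred)          # upweight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_pred(s, X) for a, s in ensemble))

X = np.array([[0.], [1.], [2.], [3.]])   # 1-D toy data
y = np.array([1, 1, -1, 1])              # no single threshold fits this
ens = adaboost(X, y)
print(np.all(predict(ens, X) == y))      # True
```

Each round the misclassified points gain weight, so the next stump is forced to focus on them; the final H(x) is exactly the accuracy-weighted vote in the formula above.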