Section IV · 1990s–2001

The Golden Age of Statistical Learning

SVM, LSTM, Random Forest — the classics that powered ML before deep learning took over.

1998

CNN / LeNet Paper

LeCun's convolutional neural network for handwritten digit recognition. Convolution filters slide over the image to extract local features; pooling then shrinks the resulting feature maps.

Adds Backpropagation training to Neocognitron's hierarchical design; directly leads to AlexNet and ResNet.
Input → [Conv → ReLU → Pool] × N → Flatten → Dense → Output class
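One Conv → ReLU → Pool stage of the pipeline above can be sketched in plain numpy. This is a minimal illustration, not LeNet itself; the 6×6 "image" and the edge filter are made up for the example.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D cross-correlation: slide the filter over the image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling shrinks the feature map."""
    h, w = fmap.shape
    return fmap[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge = np.array([[-1., 1.], [-1., 1.]])          # vertical-edge filter
fmap = np.maximum(conv2d(img, edge), 0)          # Conv -> ReLU
pooled = max_pool(fmap)                          # Pool
print(pooled.shape)  # (2, 2)
```

Stacking several such stages, then flattening into dense layers, gives the full `[Conv → ReLU → Pool] × N` recipe (LeNet itself used tanh rather than ReLU).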
1997

LSTM (Long Short-Term Memory) Paper

Hochreiter & Schmidhuber's solution to vanishing gradients. Three gates (forget, input, output) control what to remember, add, and output from the cell state.

Solves RNN's vanishing gradient problem with gated memory; enables Seq2Seq translation and ELMo embeddings.
fₜ = σ(forget)   iₜ = σ(input)   oₜ = σ(output)   cₜ = fₜ⊙cₜ₋₁ + iₜ⊙tanh(…)
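The three-gate update above can be sketched as a single numpy cell step. This assumes the common layout where the four gate pre-activations are stacked in one weight matrix; the weights here are random placeholders, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x] to the 4 stacked gate pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n])          # forget gate: what to keep from c_prev
    i = sigmoid(z[n:2*n])        # input gate: how much new content to add
    o = sigmoid(z[2*n:3*n])      # output gate: what to expose as h
    g = np.tanh(z[3*n:4*n])      # candidate cell content
    c = f * c_prev + i * g       # gated cell-state update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # run a length-5 sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (4,)
```

The additive `c = f*c_prev + i*g` path is the point: gradients flow through the cell state without repeated squashing, which is what tames the vanishing gradient.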
1995

SVM (Support Vector Machine) Paper

Vapnik's maximum-margin classifier — find the hyperplane that separates classes with the widest possible margin. Support vectors define the boundary.

Extends k-NN's distance-based idea with kernel tricks for non-linear boundaries; dominated ML before AlexNet proved deep learning superior.
maximize margin = 2/||w||   subject to yᵢ(w·xᵢ + b) ≥ 1
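The objective above is a quadratic program in Vapnik's formulation; as a runnable stand-in, here is a sub-gradient sketch (Pegasos-style) of the same regularized hinge-loss objective on made-up toy data. It is an illustration of the max-margin idea, not the original solver.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimize lam/2 ||w||^2 + mean hinge loss by stochastic sub-gradient
    descent (a stand-in for the original quadratic program)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, epochs * n + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)                  # decaying step size
        margin = y[i] * (X[i] @ w + b)
        w *= (1 - eta * lam)                   # shrinkage keeps ||w|| small
        if margin < 1:                         # hinge-loss violation
            w += eta * y[i] * X[i]
            b += eta * y[i]
    return w, b

# linearly separable toy data: two clusters
X = np.array([[2., 2.], [3., 3.], [2.5, 3.],
              [-2., -2.], [-3., -2.5], [-2.5, -3.]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm(X, y)
acc = (np.sign(X @ w + b) == y).mean()
print(acc)  # 1.0
```

Shrinking `w` while enforcing yᵢ(w·xᵢ + b) ≥ 1 is exactly the "widest margin" trade-off; kernels replace the dot products to get non-linear boundaries.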
1977

GMM + EM Algorithm Paper

Fit a mixture of Gaussians to data using Expectation-Maximization. E-step: soft cluster assignment. M-step: update parameters. Iterate until convergence.

Applies Bayes' Theorem to unsupervised clustering with latent variables; EM's iterative approach later inspires VAE's variational inference.
E: P(k|xᵢ) = πₖN(xᵢ|μₖ,σₖ) / Σⱼ πⱼN(xᵢ|μⱼ,σⱼ)   M: update μ,σ,π
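The E/M alternation above, sketched for a 1-D two-component mixture in numpy. The quantile-based initialization is an assumption for the sketch, not part of the original algorithm.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, k=2, iters=50):
    """EM for a 1-D Gaussian mixture: E-step soft-assigns points,
    M-step re-estimates weights, means, and standard deviations."""
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread-out init
    sigma = np.full(k, x.std())
    for _ in range(iters):
        # E-step: responsibility r[i, j] = P(component j | x_i)
        r = pi * gaussian(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update parameters from soft counts
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
pi, mu, sigma = em_gmm_1d(x)
print(np.sort(mu))  # means recovered near -4 and 4
```

The E-step is precisely the posterior P(k|xᵢ) from the formula above; the M-step is a responsibility-weighted maximum-likelihood update.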
2001

Random Forest Paper

Breiman's ensemble of decision trees — each tree trained on a random subset of data and features. Final prediction by majority vote.

Ensembles many Decision Trees via bagging to reduce overfitting; the ensemble idea is refined by GBDT and XGBoost using boosting instead.
prediction = mode(tree₁(x), tree₂(x), ..., treeₙ(x)) — majority vote of random trees
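The bagging-plus-voting recipe above can be sketched with depth-1 stumps standing in for full decision trees. Real forests grow deep trees and subsample features at every split; here each stump just picks one random feature, which keeps the sketch short while showing both sources of randomness.

```python
import numpy as np

def fit_stump(X, y):
    """Best depth-1 split on one randomly chosen feature (the 'random' part)."""
    j = np.random.randint(X.shape[1])
    best = None
    for t in np.unique(X[:, j]):
        for left, right in [(-1, 1), (1, -1)]:
            pred = np.where(X[:, j] <= t, left, right)
            acc = (pred == y).mean()
            if best is None or acc > best[0]:
                best = (acc, j, t, left, right)
    return best[1:]

def stump_predict(stump, X):
    j, t, left, right = stump
    return np.where(X[:, j] <= t, left, right)

def random_forest(X, y, n_trees=25, seed=0):
    """Bagging: each tree sees a bootstrap resample of the rows."""
    np.random.seed(seed)
    return [fit_stump(X[(idx := np.random.randint(len(X), size=len(X)))],
                      y[idx]) for _ in range(n_trees)]

def forest_predict(forest, X):
    votes = np.stack([stump_predict(s, X) for s in forest])
    return np.sign(votes.sum(axis=0))              # majority vote

X = np.array([[3., 3.], [4., 4.], [5., 5.], [10., 10.], [11., 11.], [12., 12.]])
y = np.array([-1, -1, -1, 1, 1, 1])
forest = random_forest(X, y)
print(forest_predict(forest, np.array([[0., 0.], [20., 20.]])))  # -1 then +1
```

Because each tree sees a different bootstrap sample, their errors decorrelate, and the vote averages them away; that is the overfitting reduction the entry describes.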
1997

AdaBoost Paper

Freund & Schapire's adaptive boosting — train weak classifiers sequentially, each focusing on mistakes of previous ones by upweighting misclassified samples.

Introduces sequential boosting of Decision Tree stumps; directly inspires GBDT (gradient-based boosting) and XGBoost.
H(x) = sign(Σ αₜhₜ(x))   where αₜ = ½ ln((1-εₜ)/εₜ) — weight by accuracy
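The reweighting loop and the αₜ formula above, sketched with decision stumps in numpy. The toy 1-D labels are chosen so no single stump is perfect but a few boosted rounds are.

```python
import numpy as np

def best_stump(X, y, w):
    """Depth-1 split minimizing the weighted error over features/thresholds."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] <= t, -1, 1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, t, sgn))
    return best

def stump_pred(stump, X):
    j, t, sgn = stump
    return sgn * np.where(X[:, j] <= t, -1, 1)

def adaboost(X, y, rounds=5):
    n = len(X)
    w = np.full(n, 1.0 / n)                     # uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, stump = best_stump(X, y, w)
        err = max(err, 1e-10)                   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # weight by accuracy
        pred = stump_pred(stump, X)
        w *= np.exp(-alpha * y * pred)          # upweight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_pred(s, X) for a, s in ensemble))

X = np.array([[0.], [1.], [2.], [3.]])   # 1-D toy data
y = np.array([1, 1, -1, 1])              # no single threshold fits this
ens = adaboost(X, y)
print(np.all(predict(ens, X) == y))      # True
```

Each round the misclassified points gain weight, so the next stump is forced to focus on them; the final H(x) is exactly the accuracy-weighted vote in the formula above.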