1998
CNN / LeNet Paper
LeCun's convolutional neural network for handwritten digit recognition. Convolution filters slide over the image to extract feature maps, then pooling downsamples them.
Input → [Conv → ReLU → Pool] × N → Flatten → Dense → Output class
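The Conv → ReLU → Pool step in the pipeline above can be sketched in pure Python. A minimal sketch: the 6×6 image and the vertical-edge kernel are illustrative choices, not LeNet's trained filters.

```python
# Minimal Conv -> ReLU -> Pool sketch in pure Python (illustrative
# kernel and image, not LeNet's learned parameters).

def conv2d(image, kernel):
    """Slide the kernel over the image (valid padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Zero out negative responses."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool2x2(feature_map):
    """Shrink the feature map by taking the max of each 2x2 window."""
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, len(feature_map[0]) - 1, 2)]
            for i in range(0, len(feature_map) - 1, 2)]

# A vertical-edge detector applied to a 6x6 image whose right half is bright:
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
pooled = max_pool2x2(relu(conv2d(image, edge_kernel)))  # 6x6 -> 4x4 -> 2x2
```

The pooled output responds strongly wherever the 0→1 edge falls under the kernel, which is the "extract features, then shrink" idea in miniature.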
1997
LSTM (Long Short-Term Memory) Paper
Hochreiter & Schmidhuber's solution to vanishing gradients. Three gates (forget, input, output) control what to remember, add, and output from the cell state.
Solves RNN's vanishing gradient problem with gated memory; enables Seq2Seq translation and ELMo embeddings.
fₜ = σ(forget) iₜ = σ(input) oₜ = σ(output) cₜ = fₜ⊙cₜ₋₁ + iₜ⊙tanh(candidate) hₜ = oₜ⊙tanh(cₜ)
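A single cell step following these gate equations can be written as a pure-Python sketch. The scalar weights (all 0.5) are illustrative assumptions; a real LSTM uses learned weight matrices over vectors.

```python
import math

# One LSTM cell step in pure Python. Scalar weights, hand-picked for
# illustration -- not trained values.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """Apply forget/input/output gates, then update cell and hidden state."""
    f = sigmoid(W["f_x"] * x + W["f_h"] * h_prev + W["f_b"])  # forget gate
    i = sigmoid(W["i_x"] * x + W["i_h"] * h_prev + W["i_b"])  # input gate
    o = sigmoid(W["o_x"] * x + W["o_h"] * h_prev + W["o_b"])  # output gate
    c_tilde = math.tanh(W["c_x"] * x + W["c_h"] * h_prev + W["c_b"])
    c = f * c_prev + i * c_tilde   # keep old memory, add gated new candidate
    h = o * math.tanh(c)           # expose a gated view of the cell state
    return h, c

W = {k: 0.5 for k in ["f_x", "f_h", "f_b", "i_x", "i_h", "i_b",
                      "o_x", "o_h", "o_b", "c_x", "c_h", "c_b"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:         # run a short sequence through the cell
    h, c = lstm_step(x, h, c, W)
```

Because the cell state update is additive (fₜ⊙cₜ₋₁ + …) rather than a repeated squashing, gradients flow through cₜ without vanishing the way they do in a plain RNN.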
1995
SVM (Support Vector Machine) Paper
Vapnik's maximum-margin classifier — find the hyperplane that separates classes with the widest possible margin. Support vectors define the boundary.
Extends k-NN's distance-based idea with kernel tricks for non-linear boundaries; dominated ML before AlexNet proved deep learning superior.
maximize margin = 2/||w|| subject to yᵢ(w·xᵢ + b) ≥ 1
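The objective above can be checked numerically for a candidate hyperplane. This is a sketch, not a trained SVM: the four 2-D points and the hyperplane x₁ + x₂ = 3 are assumptions chosen so the constraints are tight on both sides.

```python
# Check the max-margin constraints y_i(w.x_i + b) >= 1 and compute the
# margin 2/||w|| for a hand-picked hyperplane (illustrative toy data).

def decision(w, b, x):
    """Signed distance proxy: w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin(w):
    """Width of the margin for a canonical hyperplane: 2/||w||."""
    return 2.0 / sum(wi * wi for wi in w) ** 0.5

points = [([2.0, 2.0], +1), ([3.0, 3.0], +1),
          ([1.0, 1.0], -1), ([0.0, 1.0], -1)]
w, b = [1.0, 1.0], -3.0  # candidate separator: x1 + x2 = 3

# Every point must sit on the correct side, outside the margin:
feasible = all(y * decision(w, b, x) >= 1 for x, y in points)
# Support vectors are the points that touch the margin, y(w.x + b) = 1:
support_vectors = [x for x, y in points if abs(y * decision(w, b, x) - 1) < 1e-9]
```

Here (2, 2) and (1, 1) satisfy the constraint with equality, so they are the support vectors: moving any other point would not change the separator.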
1977
GMM + EM Algorithm Paper
Fit a mixture of Gaussians to data using Expectation-Maximization. E-step: soft cluster assignment. M-step: update parameters. Iterate until convergence.
Applies Bayes' Theorem to unsupervised clustering with latent variables; EM's iterative approach later inspires VAE's variational inference.
E: P(k|xᵢ) = πₖN(xᵢ|μₖ,σₖ) / Σⱼ πⱼN(xᵢ|μⱼ,σⱼ) M: update μ,σ,π
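The E-step/M-step loop above can be sketched for a 1-D two-component mixture. The six data points and the starting parameters are illustrative guesses, not from the paper.

```python
import math

# EM for a 1-D two-component Gaussian mixture, matching the update above:
# E-step computes soft responsibilities P(k|x_i), M-step re-estimates
# pi, mu, sigma. Toy data and initial guesses are illustrative.

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_step(data, pis, mus, sigmas):
    # E-step: responsibility of each component k for each point x_i.
    resp = []
    for x in data:
        weights = [pis[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
        total = sum(weights)
        resp.append([w / total for w in weights])
    # M-step: update mixing weights, means, and standard deviations.
    n = len(data)
    nk = [sum(r[k] for r in resp) for k in range(2)]
    pis = [nk[k] / n for k in range(2)]
    mus = [sum(r[k] * x for r, x in zip(resp, data)) / nk[k] for k in range(2)]
    sigmas = [max(1e-6, math.sqrt(sum(r[k] * (x - mus[k]) ** 2
                                      for r, x in zip(resp, data)) / nk[k]))
              for k in range(2)]
    return pis, mus, sigmas

data = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]        # two well-separated clusters
pis, mus, sigmas = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
for _ in range(20):                          # iterate until (near) convergence
    pis, mus, sigmas = em_step(data, pis, mus, sigmas)
```

With clusters this separated the responsibilities become nearly hard assignments, and the means converge to the two cluster averages (≈0.1 and ≈5.03).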
2001
Random Forest Paper
Breiman's ensemble of decision trees — each tree trained on a random subset of data and features. Final prediction by majority vote.
Ensembles many Decision Trees via bagging to reduce overfitting; the ensemble idea is refined by GBDT and XGBoost using boosting instead.
prediction = mode(tree₁(x), tree₂(x), ..., treeₙ(x)) — majority vote of random trees
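The vote above can be sketched with depth-1 stumps standing in for full decision trees, each fit on a bootstrap sample. The 1-D dataset is illustrative, and feature subsampling is omitted since there is only one feature.

```python
import random

# Toy random forest: "trees" are threshold stumps, each trained on a
# bootstrap resample; prediction is the majority vote. Illustrative data.

def fit_stump(sample):
    """Pick the threshold and sign that best split the bootstrap sample."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for sign in (+1, -1):
            correct = sum(1 for x, y in sample
                          if (sign if x >= t else -sign) == y)
            if best is None or correct > best[0]:
                best = (correct, t, sign)
    _, t, sign = best
    return lambda x: sign if x >= t else -sign

def forest_predict(trees, x):
    votes = sum(tree(x) for tree in trees)   # majority vote of the stumps
    return 1 if votes >= 0 else -1

random.seed(0)
data = [(0.0, -1), (1.0, -1), (2.0, -1), (5.0, 1), (6.0, 1), (7.0, 1)]
# Each "tree" sees a bootstrap resample (sampling with replacement):
trees = [fit_stump([random.choice(data) for _ in data]) for _ in range(25)]
pred = forest_predict(trees, 6.5)
```

Individual stumps vary with their bootstrap sample, but the vote averages that variance away, which is the point of bagging.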
1997
AdaBoost (Adaptive Boosting) Paper
Freund & Schapire's adaptive boosting — train weak classifiers sequentially, each focusing on mistakes of previous ones by upweighting misclassified samples.
H(x) = sign(Σ αₜhₜ(x)) where αₜ = ½ ln((1-εₜ)/εₜ) — weight by accuracy
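The update above can be sketched in pure Python with threshold stumps as the weak learners hₜ. The 1-D dataset (an interval pattern no single stump can fit) and the best_stump helper are illustrative assumptions.

```python
import math

# AdaBoost with decision stumps: each round fits a stump on the current
# weights, weights it by alpha_t = 0.5*ln((1-eps)/eps), and upweights the
# samples it misclassified. Toy 1-D data, for illustration only.

def best_stump(data, w):
    """Weak learner: threshold split minimizing weighted error eps_t."""
    best = None
    for t in sorted({x for x, _ in data}):
        for sign in (+1, -1):
            preds = [(sign if x >= t else -sign) for x, _ in data]
            eps = sum(wi for wi, (_, y), p in zip(w, data, preds) if p != y)
            if best is None or eps < best[0]:
                best = (eps, t, sign)
    return best

def adaboost(data, rounds=5):
    n = len(data)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        eps, t, sign = best_stump(data, w)
        eps = max(eps, 1e-10)              # guard against division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, t, sign))
        # Upweight misclassified samples, downweight correct ones, renormalize:
        w = [wi * math.exp(-alpha * y * (sign if x >= t else -sign))
             for wi, (x, y) in zip(w, data)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1         # H(x) = sign(sum alpha_t h_t(x))

data = [(0.0, -1), (1.0, +1), (2.0, +1), (3.0, -1)]
ensemble = adaboost(data)
```

No single stump can label this −, +, +, − pattern, but the weighted combination of stumps classifies all four points correctly, which is exactly the boost.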