Section V · 2000s

The Eve of Deep Learning

Deep belief nets, autoencoders, gradient boosting, and neural language models set the stage for the deep learning revolution.

2006

Deep Belief Network (DBN) Paper

Hinton's breakthrough — train deep networks by stacking Restricted Boltzmann Machines one layer at a time. Each layer learns increasingly abstract features.

Stacks Restricted Boltzmann Machine (RBM) layers with greedy pretraining; this first widely successful recipe for training deep networks paves the way for AlexNet and all modern deep learning.
Layer 1: learn edges → Layer 2: learn shapes → Layer 3: learn objects (greedy pretraining)
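The greedy recipe above can be sketched as one contrastive-divergence (CD-1) update per RBM layer: train layer 1 on the data, then feed its hidden activations to layer 2 as if they were data. A minimal NumPy sketch with toy sizes; biases are omitted for brevity, and all names and dimensions are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(V, W, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM (biases omitted)."""
    ph0 = sigmoid(V @ W)                      # P(h=1 | v) on the data
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sample the hidden units
    pv1 = sigmoid(h0 @ W.T)                   # reconstruct the visibles
    ph1 = sigmoid(pv1 @ W)                    # hidden probs on the reconstruction
    return W + lr * (V.T @ ph0 - pv1.T @ ph1) / len(V)

# Greedy layer-wise pretraining: layer 1 learns from raw data,
# then its features become the "data" for layer 2.
X = (rng.random((100, 12)) < 0.5) * 1.0       # toy binary data
W1 = rng.normal(0, 0.1, (12, 8))              # layer 1: 12 visible -> 8 hidden
for _ in range(50):
    W1 = cd1_step(X, W1)
H1 = sigmoid(X @ W1)                          # layer-1 features
W2 = rng.normal(0, 0.1, (8, 4))               # layer 2: 8 -> 4, trained the same way
for _ in range(50):
    W2 = cd1_step(H1, W2)
```

Each layer is trained in isolation on the output of the one below it, which is what made depth tractable before end-to-end backpropagation of deep nets was reliable.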
2006

Sparse Autoencoder Paper

Compress data through a bottleneck, then reconstruct it. A sparsity constraint ensures that only a few neurons activate, forcing efficient, meaningful features.

Learns compressed representations like the DBN, but via a reconstruction loss; its encode-decode structure leads directly to the VAE, which adds probabilistic sampling.
Input → Encoder (compress) → Bottleneck (sparse code) → Decoder (reconstruct) → Output ≈ Input
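A single forward pass shows the bottleneck and the sparsity term. This toy NumPy sketch uses an L1 penalty on the code, a common variant (the original formulation penalizes the KL divergence of average activations instead); all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)              # input vector
W_e = rng.normal(0, 0.3, (4, 16))    # encoder: 16 -> 4 bottleneck
W_d = rng.normal(0, 0.3, (16, 4))    # decoder: 4 -> 16

h = np.maximum(0.0, W_e @ x)         # bottleneck code (ReLU, nonnegative)
x_hat = W_d @ h                      # reconstruction
recon = ((x_hat - x) ** 2).sum()     # reconstruction loss: output ≈ input
penalty = np.abs(h).sum()            # sparsity term pushes most h_j toward 0
loss = recon + 0.1 * penalty         # both terms are minimized jointly in training
```

Minimizing the combined loss trades reconstruction accuracy against code sparsity, which is what forces the few active units to carry meaningful features.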
2008

Denoising Autoencoder Paper

Corrupt the input with noise, then train the network to reconstruct the CLEAN original. Forces robust features that capture true data structure.

Extends the Sparse Autoencoder with noise-based regularization; the "denoise to learn" principle directly inspires Diffusion Models.
Clean x → Add noise → x̃ → Encoder → Decoder → x̂ ≈ x (not x̃!) — learn to denoise
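The denoise-to-learn objective can be sketched as a tiny linear autoencoder trained by gradient descent: corrupt x, encode the corrupted input, but measure reconstruction error against the clean x. A NumPy sketch with made-up sizes and names (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))        # clean data x
W_e = rng.normal(0, 0.1, (3, 6))     # encoder: 6 -> 3 bottleneck
W_d = rng.normal(0, 0.1, (6, 3))     # decoder: 3 -> 6

def dae_step(X, W_e, W_d, lr=0.1, noise=0.5):
    Xn = X + noise * rng.normal(size=X.shape)  # corrupt: x -> x_tilde
    H = Xn @ W_e.T                             # encode the NOISY input
    Xh = H @ W_d.T                             # decode
    err = Xh - X                               # target is the CLEAN x, not x_tilde
    loss = (err ** 2).sum() / len(X)
    gWd = 2 * err.T @ H / len(X)               # gradient w.r.t. decoder weights
    gWe = 2 * (err @ W_d).T @ Xn / len(X)      # gradient w.r.t. encoder weights
    return loss, W_e - lr * gWe, W_d - lr * gWd

first = None
for _ in range(300):
    loss, W_e, W_d = dae_step(X, W_e, W_d)
    if first is None:
        first = loss
```

Because the target is the uncorrupted input, the network cannot simply copy what it sees; it must learn structure that survives the noise.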
2001

GBDT (Gradient Boosted Decision Trees) Paper

Friedman's gradient boosting — each new tree fits the RESIDUAL errors of the previous ensemble. Sequentially reduces loss by correcting current mistakes.

Replaces AdaBoost's sample reweighting with gradient-based residual fitting on Decision Trees; later engineered and optimized into XGBoost.
F_m(x) = F_{m-1}(x) + η · h_m(x) where h_m fits the residuals r = y - F_{m-1}(x)
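The update rule above can be sketched with depth-1 trees (stumps) under squared loss, where the negative gradient is exactly the residual r = y - F_{m-1}(x). A toy NumPy sketch; function names and sizes are illustrative:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to residuals r: the weak learner h_m."""
    best = (np.inf, x[0], r.mean(), r.mean())
    for t in x:                                  # try each sample value as a threshold
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, pl, pr = best
    return lambda xs: np.where(xs <= t, pl, pr)

def gbdt_fit(x, y, n_trees=100, eta=0.1):
    """F_m(x) = F_{m-1}(x) + eta * h_m(x), with h_m fit to r = y - F_{m-1}(x)."""
    F = np.full_like(y, y.mean())
    for _ in range(n_trees):
        r = y - F                  # residuals: the ensemble's current mistakes
        h = fit_stump(x, r)        # each new tree fits those residuals
        F = F + eta * h(x)         # shrunken additive update
    return F

x = np.linspace(0, 3, 30)
y = np.sin(2 * x)
F = gbdt_fit(x, y)
```

The shrinkage factor eta slows each correction so that many small trees cooperate instead of one tree overfitting.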
2003

NNLM (Neural Network Language Model) Paper

Bengio's breakthrough — predict the next word using a neural network over word embeddings. Each word gets a learned vector representation.

The first influential neural approach to language modeling, trained with backpropagation; its word embeddings lead to Word2Vec, and its next-word-prediction paradigm leads to GPT.
P(wₜ | wₜ₋₁, wₜ₋₂, ...) = softmax(W · tanh(C · [e(wₜ₋₁); e(wₜ₋₂); ...]))
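The formula above, with random untrained weights and toy sizes, looks like this in NumPy; all dimensions and variable names are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, H = 20, 8, 2, 16             # vocab size, embed dim, context length, hidden units

C = rng.normal(0, 0.1, (V, d))        # embedding table: one learned d-vector per word
W_h = rng.normal(0, 0.1, (H, n * d))  # hidden-layer weights over concatenated embeddings
W_o = rng.normal(0, 0.1, (V, H))      # output weights feeding the softmax

def next_word_probs(context):
    """P(w_t | context) = softmax(W_o · tanh(W_h · [e(w_{t-1}); e(w_{t-2})]))."""
    x = np.concatenate([C[w] for w in context])  # look up and concat context embeddings
    h = np.tanh(W_h @ x)
    logits = W_o @ h
    e = np.exp(logits - logits.max())            # numerically stable softmax
    return e / e.sum()

p = next_word_probs([3, 7])                      # distribution over the next word
```

Training adjusts C, W_h, and W_o jointly by backpropagation, so the embedding table C ends up encoding word similarity as a side effect of predicting the next word.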