2012
AlexNet Paper
Krizhevsky's CNN won the 2012 ImageNet challenge by roughly 10 percentage points in top-5 error. Deeper than LeNet, with ReLU activations, dropout, and GPU training. Proved deep learning works at scale.
227×227 → Conv(96) → Pool → Conv(256) → Pool → Conv(384) → Conv(384) → Conv(256) → FC → 1000 classes
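The arithmetic behind those shrinking feature maps follows the standard conv/pool output-size formula; a minimal sketch (the helper name is ours, not from the paper):

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a conv or pool layer: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

# AlexNet's first stage: 227x227 input, 11x11 conv stride 4, then 3x3 max-pool stride 2
n = out_size(227, 11, 4)  # 55
n = out_size(n, 3, 2)     # 27
```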
2014
Dropout Paper
Randomly "kill" neurons during training. Forces the network not to rely on any single neuron — like training an ensemble of sub-networks.
Key regularization technique first used in AlexNet; prevents overfitting in deep networks from ResNet to the Transformer.
During training: hᵢ ← hᵢ · mᵢ, mᵢ ~ Bernoulli(p) — each neuron kept with probability p, dropped with probability 1−p
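A minimal sketch of that formula in plain Python; the 1/p rescaling is the common "inverted dropout" variant — an assumption beyond the formula above — which keeps test-time behavior unchanged:

```python
import random

def dropout(h, p, training=True):
    """Keep each unit with probability p (zero it otherwise); scale survivors
    by 1/p (inverted dropout, an assumed convention) so the expected
    activation matches the no-dropout case."""
    if not training:
        return h[:]  # at test time dropout is a no-op
    return [x / p if random.random() < p else 0.0 for x in h]
```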
2013
Word2Vec Paper
Mikolov's word embeddings — learn vector representations where semantic relationships become arithmetic: King − Man + Woman ≈ Queen.
Simplifies the NNLM into efficient embedding training; its dense vectors become the input layer for ELMo, BERT, and all modern NLP.
king − man + woman ≈ queen — semantic arithmetic in vector space!
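The analogy can be demonstrated with toy hand-made vectors via nearest-neighbor search by cosine similarity (these 3-d embeddings are illustrative stand-ins, not trained word2vec output):

```python
import math

# Toy 3-d "embeddings" — invented for illustration, not real trained vectors
vecs = {
    "king":  [0.8, 0.9, 0.1],
    "man":   [0.7, 0.1, 0.1],
    "woman": [0.6, 0.1, 0.9],
    "queen": [0.7, 0.9, 0.9],
    "apple": [0.1, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman, then find the nearest word not in the query
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
nearest = max((w for w in vecs if w not in ("king", "man", "woman")),
              key=lambda w: cosine(vecs[w], target))  # "queen"
```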
2013
VAE (Variational Autoencoder) Paper
Kingma's generative model — encode data into a smooth latent distribution, sample, and decode. The latent space is continuous and interpolatable.
Encode: x → (μ, σ²) → sample z ~ N(μ,σ²) → Decode: z → x̂ — smooth generative latent space
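The sampling step is usually implemented with the reparameterization trick so gradients can flow through it; a minimal sketch, assuming the encoder outputs log σ² (a common convention, not stated above):

```python
import math
import random

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, 1).
    Randomness is isolated in eps, so mu and log_var stay differentiable."""
    eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps
```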
2014
GAN (Generative Adversarial Network) Paper
Goodfellow's brilliant idea — two networks competing: Generator creates fakes, Discriminator judges real vs fake. They push each other to improve.
min_G max_D [Eₓ log D(x) + E_z log(1 − D(G(z)))] — Generator vs Discriminator game
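Plugging a maximally confused discriminator (D = 1/2 everywhere) into that value function recovers the known equilibrium value −log 4; a quick numeric check using scalar stand-ins for the expectations:

```python
import math

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], with the two
    expectations collapsed to single scalar discriminator outputs."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# At equilibrium the discriminator can't tell real from fake: D = 1/2
v = gan_value(0.5, 0.5)  # ≈ -1.386, i.e. -log 4
```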
2014
Seq2Seq + Attention Paper
Bahdanau's attention mechanism — let the decoder LOOK BACK at relevant input parts at each step. Revolutionary for translation.
Extends the LSTM encoder-decoder with attention; this attention idea is generalized into the pure-attention Transformer.
attention(Q,K,V) = softmax(Q·Kᵀ) · V — focus on relevant input words for each output word
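A minimal sketch of that formula over plain Python lists (note: Bahdanau's original attention scores are additive; the dot-product form here follows the formula above, without the later √d scaling):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q.K^T).V: for each query, a weighted average of the values,
    weighted by how well the query matches each key."""
    out = []
    for q in Q:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        out.append([sum(w * v[j] for w, v in zip(scores, V))
                    for j in range(len(V[0]))])
    return out
```

With a query that strongly matches the first key, the output is almost exactly the first value vector — attention "focuses" there.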
2015
ResNet (Residual Network) Paper
Kaiming He's skip connections — learn the residual F(x) = H(x) − x, so output = F(x) + x. Gradients flow through the identity shortcut, enabling 152+ layers.
Solves the depth problem that stalled progress after AlexNet with skip connections; its residual design is adopted by the Transformer and ViT.
output = F(x) + x — skip connection lets gradient flow through identity shortcut
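The shortcut is one line of code; a minimal sketch where F is any layer function. When F outputs zeros, the block is exactly the identity, which is why adding layers can no longer hurt:

```python
def residual_block(x, F):
    """output = F(x) + x: the shortcut adds the input back elementwise."""
    fx = F(x)
    return [a + b for a, b in zip(fx, x)]

# A "do-nothing" residual branch leaves the input untouched
identity_out = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))  # [1.0, 2.0]
```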
2015
Batch Normalization Paper
Ioffe & Szegedy — normalize each layer's inputs to zero mean and unit variance. Speeds up training and allows higher learning rates.
Stabilizes deep network training for ResNet and beyond; adapted as LayerNorm in the Transformer and all modern architectures.
x̂ = (x − μ_batch) / √(σ²_batch + ε) · γ + β — normalize, then scale and shift (learnable)
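A minimal sketch of that normalization for a single feature across a batch (γ and β default to the identity transform here; in a real network they are learnable):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the batch to ~zero mean / unit variance,
    then apply the learnable scale (gamma) and shift (beta)."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]
```

After normalization the batch has mean beta and standard deviation ≈ gamma, whatever the input scale was.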