Section VI · 2012–2015

The Deep Learning Explosion

AlexNet triggers the revolution. GANs, attention, ResNets, and Word2Vec reshape AI forever.

2012

AlexNet Paper

Krizhevsky's CNN won ImageNet 2012 by roughly 10 percentage points (15.3% vs. 26.2% top-5 error). Deeper than LeNet, with ReLU, dropout, and GPU training. Proved deep learning works at scale.

Scales up the LeNet-style CNN with dropout and GPU power; its ImageNet victory ignites the deep learning era, leading to ResNet and ViT.
227×227 → Conv(96) → Pool → Conv(256) → Pool → Conv(384) → Conv(384) → Conv(256) → FC → 1000 classes
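The spatial sizes in the pipeline above all follow from the standard convolution output-size formula. A minimal sketch tracing AlexNet's published kernel sizes and strides through that formula:

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Standard conv/pool output-size formula: floor((n + 2p - k) / s) + 1
    return (size + 2 * pad - kernel) // stride + 1

s = 227
s = conv_out(s, 11, stride=4)   # Conv1, 96 filters  -> 55
s = conv_out(s, 3, stride=2)    # MaxPool            -> 27
s = conv_out(s, 5, pad=2)       # Conv2, 256 filters -> 27
s = conv_out(s, 3, stride=2)    # MaxPool            -> 13
s = conv_out(s, 3, pad=1)       # Conv3, 384 filters -> 13
s = conv_out(s, 3, pad=1)       # Conv4, 384 filters -> 13
s = conv_out(s, 3, pad=1)       # Conv5, 256 filters -> 13
s = conv_out(s, 3, stride=2)    # MaxPool            -> 6
print(s)  # 6 — i.e. 6×6×256 = 9216 features flattened into the FC layers
```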
2014

Dropout Paper

Randomly "kill" neurons during training. Forces the network to not rely on any single neuron — like training an ensemble of sub-networks.

Key regularization technique first used in AlexNet; prevents overfitting in all deep networks from ResNet to Transformer.
During training: hᵢ ← hᵢ · mᵢ, with mᵢ ~ Bernoulli(p) — each neuron kept with probability p, dropped with probability (1−p)
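A minimal sketch of the idea using "inverted" dropout, the common modern variant that scales survivors at train time so activations need no rescaling at test time (the original paper instead scaled weights at test time):

```python
import numpy as np

def dropout(h, keep_prob, rng, train=True):
    # Zero each unit with probability 1 - keep_prob; scale survivors
    # by 1/keep_prob so the expected activation is unchanged.
    if not train:
        return h  # at test time the full network is used as-is
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones(10_000)
out = dropout(h, keep_prob=0.5, rng=rng)
print(out.mean())  # ≈ 1.0 in expectation; roughly half the units are zeroed
```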
2013

Word2Vec Paper

Mikolov's word embeddings — learn vector representations where semantic relationships become arithmetic: King − Man + Woman ≈ Queen.

Simplifies NNLM into efficient embedding training; its dense vectors become the input layer for ELMo, BERT, and all modern NLP.
king − man + woman ≈ queen — semantic arithmetic in vector space!
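The "semantic arithmetic" can be demonstrated with a nearest-neighbor lookup by cosine similarity. The 2-D vectors below are hand-made toys for illustration (axis 0 ≈ "royalty", axis 1 ≈ "gender"), not trained Word2Vec embeddings:

```python
import numpy as np

# Toy embeddings, NOT trained vectors — illustration only.
emb = {
    "king":  np.array([0.9,  0.7]),
    "queen": np.array([0.9, -0.7]),
    "man":   np.array([0.1,  0.7]),
    "woman": np.array([0.1, -0.7]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen — the nearest word to king − man + woman
```

With real trained embeddings the query words themselves are usually excluded from the candidates before taking the nearest neighbor.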
2013

VAE (Variational Autoencoder) Paper

Kingma's generative model — encode data into a smooth latent distribution, sample, and decode. The latent space is continuous and interpolatable.

Adds probabilistic sampling to the Sparse Autoencoder, drawing on the variational ideas behind GMM+EM; its latent-space concept flows into Diffusion Models.
Encode: x → (μ, σ²) → sample z ~ N(μ,σ²) → Decode: z → x̂ — smooth generative latent space
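The sampling step uses the reparameterization trick from Kingma's paper: write z = μ + σ·ε with ε ~ N(0, I), so gradients can flow through the sampling back to the encoder. A minimal sketch (network weights omitted; only the sampling and the standard KL term are shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps, eps ~ N(0, I) — differentiable w.r.t. mu, log_var
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Encoder output for one input x (here: the standard normal, as a toy case)
mu = np.zeros(4)
log_var = np.zeros(4)          # sigma = 1
z = reparameterize(mu, log_var, rng)

# KL(N(mu, sigma^2) || N(0, I)) — the regularizer in the VAE objective
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(z.shape, kl)  # (4,) 0.0 — zero KL when the posterior equals the prior
```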
2014

GAN (Generative Adversarial Network) Paper

Goodfellow's brilliant idea — two networks competing: Generator creates fakes, Discriminator judges real vs fake. They push each other to improve.

A new generative paradigm rivaling VAE; leads to StyleGAN for photorealistic faces, later surpassed by Diffusion Models.
min_G max_D [E log D(x) + E log(1-D(G(z)))] — Generator vs Discriminator game
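The value function above can be evaluated directly. At the paper's theoretical equilibrium the discriminator outputs 1/2 everywhere, giving V = −log 4. A minimal sketch:

```python
import numpy as np

def gan_value(d_real, d_fake):
    # V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At optimum, D cannot tell real from fake: D(·) = 1/2 everywhere.
d_real = np.full(1000, 0.5)
d_fake = np.full(1000, 0.5)
print(gan_value(d_real, d_fake))  # ≈ -1.386 = -log(4)
```

The generator pushes this value up toward −log 4 (fooling D) while the discriminator pushes it down — the minimax game in one line.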
2014

Seq2Seq + Attention Paper

Bahdanau's attention mechanism — let the decoder LOOK BACK at relevant input parts at each step. Revolutionary for translation.

Extends LSTM encoder-decoder with attention; this attention idea is generalized into the pure-attention Transformer.
attention(Q,K,V) = softmax(Q·Kᵀ) · V — focus on relevant input words for each output word
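A minimal sketch of the dot-product form of that formula (Bahdanau's original attention is additive; the dot-product variant shown here is the one the Transformer later generalizes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each row of weights is a distribution over input positions (sums to 1);
    # the output is a weighted mix of the value vectors.
    weights = softmax(Q @ K.T)        # (n_queries, n_keys)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))       # 2 decoder steps
K = rng.standard_normal((5, 4))       # 5 encoder positions
V = rng.standard_normal((5, 4))
out = attention(Q, K, V)
print(out.shape)  # (2, 4): one context vector per decoder step
```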
2015

ResNet (Residual Network) Paper

He's skip connections — learn the residual F(x) = H(x) − x. Output = F(x) + x. Lets gradients flow through shortcuts, enabling 152+ layers.

Solves the depth problem of AlexNet with skip connections; its residual design is adopted by Transformer and ViT.
output = F(x) + x — skip connection lets gradient flow through identity shortcut
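A minimal sketch of one residual block (a simplified two-layer F(x), with ReLU after the addition as in the paper). Setting the weights to zero makes F(x) = 0, so the block collapses to the identity — the property that lets very deep stacks train:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # F(x) = W2 · relu(W1 · x); output = relu(F(x) + x) via identity shortcut
    return relu(W2 @ relu(W1 @ x) + x)

d = 8
x = np.ones(d)
# Zero weights => F(x) = 0, so the block is just the identity mapping:
W1 = np.zeros((d, d))
W2 = np.zeros((d, d))
print(np.array_equal(residual_block(x, W1, W2), x))  # True
```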
2015

Batch Normalization Paper

Ioffe & Szegedy — normalize each layer's inputs to zero mean and unit variance. Speeds up training and allows higher learning rates.

Stabilizes deep network training for ResNet and beyond; adapted as LayerNorm in Transformer and all modern architectures.
x̂ = (x − μ_batch) / √(σ²_batch + ε) · γ + β — normalize, then scale and shift (learnable)
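The formula above, as a minimal train-time sketch (running statistics for inference are omitted; γ and β would be learnable parameters):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch axis, then scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 3)) * 10 + 5     # shifted, scaled activations
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.std(axis=0))  # ≈ 0 and ≈ 1 per feature
```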