2012
AlexNet Paper
Krizhevsky's CNN won the 2012 ImageNet challenge by roughly 10 percentage points in top-5 error. Deeper than LeNet, with ReLU activations, dropout, and GPU training. Proved deep learning works at scale.
227×227 → Conv(96) → Pool → Conv(256) → Pool → Conv(384) → Conv(384) → Conv(256) → FC → 1000 classes
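The arithmetic behind those shrinking feature maps follows the standard conv/pool output-size formula; a minimal sketch (the helper name is ours, not from the paper):

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a conv or pool layer: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

# AlexNet's first stage: 227x227 input, 11x11 conv stride 4, then 3x3 max-pool stride 2
n = out_size(227, 11, 4)  # 55
n = out_size(n, 3, 2)     # 27
```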
2014
Dropout Paper
Randomly "kill" neurons during training. Forces the network not to rely on any single neuron — like training an ensemble of sub-networks.
Key regularization technique first used in AlexNet; prevents overfitting in deep networks from ResNet to the Transformer.
During training: hᵢ ← hᵢ · mᵢ, mᵢ ~ Bernoulli(p) — each neuron kept with probability p, dropped with probability 1−p
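A minimal sketch of that formula in plain Python; the 1/p rescaling is the common "inverted dropout" variant — an assumption beyond the formula above — which keeps test-time behavior unchanged:

```python
import random

def dropout(h, p, training=True):
    """Keep each unit with probability p (zero it otherwise); scale survivors
    by 1/p (inverted dropout, an assumed convention) so the expected
    activation matches the no-dropout case."""
    if not training:
        return h[:]  # at test time dropout is a no-op
    return [x / p if random.random() < p else 0.0 for x in h]
```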
2013
Word2Vec Paper
Mikolov's word embeddings — learn vector representations where semantic relationships become arithmetic: King − Man + Woman ≈ Queen.
Simplifies the NNLM into efficient embedding training; its dense vectors become the input layer for ELMo, BERT, and all modern NLP.
king − man + woman ≈ queen — semantic arithmetic in vector space!
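The analogy can be demonstrated with toy hand-made vectors via nearest-neighbor search by cosine similarity (these 3-d embeddings are illustrative stand-ins, not trained word2vec output):

```python
import math

# Toy 3-d "embeddings" — invented for illustration, not real trained vectors
vecs = {
    "king":  [0.8, 0.9, 0.1],
    "man":   [0.7, 0.1, 0.1],
    "woman": [0.6, 0.1, 0.9],
    "queen": [0.7, 0.9, 0.9],
    "apple": [0.1, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman, then find the nearest word not in the query
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
nearest = max((w for w in vecs if w not in ("king", "man", "woman")),
              key=lambda w: cosine(vecs[w], target))  # "queen"
```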
2013
VAE (Variational Autoencoder) Paper
Kingma's generative model — encode data into a smooth latent distribution, sample, and decode. The latent space is continuous and interpolatable.
Encode: x → (μ, σ²) → sample z ~ N(μ,σ²) → Decode: z → x̂ — smooth generative latent space
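The sampling step is usually implemented with the reparameterization trick so gradients can flow through it; a minimal sketch, assuming the encoder outputs log σ² (a common convention, not stated above):

```python
import math
import random

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, 1).
    Randomness is isolated in eps, so mu and log_var stay differentiable."""
    eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps
```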
2014
GAN (Generative Adversarial Network) Paper
Goodfellow's brilliant idea — two networks competing: Generator creates fakes, Discriminator judges real vs fake. They push each other to improve.
min_G max_D [Eₓ log D(x) + E_z log(1 − D(G(z)))] — Generator vs Discriminator game
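Plugging a maximally confused discriminator (D = 1/2 everywhere) into that value function recovers the known equilibrium value −log 4; a quick numeric check using scalar stand-ins for the expectations:

```python
import math

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], with the two
    expectations collapsed to single scalar discriminator outputs."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# At equilibrium the discriminator can't tell real from fake: D = 1/2
v = gan_value(0.5, 0.5)  # ≈ -1.386, i.e. -log 4
```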
2014
Seq2Seq + Attention Paper
Bahdanau's attention mechanism — let the decoder LOOK BACK at relevant input parts at each step. Revolutionary for translation.
Extends the LSTM encoder-decoder with attention; this attention idea is generalized into the pure-attention Transformer.
attention(Q,K,V) = softmax(Q·Kᵀ) · V — focus on relevant input words for each output word
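A minimal sketch of that formula over plain Python lists (note: Bahdanau's original attention scores are additive; the dot-product form here follows the formula above, without the later √d scaling):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q.K^T).V: for each query, a weighted average of the values,
    weighted by how well the query matches each key."""
    out = []
    for q in Q:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        out.append([sum(w * v[j] for w, v in zip(scores, V))
                    for j in range(len(V[0]))])
    return out
```

With a query that strongly matches the first key, the output is almost exactly the first value vector — attention "focuses" there.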
2015
ResNet (Residual Network) Paper
Kaiming He's skip connections — learn the residual F(x) = H(x) − x, so output = F(x) + x. Gradients flow through the identity shortcut, enabling 152+ layers.
Solves the depth problem that stalled progress after AlexNet with skip connections; its residual design is adopted by the Transformer and ViT.
output = F(x) + x — skip connection lets gradient flow through identity shortcut
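The shortcut is one line of code; a minimal sketch where F is any layer function. When F outputs zeros, the block is exactly the identity, which is why adding layers can no longer hurt:

```python
def residual_block(x, F):
    """output = F(x) + x: the shortcut adds the input back elementwise."""
    fx = F(x)
    return [a + b for a, b in zip(fx, x)]

# A "do-nothing" residual branch leaves the input untouched
identity_out = residual_block([1.0, 2.0], lambda v: [0.0] * len(v))  # [1.0, 2.0]
```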
2015
Batch Normalization Paper
Ioffe & Szegedy — normalize each layer's inputs to zero mean and unit variance. Speeds up training and allows higher learning rates.
Stabilizes deep network training for ResNet and beyond; adapted as LayerNorm in the Transformer and all modern architectures.
x̂ = (x − μ_batch) / √(σ²_batch + ε) · γ + β — normalize, then scale and shift (learnable)
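A minimal sketch of that normalization for a single feature across a batch (γ and β default to the identity transform here; in a real network they are learnable):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the batch to ~zero mean / unit variance,
    then apply the learnable scale (gamma) and shift (beta)."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]
```

After normalization the batch has mean beta and standard deviation ≈ gamma, whatever the input scale was.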