Section VII · 2016–2019

The Transformer Revolution

Self-attention, BERT, GPT — the architecture that changed everything. Language models become the new foundation of AI.

2016

XGBoost Paper

Chen & Guestrin's extreme gradient boosting — GBDT on steroids with regularization, column subsampling, and parallel tree construction. Dominated Kaggle leaderboards for years.

Optimizes GBDT with L1/L2 regularization on Decision Tree leaves; still the top choice for structured/tabular data even in the deep learning era.
obj = Σ loss(yᵢ, ŷᵢ) + Σ Ω(tree) — loss + regularization (prevents overfitting!)
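A minimal sketch of the objective's key consequence: with per-sample gradients gᵢ and hessians hᵢ, the L2 penalty λ on leaf weights gives a closed-form optimal leaf weight w* = −G/(H+λ), where G and H are sums over the samples in the leaf. Function names and the squared-loss assumption here are illustrative, not XGBoost's actual API.

```python
# Sketch of XGBoost's regularized leaf math (assuming squared loss,
# so g_i = y_hat_i - y_i and h_i = 1). Not the library's real API.

def optimal_leaf_weight(grads, hess, lam=1.0):
    # w* = -G / (H + lambda): the L2 term lambda shrinks leaf weights toward 0
    G, H = sum(grads), sum(hess)
    return -G / (H + lam)

def leaf_objective(grads, hess, lam=1.0):
    # Objective contribution of the leaf at its optimal weight: -0.5 * G^2 / (H + lambda)
    G, H = sum(grads), sum(hess)
    return -0.5 * G * G / (H + lam)

# Toy leaf holding three samples' gradients/hessians
grads = [0.5, -1.0, 0.2]
hess = [1.0, 1.0, 1.0]
w = optimal_leaf_weight(grads, hess)  # small |G| + regularization -> small weight
```

Split gains in XGBoost are differences of this `leaf_objective` before and after a split, which is why λ directly damps overfitting.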
2016

WaveNet Paper

DeepMind's autoregressive audio model — generates speech sample by sample using dilated causal convolutions to capture long-range patterns.

Applies CNN's convolutions to sequential audio generation; its autoregressive approach parallels GPT's left-to-right text generation.
P(x) = ∏ P(xₜ | x₁,...,xₜ₋₁) — predict each sample from all previous (causal)
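The dilated causal convolution behind that factorization can be sketched in a few lines: output t only sees inputs at t, t−d, t−2d, … (never the future), and stacking layers with dilations 1, 2, 4, … grows the receptive field exponentially. This is a toy 1-D version, not WaveNet's gated implementation.

```python
# Toy causal dilated convolution: y[t] = sum_k w[k] * x[t - k*dilation],
# zero-padded on the left so no output peeks at future samples.

def causal_dilated_conv(x, w, dilation):
    out = []
    for t in range(len(x)):
        s = 0.0
        for k, wk in enumerate(w):
            idx = t - k * dilation  # strictly in the past (or present)
            if idx >= 0:
                s += wk * x[idx]
        out.append(s)
    return out

# An impulse shows where the filter "looks": with dilation 2 the second
# tap lands two samples back.
y = causal_dilated_conv([1.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0], dilation=2)
```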
2017

Transformer Paper

"Attention Is All You Need" — replace RNNs entirely with self-attention. Each token attends to ALL others in parallel. Multi-head attention captures different relationships.

Generalizes Seq2Seq's attention into a pure-attention architecture with ResNet's skip connections; becomes the backbone of GPT, BERT, ViT, and virtually all modern AI.
Attention(Q,K,V) = softmax(QKᵀ/√d) · V — every token looks at every other token
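The formula above is compact enough to write out directly. This is a plain-Python sketch of single-head scaled dot-product attention over lists of token vectors (no batching, masking, or learned projections):

```python
import math

def attention(Q, K, V):
    # Q, K, V: lists of token vectors, all of dimension d
    d = len(Q[0])
    out = []
    for q in Q:
        # score this query against EVERY key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        weights = [e / Z for e in exps]
        # output = attention-weighted mix of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))])
    return out

# A zero query scores every key equally -> uniform weights -> mean of V
mixed = attention([[0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Multi-head attention just runs several of these in parallel on learned projections of Q, K, V and concatenates the results.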
2018

ELMo Paper

Embeddings from Language Models — word vectors that change based on context! "bank" gets different embeddings in "river bank" vs "bank account".

Makes Word2Vec context-aware using bidirectional LSTM; superseded by BERT's Transformer-based contextual embeddings.
ELMo(word) = γ·(s₀·e + s₁·h_forward + s₂·h_backward) — softmax-weighted sum of embedding and biLSTM layers, context-dependent
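The combination step is simple to sketch: softmax-normalize the layer weights sⱼ, take a weighted sum of the per-layer vectors for a word, and scale by γ. The layer vectors here are toy numbers standing in for the embedding and biLSTM hidden states:

```python
import math

def elmo_combine(layers, s_raw, gamma=1.0):
    # layers: per-layer vectors for ONE word (embedding, forward h, backward h, ...)
    # s_raw: one raw scalar weight per layer, normalized via softmax
    exps = [math.exp(s) for s in s_raw]
    Z = sum(exps)
    s = [e / Z for e in exps]
    dim = len(layers[0])
    return [gamma * sum(s[j] * layers[j][i] for j in range(len(layers)))
            for i in range(dim)]

# Toy: embedding layer vs. forward/backward LSTM states for one word.
# Equal raw weights -> each layer contributes 1/3.
vec = elmo_combine([[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]], [0.0, 0.0, 0.0])
```

Because the biLSTM states depend on the surrounding sentence, the same word gets a different `vec` in different contexts — the whole point of ELMo.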
2018

GPT-1 Paper

OpenAI's first Generative Pre-trained Transformer — pretrain on massive text with next-word prediction, then fine-tune for tasks. Proved unsupervised pretraining works.

Applies Transformer decoder to NNLM's next-word prediction paradigm; scales up to GPT-2 and ultimately GPT-3/ChatGPT.
P(wₜ | w₁...wₜ₋₁) via 12-layer Transformer decoder — left-to-right generation
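The left-to-right decoding loop is the same whether the model is a 12-layer decoder or, as in this sketch, a hypothetical bigram lookup table standing in for it — each step conditions only on what has already been generated:

```python
# Autoregressive (left-to-right) generation loop. A toy bigram table
# plays the role of P(w_t | w_1..w_{t-1}); GPT replaces it with a
# 12-layer Transformer decoder over the full prefix.
bigram = {"the": "cat", "cat": "sat", "sat": "down"}

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        nxt = bigram.get(tokens[-1])  # greedy pick of the next token
        if nxt is None:
            break
        tokens.append(nxt)  # the new token becomes context for the next step
    return tokens
```

Pretraining learns this conditional distribution from raw text; fine-tuning then reuses the same weights for downstream tasks.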
2018

BERT Paper

Google's Bidirectional Encoder — unlike GPT (left-to-right), BERT reads BOTH directions. Pretrained by masking random words and predicting them from full context.

Uses Transformer encoder with ELMo's bidirectional insight; dominates NLP understanding tasks, while GPT wins at generation.
[CLS] The [MASK] sat on the mat [SEP] → predict: cat (bidirectional context!)
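The masking side of that pretraining objective is easy to sketch: pick roughly 15% of tokens, replace them with [MASK], and remember the originals as prediction targets. (BERT additionally keeps some picks unchanged or swaps in random tokens; this simplified version masks every pick.)

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # Simplified BERT masking: ~15% of positions become [MASK];
    # the originals are the targets the model must predict.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
```

The model sees `masked` with full bidirectional context and is trained to recover each entry of `targets`.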
2018

StyleGAN Paper

NVIDIA's style-based generator — controls image generation at different scales: coarse features (pose, shape) and fine features (color, texture) via style vectors.

Advances GAN with style-based control at each resolution level; produces photorealistic faces, later surpassed by Diffusion Models.
z → Mapping Network → w → AdaIN at each layer — style control at every resolution
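The AdaIN step at each layer is where the style vector takes effect: normalize the layer's features to zero mean and unit variance, then rescale and shift them with scale/bias derived from w. A minimal per-channel sketch (real StyleGAN normalizes each channel of a feature map, not a flat list):

```python
import math

def adain(features, style_scale, style_bias, eps=1e-5):
    # Adaptive Instance Normalization: wipe out the incoming statistics,
    # then impose the style's own scale and bias.
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    std = math.sqrt(var + eps)
    return [style_scale * (f - mean) / std + style_bias for f in features]

# Whatever the input statistics were, the output's mean/spread now come
# from the style parameters alone.
styled = adain([1.0, 3.0], style_scale=1.0, style_bias=0.0)
```

Feeding different w vectors at coarse vs. fine layers is what separates pose/shape control from color/texture control.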
2019

GPT-2 Paper

10× larger than GPT-1 (1.5B params). Showed scaling produces emergent abilities — zero-shot performance without fine-tuning. "Too dangerous to release."

Scales GPT-1 10× to unlock zero-shot abilities; proves the scaling hypothesis that leads to GPT-3 (100× more) and GPT-4.
Same as GPT-1 but 10× bigger → emergent zero-shot abilities without fine-tuning!
2019

T5 (Text-to-Text Transfer Transformer) Paper

Google's unified framework — EVERY NLP task is "text in, text out". Translation? Summarization? Classification? One model, one format.

Unifies Transformer encoder-decoder for ALL tasks via text prefixes; its "everything is text" paradigm merges with GPT-2's prompting to create the modern instruction-following AI.
"translate English to German: That is good" → "Das ist gut" — everything is text-to-text
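The input side of the "everything is text" trick is just string formatting: prepend a task prefix and let one seq2seq model emit the answer as text. The prefix strings below follow the paper's examples; the function name and task keys are illustrative:

```python
def to_text2text(task, text):
    # Every NLP task becomes "prefix: input" -> the model's output is the
    # answer, also as plain text (a translation, a summary, a class label).
    prefixes = {
        "translate_en_de": "translate English to German",
        "summarize": "summarize",
        "cola": "cola sentence",  # grammatical-acceptability classification
    }
    return f"{prefixes[task]}: {text}"

prompt = to_text2text("translate_en_de", "That is good")
```

Because classification labels are also emitted as text, the same cross-entropy training loop covers every task — the format is the framework.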