2016
Chen & Guestrin's extreme gradient boosting — GBDT on steroids with regularization, column subsampling, and parallelized tree construction. Dominated Kaggle for years.
Optimizes GBDT with L1/L2 regularization on Decision Tree leaves; still the top choice for structured/tabular data even in the deep learning era.
obj = Σᵢ loss(yᵢ, ŷᵢ) + Σₖ Ω(fₖ), with Ω(f) = γT + ½λ‖w‖² — loss plus a penalty on tree size and leaf weights (prevents overfitting!)
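A minimal sketch of where that regularization bites (not the real XGBoost API — just the math): for squared-error loss, each example's gradient is g = prediction − y and its hessian is 1, and the optimal weight of a leaf is w* = −G / (H + λ), so larger λ shrinks the leaf's correction toward zero.

```python
# Hypothetical helper illustrating XGBoost's L2-regularized leaf weight.
# For squared-error loss: g_i = pred_i - y_i, h_i = 1, so w* = -G / (H + lambda).

def optimal_leaf_weight(residuals, lam):
    """residuals[i] = current_prediction_i - y_i for the examples in one leaf."""
    G = sum(residuals)   # sum of gradients
    H = len(residuals)   # sum of hessians (1 per example for squared loss)
    return -G / (H + lam)

# A leaf whose examples are all under-predicted by 2:
print(optimal_leaf_weight([-2.0, -2.0, -2.0], lam=0.0))  # 2.0 — full correction
print(optimal_leaf_weight([-2.0, -2.0, -2.0], lam=3.0))  # 1.0 — shrunk by L2
```

The shrinkage is exactly the "prevents overfitting" term in the objective: the penalty keeps any single leaf from chasing its residuals too aggressively.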
2016
DeepMind's autoregressive audio model — generates speech sample by sample using dilated causal convolutions to capture long-range patterns.
Applies CNN's convolutions to sequential audio generation; its autoregressive approach parallels GPT's left-to-right text generation.
P(x) = ∏ P(xₜ | x₁,...,xₜ₋₁) — predict each sample from all previous (causal)
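The causal constraint above is easy to see in code. A toy sketch (pure Python, hypothetical helper) of a dilated causal convolution: output[t] mixes x[t], x[t−d], x[t−2d], …, and never a future sample.

```python
# Toy dilated causal convolution: taps reach back t - k*dilation, never forward.

def dilated_causal_conv(x, weights, dilation):
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:          # zero-pad the past; never peek ahead
                acc += w * x[idx]
        out.append(acc)
    return out

# Stacking 2-tap layers with dilations 1, 2, 4, 8 gives a receptive field of
# 16 samples — this doubling is how WaveNet captures long-range structure cheaply.
x = [1.0, 0.0, 0.0, 0.0, 0.0]
print(dilated_causal_conv(x, [1.0, 0.5], dilation=2))  # [1.0, 0.0, 0.5, 0.0, 0.0]
```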
2018
Embeddings from Language Models — word vectors that change based on context! "bank" gets different embeddings in "river bank" vs "bank account".
Makes Word2Vec context-aware using a bidirectional LSTM; superseded by BERT's Transformer-based contextual embeddings.
ELMo(word) = γ · (s₀·e + s₁·h_forward + s₂·h_backward) — softmax-weighted mix of the static embedding and biLSTM states, so the vector depends on context
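A minimal sketch of that mixing (names hypothetical, layers shrunk to toy 2-d vectors): softmax-normalize the task-specific scalars s, take a weighted sum of the token embedding and the forward/backward states, scale by γ.

```python
import math

# Toy ELMo-style mixing: softmax(s) weights the layers, gamma scales the result.

def elmo_embed(layers, s, gamma):
    """layers: equal-length vectors; layer 0 is the context-free embedding."""
    exps = [math.exp(v) for v in s]
    z = sum(exps)
    weights = [v / z for v in exps]          # softmax(s)
    dim = len(layers[0])
    return [gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
            for i in range(dim)]

e          = [1.0, 0.0]   # static token embedding ("bank", out of context)
h_forward  = [0.0, 1.0]   # forward LSTM state — differs in "river bank" vs "bank account"
h_backward = [1.0, 1.0]   # backward LSTM state
print(elmo_embed([e, h_forward, h_backward], s=[0.0, 0.0, 0.0], gamma=1.0))
```

Because h_forward and h_backward change with the sentence, the mixed vector does too — that is the whole trick.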
2018
OpenAI's first Generative Pre-trained Transformer — pretrain on massive text with next-word prediction, then fine-tune for tasks. Proved unsupervised pretraining works.
P(wₜ | w₁...wₜ₋₁) via 12-layer Transformer decoder — left-to-right generation
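The left-to-right factorization is just the chain rule, and it holds for any conditional model. A toy sketch with a count-based bigram model standing in for the 12-layer Transformer:

```python
from collections import Counter

# A bigram "language model" from counts — a stand-in for GPT's Transformer —
# still scores a sequence as a left-to-right product of conditionals.

corpus = "the cat sat on the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_next(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]   # P(word | prev)

def p_sequence(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_next(prev, word)                     # chain rule, left to right
    return p

print(p_sequence(["the", "cat", "sat"]))  # 0.5 * 1.0 = 0.5
```

GPT's pretraining objective is exactly maximizing this product (its log) over massive text, with the Transformer supplying much richer conditionals than a bigram table.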
2018
Google's Bidirectional Encoder — unlike GPT (left-to-right), BERT reads BOTH directions. Pretrained by masking random words and predicting them from full context.
Uses a Transformer encoder with ELMo's bidirectional insight; dominates NLP understanding tasks, while GPT wins at generation.
[CLS] The [MASK] sat on the mat [SEP] → predict: cat (bidirectional context!)
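A simplified sketch of preparing that masked-LM input (the real recipe masks ~15% of tokens, and of those 80% become [MASK], 10% a random token, 10% stay unchanged — omitted here for brevity):

```python
import random

# Simplified BERT-style masking: pick tokens to hide, remember the answers.

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok           # the model must predict this token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return ["[CLS]"] + masked + ["[SEP]"], targets

sentence = "the cat sat on the mat".split()
print(mask_tokens(sentence))
```

The model sees the full sentence around each [MASK] — words both before and after — which is what "bidirectional context" means in practice.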
2018
NVIDIA's style-based generator — controls image generation at different scales: coarse features (pose, shape) and fine features (color, texture) via style vectors.
Advances the GAN with style-based control at each resolution level; produces photorealistic faces, later surpassed by Diffusion Models.
z → Mapping Network → w → AdaIN at each layer — style control at every resolution
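AdaIN is the mechanism doing the style injection. A pure-Python toy on a 1-D "channel": normalize the content features to zero mean and unit variance, then let the style vector supply a new scale and bias — so the style, not the content, dictates the statistics at that layer.

```python
import math

# Toy AdaIN: the style's (scale, bias) replaces the content's (mean, std).

def adain(features, style_scale, style_bias, eps=1e-8):
    mu = sum(features) / len(features)
    var = sum((f - mu) ** 2 for f in features) / len(features)
    std = math.sqrt(var + eps)
    return [style_scale * (f - mu) / std + style_bias for f in features]

# Whatever statistics the content had (mean 5, std ~1.63), the style now rules:
out = adain([3.0, 5.0, 7.0], style_scale=2.0, style_bias=10.0)
print(out)  # mean ~10, std ~2
```

Applying this at coarse layers shifts pose and shape; at fine layers, color and texture — which is how StyleGAN separates scales of control.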
2019
Roughly 10× larger than GPT-1 (1.5B vs. 117M params). Showed scaling produces emergent abilities — zero-shot performance without fine-tuning. "Too dangerous to release."
Scales GPT-1 roughly 10× to unlock zero-shot abilities; proves the scaling hypothesis that leads to GPT-3 (100× more) and GPT-4.
Same as GPT-1 but 10× bigger → emergent zero-shot abilities without fine-tuning!
2019
T5 (Text-to-Text Transfer Transformer)
Google's unified framework — EVERY NLP task is "text in, text out". Translation? Summarization? Classification? One model, one format.
Unifies the Transformer encoder-decoder for ALL tasks via text prefixes; its "everything is text" paradigm merges with GPT-2's prompting to create the modern instruction-following AI.
"translate English to German: That is good" → "Das ist gut" — everything is text-to-text
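The unification is purely a data-formatting trick, which a few lines make concrete. A sketch (helper name hypothetical; the translation and "cola" prefixes are the ones T5 actually uses):

```python
# Every task becomes "prefix: input text" -> "output text" — even classification,
# whose labels are emitted as literal words.

def to_text_to_text(task, text, label=None):
    prefixes = {
        "translate_en_de": "translate English to German: ",
        "summarize":       "summarize: ",
        "cola":            "cola sentence: ",   # grammaticality classification
    }
    source = prefixes[task] + text
    target = label                              # e.g. "acceptable" / "unacceptable"
    return source, target

print(to_text_to_text("translate_en_de", "That is good"))
# ('translate English to German: That is good', None)
print(to_text_to_text("cola", "The cat sat.", label="acceptable"))
```

One model, one loss (next-token cross-entropy on the target text), every task — the format is the framework.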