Section VIII · 2020–2024

Foundation Models & The AGI Era

GPT-3, diffusion models, ChatGPT, and multimodal AI — the models reshaping our world right now.

2020

GPT-3 Paper

175B parameters — 100× GPT-2. Scale alone creates emergent abilities: few-shot learning, reasoning, code generation. No fine-tuning needed.

Scales GPT-2 100× to unlock in-context learning; its few-shot paradigm is refined by ChatGPT's RLHF into the modern AI assistant.
175B params | zero/one/few-shot via in-context learning — examples in the prompt = "programming"
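In-context learning means the "program" is just the prompt. A minimal sketch of how a few-shot prompt is assembled (a hypothetical helper for illustration, not OpenAI's code):

```python
def few_shot_prompt(task, examples, query):
    """Few-shot prompting: the examples ARE the programming.
    No gradient updates, no fine-tuning -- just conditioning on context."""
    lines = [task, ""]
    for x, y in examples:
        lines += [f"Input: {x}", f"Output: {y}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("cat", "chat")],
    "dog",
)
```

The model completes the final "Output:" line; swap the examples and the same frozen weights perform a different task.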
2020

ViT (Vision Transformer) Paper

Cut an image into 16×16 patches, treat each as a "token", feed into a standard Transformer. No convolutions needed — attention alone works for vision!

Applies Transformer directly to image patches (replacing CNN); enables unified vision-language models like CLIP and GPT-4's visual understanding.
Image → 16×16 patches → Linear projection → + Position embedding → Transformer Encoder → Class
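The patch-to-token step is pure reshaping. A NumPy sketch with the paper's 16×16 patch size (the learned linear projection and position embeddings would follow):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping patch x patch squares,
    each flattened into one vector -- the ViT's "words"."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    rows, cols = H // patch, W // patch
    tokens = (image
              .reshape(rows, patch, cols, patch, C)
              .transpose(0, 2, 1, 3, 4)   # group by (row, col) patch grid
              .reshape(rows * cols, patch * patch * C))
    return tokens

img = np.random.rand(224, 224, 3)
tokens = patchify(img)   # (196, 768): a 14x14 grid, 16*16*3 dims per token
```

A 224×224 image becomes a sequence of just 196 tokens, short enough for standard self-attention.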
2021

CLIP Paper

Contrastive Language-Image Pretraining — learns to match images with text. Trained on 400M image-text pairs. Enables zero-shot image classification!

Pairs ViT's image encoder with a Transformer text encoder via contrastive learning; provides the text-image alignment that powers Diffusion and Sora.
maximize similarity(image_embed, matching_text_embed) — contrastive learning across modalities
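The objective is a symmetric contrastive (InfoNCE) loss over a batch: matching pairs sit on the diagonal of the similarity matrix, everything else is a negative. A NumPy sketch assuming precomputed embeddings:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N image-text pairs."""
    # L2-normalize so the dot product equals cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(img))              # correct match = diagonal

    def xent(l):  # cross-entropy toward the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Perfectly aligned embeddings drive the loss toward zero; mismatched pairs are pushed apart, which is what makes zero-shot classification ("a photo of a dog" vs. "a photo of a cat") work.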
2020

Diffusion Models Paper

Gradually add noise to an image until destroyed, then train a neural net to reverse the process step by step. Generate by starting from pure noise and denoising.

Revives Denoising Autoencoder's "learn to denoise" principle; combined with CLIP text guidance, surpasses GAN/StyleGAN as the dominant generative paradigm (Stable Diffusion, DALL-E, Midjourney).
Forward: x₀ → x₁ → ... → x_T (pure noise) | Reverse: x_T → ... → x₁ → x₀ (image!)
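The forward process has a closed form, so training can jump straight to any noise level t in one step. A NumPy sketch using DDPM's linear beta schedule:

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t = prod(1 - beta_s) for s <= t."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    eps = np.random.randn(*x0.shape)          # the noise the net must predict
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # DDPM's linear schedule
x0 = np.random.rand(8, 8)                     # toy "image"
x_mid, eps = forward_diffuse(x0, 500, betas)  # halfway to pure noise
```

By t = T, alpha_bar is essentially zero, so x_T is indistinguishable from pure Gaussian noise; generation runs the learned denoiser backward from there.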
2022

ChatGPT (RLHF) Paper

The model that changed everything. The training recipe (described in OpenAI's InstructGPT paper) has three steps: (1) SFT on human demonstrations, (2) train a reward model on human preference rankings, (3) optimize the policy with PPO.

Applies RLHF (Reinforcement Learning from Human Feedback) to GPT-3; its alignment approach is refined by Claude's Constitutional AI into scalable AI safety.
Step 1: SFT → Step 2: Reward Model (human prefs) → Step 3: PPO (maximize reward) = RLHF
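Step 2's reward model is typically trained with a Bradley-Terry pairwise loss on human preference pairs. A minimal sketch:

```python
import math

def reward_pair_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): minimized when the reward
    model scores the human-preferred response above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

Once trained, this scalar reward becomes the objective that PPO maximizes in step 3 (usually with a KL penalty keeping the policy close to the SFT model).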
2023

LLaMA Paper

Meta's open-source LLM — smaller models on MORE data match much larger ones. LLaMA-13B matches GPT-3 (175B)! Sparked the open-source revolution.

Pushes Chinchilla's "more data" lesson past the compute-optimal point (small models, heavily trained, cheap to serve); proves efficient training beats brute-force scaling, spawning Alpaca, Vicuna, Mistral, and the open-source LLM ecosystem.
Key insight: 13B params + 1T tokens > 175B params + 300B tokens — data matters more than size!
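The arithmetic behind the key insight, using the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter:

```python
def tokens_per_param(n_tokens, n_params):
    """LLaMA's bet: train far PAST the ~20 tokens/param compute-optimal
    ratio, because a small, heavily-trained model is cheap to serve."""
    return n_tokens / n_params

gpt3_ratio  = tokens_per_param(300e9, 175e9)  # ~1.7: badly undertrained
llama_ratio = tokens_per_param(1e12, 13e9)    # ~77:  over-trained on purpose
```

GPT-3 sits far below the rule-of-thumb ratio, LLaMA far above it, which is why the 13B model can punch so far above its parameter count.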
2023

GPT-4 Paper

OpenAI's multimodal model — accepts text AND images. Passes the bar exam, writes code, analyzes charts. Rumored MoE at ~1.8T parameters.

Combines GPT-3's language power with ViT's visual understanding into a multimodal system; uses Mixture-of-Experts for efficiency at massive scale.
Multimodal: text + image → unified understanding → text output | MoE: only activate relevant experts
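A toy sketch of the rumored sparse-routing idea: a gate scores every expert but only the top-k actually run, so most parameters sit idle for any given token. (Illustrative only; GPT-4's architecture is not public.)

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sparse Mixture-of-Experts: score all experts, run only the top-k,
    and mix their outputs with renormalized gate weights."""
    scores = x @ gate_w                        # one score per expert
    top = np.argsort(scores)[-k:]              # indices of the k best
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                               # softmax over the chosen k
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy usage: 4 "experts" that just scale their input by 0, 1, 2, 3
experts = [lambda v, s=s: s * v for s in range(4)]
x = np.ones(3)
gate_w = np.eye(3, 4)                          # gate scores = [1, 1, 1, 0]
out = moe_forward(x, gate_w, experts)          # experts 1 and 2, 0.5 each
```

The appeal: parameter count (and capacity) grows with the number of experts, while per-token compute only grows with k.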
2024

Claude Paper

Anthropic's AI trained with Constitutional AI — instead of just human feedback, Claude follows principles to self-critique and improve. Emphasizes helpfulness, honesty, and harmlessness.

Improves ChatGPT's RLHF with Constitutional AI (RLAIF): AI self-critiques against written principles → scales alignment without massive human labeling.
RLHF + Constitutional AI: self-critique against principles → RLAIF (AI feedback from constitution)
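The self-critique loop can be sketched with a hypothetical `model(text) -> text` callable (names and prompt wording here are illustrative, not Anthropic's):

```python
def constitutional_revision(model, prompt, constitution, rounds=1):
    """RLAIF sketch: the model critiques its own draft against each
    written principle, then revises. The (draft, revision) pairs become
    AI-generated preference data -- no human labeler in the loop."""
    draft = model(prompt)
    for _ in range(rounds):
        for principle in constitution:
            critique = model(f"Critique this reply against the principle "
                             f"'{principle}':\n{draft}")
            draft = model(f"Rewrite the reply to address this critique:\n"
                          f"{critique}\n{draft}")
    return draft
```

Because the feedback comes from the model itself guided by a short written constitution, alignment data scales with compute rather than with human annotation hours.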
2024

Sora Paper

OpenAI's video generation — up to 60s of high-fidelity video from text. Uses a Diffusion Transformer (DiT) on spacetime patches. Shows emerging 3D consistency and object permanence, though its physics is still imperfect.

Merges Diffusion Models with Transformer attention on spacetime patches (extending ViT to video); represents the frontier of generative AI.
Text prompt → Spacetime patches → Diffusion Transformer → Video frames (temporal consistency)
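Spacetime patches extend ViT's reshape trick by one time axis. A NumPy sketch (the patch and tube sizes below are illustrative; the Sora report doesn't publish exact values):

```python
import numpy as np

def spacetime_patchify(video, patch=16, frames_per_tube=4):
    """Chop a (T, H, W, C) video into frames_per_tube x patch x patch
    tubes, each flattened into one token that a Diffusion Transformer
    can denoise jointly across space AND time."""
    T, H, W, C = video.shape
    assert T % frames_per_tube == 0 and H % patch == 0 and W % patch == 0
    t, r, c = T // frames_per_tube, H // patch, W // patch
    tokens = (video
              .reshape(t, frames_per_tube, r, patch, c, patch, C)
              .transpose(0, 2, 4, 1, 3, 5, 6)  # group by (time, row, col)
              .reshape(t * r * c, frames_per_tube * patch * patch * C))
    return tokens

clip = np.zeros((8, 32, 32, 3))       # tiny toy clip: 8 frames of 32x32 RGB
tokens = spacetime_patchify(clip)     # (8, 3072): 2x2x2 tubes
```

Because attention spans the time axis too, each denoising step sees past and future patches at once, which is what enforces temporal consistency.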