Section I · 1800s–1960s

The Dawn of Statistical Learning

From Gauss's least squares to Rosenblatt's perceptron — mathematics lays the foundation for machine intelligence.

1805

Linear Regression Paper

Legendre & Gauss's method of least squares — fit a straight line to scattered data by minimizing the sum of squared errors.

The mathematical foundation for all optimization-based learning; directly leads to Adaline and modern Backpropagation.
y = wx + b   minimize Σ(yᵢ - (wxᵢ + b))²
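The closed-form fit can be sketched in a few lines of plain Python (the function name `least_squares_fit` and the sample points are illustrative, not from the original paper):

```python
def least_squares_fit(xs, ys):
    # Closed-form least squares: w = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)², b = ȳ − w·x̄
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

# Points lying exactly on y = 2x + 1, so the fit recovers w = 2, b = 1
w, b = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
```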
1812

Bayes' Theorem Paper

Update your belief based on new evidence: prior × likelihood → posterior. Stated by Bayes in 1763; Laplace's 1812 treatise formalized it into the core framework of probabilistic inference.

The foundation of probabilistic reasoning; directly enables Naive Bayes classifiers and GMM+EM clustering.
P(A|B) = P(B|A)·P(A) / P(B)   posterior = likelihood × prior / evidence
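One Bayesian update, sketched in plain Python on a hypothetical diagnostic-test example (the prior and likelihood values are made up for illustration):

```python
def posterior(prior, likelihood, likelihood_not):
    # P(A|B) = P(B|A)·P(A) / P(B), with the evidence P(B)
    # expanded by the law of total probability
    evidence = likelihood * prior + likelihood_not * (1 - prior)
    return likelihood * prior / evidence

# 1% base rate, 90% true-positive rate, 5% false-positive rate
p = posterior(prior=0.01, likelihood=0.90, likelihood_not=0.05)
```

Even with a 90% accurate test, the small prior keeps the posterior near 15% — the classic illustration of why the prior matters.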
1906

Markov Chain Paper

Memoryless state transitions — the next state depends only on the current state. Cornerstone of HMM, MCMC, and PageRank.

Inspires sequential modeling; its memoryless limitation motivates RNN (which adds memory) and Boltzmann Machine sampling.
P(Xₙ₊₁|Xₙ,Xₙ₋₁,...) = P(Xₙ₊₁|Xₙ)   memoryless property
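The memoryless property can be sketched with a tiny two-state chain in plain Python (the weather states and transition probabilities are illustrative):

```python
import random

# Transition table: each row depends only on the current state
P = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def step(state, rng):
    # Sample the next state from the current state's row — no history needed
    r = rng.random()
    cum = 0.0
    for nxt, prob in P[state].items():
        cum += prob
        if r < cum:
            return nxt
    return nxt  # guard against floating-point rounding

rng = random.Random(0)
state = "sunny"
for _ in range(5):
    state = step(state, rng)
```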
1958

Perceptron Paper

Rosenblatt's perceptron — the first artificial neuron that could learn from data. It computes a weighted sum of inputs and outputs 1 if the sum exceeds a threshold, 0 otherwise.

Builds on Linear Regression with a step activation; its linear-only limitation is fixed by Adaline and later Backpropagation.
y = step(w·x + b)   update: w += η(target - y)·x
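Rosenblatt's update rule, sketched in plain Python on an AND gate — a linearly separable toy problem, so the perceptron convergence theorem applies (`train_perceptron` and the hyperparameters are illustrative):

```python
def train_perceptron(samples, eta=0.1, epochs=20):
    # samples: list of (input tuple, target in {0, 1})
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            # step activation: 1 if w·x + b above threshold, else 0
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - y              # update: w += η(target − y)·x
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
            b += eta * err
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
```

Note the error is discrete (target − y is −1, 0, or 1), so the boundary jumps in steps — exactly the limitation Adaline smooths out.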
1960

Adaline Paper

Widrow & Hoff's adaptive linear neuron — unlike the Perceptron, Adaline measures the error on the raw linear output z, before the step activation, enabling true gradient descent (the LMS rule). The decision boundary glides smoothly into place!

Improves on Perceptron by using continuous gradient descent instead of discrete updates; the LMS rule directly inspires Backpropagation.
z = w·x + b   Δw = η(target − z)·x   error on raw output, not after step()
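The LMS rule can be sketched in plain Python; the key difference from the Perceptron is that the error is continuous, computed on z with no step function inside the loop (the toy data and hyperparameters are illustrative):

```python
def train_adaline(samples, eta=0.1, epochs=100):
    # LMS rule: error measured on the raw output z = w·x + b, not after step()
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b  # raw linear output
            err = target - z                              # continuous error
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
            b += eta * err
    return w, b

# Targets generated by the line t = x1 + x2 − 1, so LMS can drive error to ~0
data = [((0, 0), -1), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_adaline(data)
```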