

1. Matrix Dimensions Formulation

In standard textbooks, the design matrix $\mathbf{X}$ is often defined as $N \times D$, where each row is a sample. However, in this problem, $\Phi$ is defined as $D \times N$, where each column is a sample $\phi(x_i)$.

  • $\Phi = [\phi(x_1), \dots, \phi(x_n)]$ has dimension $D \times N$.
  • $y$ has dimension $N \times 1$.
  • $\theta$ has dimension $D \times 1$.

So the linear model prediction for all samples is $\Phi^T \theta$ (of shape $(N \times D)(D \times 1) = N \times 1$), which matches the dimension of $y$. This is why the term is $\|y - \Phi^T \theta\|^2$.
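
As a quick sanity check on these shapes, here is a minimal NumPy sketch; the variable names and random data are illustrative assumptions, not part of the original problem:

```python
import numpy as np

# Illustrative shapes only: N samples, D features, columns of Phi are feature vectors.
N, D = 5, 3
Phi = np.random.randn(D, N)      # each column is phi(x_i), so Phi is D x N
theta = np.random.randn(D, 1)    # parameter vector, D x 1
y = np.random.randn(N, 1)        # targets, N x 1

y_hat = Phi.T @ theta            # (N x D)(D x 1) = N x 1, same shape as y
print(y_hat.shape)               # (5, 1)
```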

2. Geometric Interpretation of Projection

The equation $\Phi \Phi^T \theta = \Phi y$ can be rewritten as:

$$\sum_{i=1}^n \phi(x_i) \phi(x_i)^T \theta = \sum_{i=1}^n \phi(x_i) y_i$$

The error vector is $e = y - \hat{y} = y - \Phi^T \theta$. At the optimal solution, the error vector $e$ is orthogonal to the column space of the design matrix (the row space of $\Phi$ in this notation). Mathematically, $\Phi e = 0 \implies \Phi(y - \Phi^T \theta) = 0$, which leads directly to the normal equation.
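
Here is a small NumPy sketch (with made-up random data) that solves the normal equation and confirms that the residual is orthogonal to the rows of $\Phi$:

```python
import numpy as np

# Illustrative sketch: solve Phi Phi^T theta = Phi y, then check Phi e ~ 0.
rng = np.random.default_rng(0)
N, D = 20, 4
Phi = rng.standard_normal((D, N))   # columns are feature vectors phi(x_i)
y = rng.standard_normal((N, 1))

theta_hat = np.linalg.solve(Phi @ Phi.T, Phi @ y)   # normal-equation solution
e = y - Phi.T @ theta_hat                           # residual vector
print(Phi @ e)                                      # ~ 0: residual orthogonal to rows of Phi
```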

3. "Least Squares" Intuition

We want to find a line (or polynomial curve) that passes closest to all points. "Closest" is defined by the vertical distance (residual) between the point and the line. We square these distances so that positive and negative errors don't cancel each other out, and to penalize large errors more heavily. Minimizing this sum of squared errors gives us the "Least Squares" solution.
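
To make this concrete, the sketch below fits a straight line to a handful of made-up points by solving the same normal equation and reports the resulting sum of squared residuals; the data and variable names are purely illustrative:

```python
import numpy as np

# Fit y ~ a*x + b by minimizing sum_i (y_i - (a*x_i + b))^2 on toy data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.9])

Phi = np.vstack([x, np.ones_like(x)])          # D=2 features per sample: [x_i, 1]
theta = np.linalg.solve(Phi @ Phi.T, Phi @ y)  # normal equation gives [a, b]
residuals = y - Phi.T @ theta
print(theta, np.sum(residuals**2))             # fitted slope/intercept and the minimized SSE
```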