
Answer: Least Squares Regression

Prerequisite Knowledge

  • Matrix Calculus: Specifically, derivatives of vector norms and quadratic forms (numerically checked in the sketch after this list).
    • $\frac{\partial}{\partial x} (Ax - b)^T (Ax - b) = 2A^T(Ax - b)$, the gradient of a squared residual.
    • $\frac{\partial}{\partial x} x^T A x = (A + A^T)x$, which reduces to $2Ax$ when $A$ is symmetric.
  • Linear Algebra: Concepts of rank, inverse, and transpose.
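
As a quick sanity check of these two identities, the sketch below compares the closed-form gradients with central finite differences on random matrices. NumPy is assumed, and the matrix sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))   # arbitrary (non-symmetric) A
b = rng.standard_normal(5)
x = rng.standard_normal(4)

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# d/dx (Ax - b)^T (Ax - b) = 2 A^T (Ax - b)
f1 = lambda v: (A @ v - b) @ (A @ v - b)
print(np.allclose(num_grad(f1, x), 2 * A.T @ (A @ x - b), atol=1e-4))  # True

# d/dx x^T M x = (M + M^T) x, for a square (here 4x4) matrix M
M = rng.standard_normal((4, 4))
f2 = lambda v: v @ M @ v
print(np.allclose(num_grad(f2, x), (M + M.T) @ x, atol=1e-4))  # True
```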

1. Matrix Formulation

We are given:

  • $y = [y_1, \dots, y_n]^T \in \mathbb{R}^n$, the vector of targets
  • $\Phi = [\phi(x_1), \dots, \phi(x_n)] \in \mathbb{R}^{D \times n}$, the feature matrix whose $i$-th column is $\phi(x_i)$
  • $\theta \in \mathbb{R}^D$, the parameter vector

We can write the vector of predicted values as:

$$\hat{y} = \left[ \begin{matrix} \phi(x_1)^T \theta \\ \vdots \\ \phi(x_n)^T \theta \end{matrix} \right] = \left[ \begin{matrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{matrix} \right] \theta = \Phi^T \theta$$

Note that $\Phi^T$ is an $n \times D$ matrix.

The sum-squared-error is:

$$E(\theta) = \sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2 = \| y - \Phi^T \theta \|^2$$
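
To make the notation concrete, here is a small sketch, assuming NumPy and a hypothetical feature map $\phi(x) = (1, x, x^2)$ (so $D = 3$), that builds $\Phi$ column by column and checks that the summed and vectorized forms of $E(\theta)$ agree.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map phi(x) = (1, x, x^2), so D = 3."""
    return np.array([1.0, x, x**2])

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=8)               # n = 8 inputs
y = rng.standard_normal(8)                    # targets
theta = rng.standard_normal(3)

# Phi is D x n: the i-th column is phi(x_i)
Phi = np.column_stack([phi(x) for x in xs])   # shape (3, 8)

y_hat = Phi.T @ theta                         # predictions, shape (n,)

E_sum = sum((y[i] - phi(xs[i]) @ theta) ** 2 for i in range(len(xs)))
E_norm = np.linalg.norm(y - Phi.T @ theta) ** 2
print(np.isclose(E_sum, E_norm))              # True
```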

2. Derivation

We want to minimize $E(\theta)$ with respect to $\theta$. Let's expand the squared norm:

$$\begin{aligned} E(\theta) &= (y - \Phi^T \theta)^T (y - \Phi^T \theta) \\ &= (y^T - \theta^T \Phi)(y - \Phi^T \theta) \\ &= y^T y - y^T \Phi^T \theta - \theta^T \Phi y + \theta^T \Phi \Phi^T \theta \end{aligned}$$

Since $y^T \Phi^T \theta$ is a scalar, it equals its transpose $\theta^T \Phi y$. Thus:

$$E(\theta) = y^T y - 2 \theta^T \Phi y + \theta^T \Phi \Phi^T \theta$$

Now, we take the gradient with respect to $\theta$ and set it to zero:

$$\nabla_\theta E(\theta) = -2 \Phi y + 2 \Phi \Phi^T \theta = 0$$

(Using the identity $\nabla_x (x^T A x) = 2Ax$ when $A$ is symmetric, here with $A = \Phi \Phi^T$; the linear term contributes $\nabla_\theta (2 \theta^T \Phi y) = 2 \Phi y$.)
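
As a check on this gradient, the sketch below, assuming NumPy and a randomly generated $D \times n$ feature matrix, compares the analytic expression $-2\Phi y + 2\Phi\Phi^T\theta$ with a finite-difference approximation of $E(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 8, 3
Phi = rng.standard_normal((D, n))             # D x n feature matrix
y = rng.standard_normal(n)
theta = rng.standard_normal(D)

E = lambda t: np.linalg.norm(y - Phi.T @ t) ** 2
grad_analytic = -2 * Phi @ y + 2 * Phi @ Phi.T @ theta

eps = 1e-6
grad_numeric = np.array([
    (E(theta + eps * e) - E(theta - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```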

3. Solution (Normal Equation)

Rearranging the equation:

$$\Phi \Phi^T \theta = \Phi y$$

Assuming $\Phi \Phi^T$ (a $D \times D$ matrix) is invertible, which holds exactly when $\Phi$ has full row rank $D$ (i.e., the features are linearly independent and there are at least $D$ data points), we get:

$$\hat{\theta}_{LS} = (\Phi \Phi^T)^{-1} \Phi y$$
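
A minimal sketch of this closed-form solution, assuming NumPy and randomly generated data; in practice `np.linalg.solve` (or `np.linalg.lstsq`) is preferred over forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 50, 4
Phi = rng.standard_normal((D, n))                       # D x n feature matrix
theta_true = rng.standard_normal(D)
y = Phi.T @ theta_true + 0.1 * rng.standard_normal(n)   # noisy targets

# Normal equations: (Phi Phi^T) theta = Phi y
theta_ls = np.linalg.solve(Phi @ Phi.T, Phi @ y)

# Equivalent, more numerically robust least-squares solve of Phi^T theta = y
theta_lstsq, *_ = np.linalg.lstsq(Phi.T, y, rcond=None)
print(np.allclose(theta_ls, theta_lstsq))               # True
```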