
Answer

Prerequisites

  • MAP Estimation for Linear Regression
  • Regularization (Ridge Regression / L2 Penalty)
  • Matrix Calculus

Step-by-Step Derivation

  1. Substituting the i.i.d. assumptions into the MAP estimate: From part (b), the MAP estimate is

     $$\hat{\theta}_{MAP} = (\Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T)^{-1} \Phi \Sigma^{-1} y$$

     We are given that $\Gamma = \alpha I$ and $\Sigma = \sigma^2 I$, so the inverses are $\Gamma^{-1} = \frac{1}{\alpha} I$ and $\Sigma^{-1} = \frac{1}{\sigma^2} I$. Substituting these into the MAP equation:

     $$\hat{\theta}_{MAP} = \left(\frac{1}{\alpha} I + \Phi \left(\frac{1}{\sigma^2} I\right) \Phi^T\right)^{-1} \Phi \left(\frac{1}{\sigma^2} I\right) y$$
  2. Simplifying the algebraic expression: Pull the scalar $\frac{1}{\sigma^2}$ out of the inverted term:

     $$\hat{\theta}_{MAP} = \left[ \frac{1}{\sigma^2} \left( \frac{\sigma^2}{\alpha} I + \Phi \Phi^T \right) \right]^{-1} \Phi \left(\frac{1}{\sigma^2} I\right) y$$

     Apply the property $(cA)^{-1} = \frac{1}{c} A^{-1}$ for a nonzero scalar $c$:

     $$\hat{\theta}_{MAP} = \sigma^2 \left( \frac{\sigma^2}{\alpha} I + \Phi \Phi^T \right)^{-1} \frac{1}{\sigma^2} \Phi y$$

     The $\sigma^2$ factors cancel:

     $$\hat{\theta}_{MAP} = \left( \Phi \Phi^T + \frac{\sigma^2}{\alpha} I \right)^{-1} \Phi y$$

     Defining $\lambda = \frac{\sigma^2}{\alpha}$ gives the desired form:

     $$\hat{\theta}_{MAP} = (\Phi \Phi^T + \lambda I)^{-1} \Phi y$$

     Since $\alpha$ (the prior variance) is positive and $\sigma^2$ (the noise variance) is non-negative, $\lambda = \frac{\sigma^2}{\alpha} \ge 0$. This proves the first part.

  3. Solving the Regularized Least-Squares Problem: We want to show that the objective function in equation (3.49) leads to the same solution. Let $J(\theta)$ be the objective to minimize:

     $$J(\theta) = \lVert y - \Phi^T \theta \rVert^2 + \lambda \lVert \theta \rVert^2$$

     Expand the norms as inner products ($\lVert x \rVert^2 = x^T x$):

     $$J(\theta) = (y - \Phi^T \theta)^T (y - \Phi^T \theta) + \lambda \theta^T \theta$$
     $$J(\theta) = y^T y - y^T \Phi^T \theta - \theta^T \Phi y + \theta^T \Phi \Phi^T \theta + \lambda \theta^T \theta$$

     Note that $y^T \Phi^T \theta = (\theta^T \Phi y)^T$, and a scalar equals its own transpose, so the two cross terms combine:

     $$J(\theta) = y^T y - 2 \theta^T \Phi y + \theta^T (\Phi \Phi^T + \lambda I) \theta$$
  4. Taking the derivative and setting it to zero: To minimize $J(\theta)$, take the gradient with respect to the vector $\theta$ and set it to zero:

     $$\nabla_\theta J(\theta) = -2 \Phi y + 2 (\Phi \Phi^T + \lambda I) \theta = 0$$
     $$(\Phi \Phi^T + \lambda I) \theta = \Phi y$$

     Solving for $\theta$:

     $$\hat{\theta} = (\Phi \Phi^T + \lambda I)^{-1} \Phi y$$

    This is identical to equation (3.48), proving that the Bayesian MAP estimate under an isotropic Gaussian prior and isotropic Gaussian noise is mathematically equivalent to the solution of the frequentist $L_2$-regularized least-squares problem (ridge regression).
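The equivalence can be checked numerically. The sketch below (using NumPy; the dimensions and random data are illustrative assumptions, with the columns of $\Phi$ holding the $n$ feature vectors so that the model is $y \approx \Phi^T \theta$) evaluates the general MAP formula from step 1 and the ridge form from step 4 on the same data, and also confirms that the gradient from step 4 vanishes at the ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 50               # number of parameters, number of samples (assumed)
alpha, sigma2 = 2.0, 0.5   # prior variance alpha, noise variance sigma^2 (assumed)
lam = sigma2 / alpha       # lambda = sigma^2 / alpha

Phi = rng.normal(size=(d, n))  # design matrix; column i is the feature vector of sample i
y = rng.normal(size=n)

# General MAP estimate with Gamma = alpha*I, Sigma = sigma^2*I:
#   theta = (Gamma^{-1} + Phi Sigma^{-1} Phi^T)^{-1} Phi Sigma^{-1} y
Gamma_inv = np.eye(d) / alpha
Sigma_inv = np.eye(n) / sigma2
theta_map = np.linalg.solve(Gamma_inv + Phi @ Sigma_inv @ Phi.T, Phi @ Sigma_inv @ y)

# Ridge-regression form: theta = (Phi Phi^T + lambda*I)^{-1} Phi y
theta_ridge = np.linalg.solve(Phi @ Phi.T + lam * np.eye(d), Phi @ y)

# Gradient of J at the ridge solution should be (numerically) zero
grad = -2 * Phi @ y + 2 * (Phi @ Phi.T + lam * np.eye(d)) @ theta_ridge

print(np.allclose(theta_map, theta_ridge), np.allclose(grad, 0.0))  # True True
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically stable way to evaluate expressions of the form $A^{-1}b$.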