Answer

Prerequisites

  • Maximum Likelihood Estimation (MLE): Finding the parameter values that maximize the likelihood of the observed data under the assumed model.
  • Gaussian Distribution: Probability density function of a normal distribution.
  • Log-Likelihood: Taking the logarithm of the likelihood function to simplify products into sums.
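
The third prerequisite can be seen numerically. A minimal sketch (with hypothetical standard-normal data) showing why we work with the log-likelihood: the raw product of many densities underflows to zero in floating point, while the sum of logs stays finite:

```python
import numpy as np

# Hypothetical data: 1000 samples from a standard normal.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Density of each sample under N(0, 1).
densities = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

product_likelihood = np.prod(densities)      # underflows to 0.0
log_likelihood = np.sum(np.log(densities))   # finite and well-behaved

print(product_likelihood, log_likelihood)
```

Both quantities rank parameter values identically (the log is monotone), but only the second is numerically usable at scale.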

Step-by-Step Derivation

  1. Define the Probability Model. We are given that $y_i = \phi(x_i)^T \theta + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Because $\phi(x_i)^T \theta$ is a deterministic value for a given $x_i$ and $\theta$, the distribution of $y_i$ is a Gaussian centered at $\phi(x_i)^T \theta$:

    $$p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right)$$
  2. Write the Likelihood Function. Since the samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ are independent and identically distributed (i.i.d.), the joint likelihood of all $n$ observations is the product of their individual probabilities:

    $$\begin{aligned} L(\theta) &= p(y_1, \dots, y_n \mid x_1, \dots, x_n, \theta) \\ &= \prod_{i=1}^n p(y_i \mid x_i, \theta) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right) \end{aligned}$$
  3. Compute the Log-Likelihood Function. To find the maximum, it is much simpler to maximize the natural logarithm of the likelihood, $\ln L(\theta)$, often denoted $\ell(\theta)$. The logarithm is a monotonically increasing function, so maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$.

    $$\begin{aligned} \ell(\theta) &= \ln \left( \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right) \right) \\ &= \sum_{i=1}^n \left( \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}}\right] - \frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right) \\ &= -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2 \end{aligned}$$
  4. Show Equivalence to Least Squares. We want the $\theta$ that maximizes $\ell(\theta)$. Notice that the first term $-\frac{n}{2} \ln(2\pi\sigma^2)$ is a constant with respect to $\theta$, and the factor $\frac{1}{2\sigma^2}$ is a positive constant. Therefore, maximizing $\ell(\theta)$ is exactly equivalent to minimizing the sum of squared residuals:

    $$\arg\max_\theta \ell(\theta) = \arg\min_\theta \sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2$$

    This summation is exactly the sum-squared-error objective function $J(\theta)$ from part (a):

    $$\sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2 = \| y - \Phi^T \theta \|^2$$
  5. Conclusion. Since the optimization problems are identical, the Maximum Likelihood (ML) estimate $\hat{\theta}_{ML}$ must be equal to the Least Squares estimate $\hat{\theta}_{LS}$:

    $$\hat{\theta}_{ML} = \hat{\theta}_{LS} = (\Phi \Phi^T)^{-1} \Phi y$$
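
The whole derivation can be checked numerically. A minimal sketch, using assumed synthetic data (`Phi`, `theta_true`, and `sigma` are illustrative choices, not from the original problem): the closed-form solution $(\Phi \Phi^T)^{-1} \Phi y$ recovers the generating parameters, and no nearby $\theta$ achieves a higher Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 50, 3

# Columns of Phi are the feature vectors phi(x_i), matching y = Phi^T theta + noise.
Phi = rng.normal(size=(d, n))
theta_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground truth
sigma = 0.1
y = Phi.T @ theta_true + sigma * rng.normal(size=n)

# Closed-form least-squares / ML estimate: (Phi Phi^T)^{-1} Phi y.
theta_hat = np.linalg.solve(Phi @ Phi.T, Phi @ y)

def log_likelihood(theta):
    """Gaussian log-likelihood from step 3 of the derivation."""
    resid = y - Phi.T @ theta
    return -n / 2 * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

# Perturbing theta_hat in any direction can only decrease the log-likelihood,
# confirming it is the maximizer.
for _ in range(5):
    direction = rng.normal(size=d)
    assert log_likelihood(theta_hat) >= log_likelihood(theta_hat + 0.01 * direction)

print(theta_hat)
```

Note the use of `np.linalg.solve` rather than forming the inverse explicitly; it is the numerically preferred way to apply $(\Phi \Phi^T)^{-1}$.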