
Answer

Prerequisite Knowledge

  1. Probability Density Function (PDF):
    • Gaussian distribution: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$ (see the numerical sketch after this list).
  2. Log-Likelihood:
    • Properties of logarithms: $\ln(ab) = \ln a + \ln b$, $\ln(e^x) = x$.
  3. Optimization:
    • Maximizing a function is equivalent to maximizing its logarithm (since the logarithm is strictly increasing).
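
The following is a minimal numerical sketch of these prerequisites, assuming NumPy and SciPy are available; the values of $\mu$ and $\sigma$ are arbitrary illustration choices, not part of the problem.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0                      # arbitrary illustrative parameters
x = np.linspace(-5.0, 7.0, 1001)

# (1) Gaussian PDF written out explicitly, checked against SciPy's implementation.
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
assert np.allclose(pdf_manual, norm.pdf(x, loc=mu, scale=sigma))

# (3) The logarithm is strictly increasing, so the maximizer does not change.
assert np.argmax(pdf_manual) == np.argmax(np.log(pdf_manual))
```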

Step-by-Step Answer

1. The Likelihood Function

We observe $y_i = \phi(x_i)^T \theta + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. This implies that, given $x_i$ and $\theta$, $y_i$ follows a Gaussian distribution with mean $\mu_i = \phi(x_i)^T \theta$ and variance $\sigma^2$:

$$p(y_i | x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right)$$

Since the samples are independent and identically distributed (i.i.d.), the likelihood of the entire dataset is the product of the individual densities:

$$L(\theta) = p(\mathcal{D} | \theta) = \prod_{i=1}^n p(y_i | x_i, \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right)$$
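
As a hedged sketch of this step, the snippet below evaluates $L(\theta)$ literally as a product of per-sample Gaussian densities on simulated data; the feature map $\phi(x) = (1, x)^T$, the parameter vector, and the noise level are made-up illustration choices rather than anything given in the problem.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true = np.array([0.5, -1.3])                   # made-up "true" parameters
sigma = 0.4                                          # made-up noise standard deviation
x = rng.uniform(-2.0, 2.0, size=20)
Phi_rows = np.stack([np.ones_like(x), x], axis=1)    # row i is phi(x_i)^T
y = Phi_rows @ theta_true + rng.normal(0.0, sigma, size=x.size)

def likelihood(theta):
    mu_i = Phi_rows @ theta                              # mean of each y_i
    return np.prod(norm.pdf(y, loc=mu_i, scale=sigma))   # product over i.i.d. samples

# The true parameters should score a much higher likelihood than a perturbed vector.
print(likelihood(theta_true), likelihood(theta_true + 1.0))
```

Note that this product underflows to zero for even moderately large $n$, which is one practical reason to work with the log-likelihood in the next step.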

2. The Log-Likelihood Function

It is easier to maximize the log-likelihood $\ell(\theta) = \ln L(\theta)$ because the logarithm turns the product into a sum.

$$\begin{aligned} \ell(\theta) &= \ln \left( \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right) \right) \\ &= \sum_{i=1}^n \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} + \ln \exp\left( -\frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right) \right) \\ &= \sum_{i=1}^n \left( -\frac{1}{2} \ln(2\pi\sigma^2) - \frac{(y_i - \phi(x_i)^T \theta)^2}{2\sigma^2} \right) \end{aligned}$$

Simplifying:

$$\ell(\theta) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2$$
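
As a quick sanity check of this simplification, the sketch below evaluates the closed-form expression for $\ell(\theta)$ and compares it with the sum of per-sample Gaussian log-densities; it reuses the same made-up toy setup ($\phi(x) = (1, x)^T$, arbitrary $\theta$ and $\sigma$) as the earlier sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true = np.array([0.5, -1.3])
sigma = 0.4
x = rng.uniform(-2.0, 2.0, size=20)
Phi_rows = np.stack([np.ones_like(x), x], axis=1)    # row i is phi(x_i)^T
y = Phi_rows @ theta_true + rng.normal(0.0, sigma, size=x.size)
n = y.size

def log_likelihood(theta):
    resid = y - Phi_rows @ theta
    return -n / 2 * np.log(2 * np.pi * sigma ** 2) - resid @ resid / (2 * sigma ** 2)

# Matches summing the per-sample Gaussian log-densities directly.
assert np.isclose(log_likelihood(theta_true),
                  norm.logpdf(y, loc=Phi_rows @ theta_true, scale=sigma).sum())
```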

3. Maximization

To find the ML estimate $\hat{\theta}_{ML}$, we maximize $\ell(\theta)$ with respect to $\theta$. Notice that the first term $-\frac{n}{2} \ln(2\pi\sigma^2)$ is constant with respect to $\theta$ and can be ignored, so maximizing $\ell(\theta)$ is equivalent to maximizing the remaining term:

$$-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2$$

Since $\frac{1}{2\sigma^2} > 0$, maximizing this expression is equivalent to minimizing the sum of squared residuals:

$$\hat{\theta}_{ML} = \arg\min_\theta \sum_{i=1}^n (y_i - \phi(x_i)^T \theta)^2$$

This objective function is exactly the sum-squared-error from Part (a). Therefore, minimizing the sum of squared errors is equivalent to maximizing the likelihood under the assumption of Gaussian noise. The solution is the same:

$$\hat{\theta}_{ML} = (\Phi \Phi^T)^{-1} \Phi y$$
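
As a final hedged sketch, the snippet below verifies numerically that this closed-form ML estimate coincides with the ordinary least-squares solution. Matching the formula above, $\Phi$ is assumed to hold the feature vectors $\phi(x_i)$ as its columns (a $D \times n$ matrix); the toy feature map $\phi(x) = (1, x)^T$ and the noise level are illustration choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([0.5, -1.3])                  # made-up "true" parameters
sigma = 0.4                                         # made-up noise standard deviation
x = rng.uniform(-2.0, 2.0, size=50)
Phi = np.stack([np.ones_like(x), x])                # shape (D, n); column i is phi(x_i)
y = Phi.T @ theta_true + rng.normal(0.0, sigma, size=x.size)

theta_closed_form = np.linalg.solve(Phi @ Phi.T, Phi @ y)   # (Phi Phi^T)^{-1} Phi y
theta_lstsq, *_ = np.linalg.lstsq(Phi.T, y, rcond=None)     # generic least squares

assert np.allclose(theta_closed_form, theta_lstsq)
print(theta_closed_form)    # close to theta_true when the noise is small
```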