Answer
Prerequisites
- Maximum A Posteriori (MAP) Estimation
- Least Squares and Weighted Least Squares
Step-by-Step Derivation
- Finding the MAP Estimate: The Maximum A Posteriori (MAP) estimate is the mode of the posterior distribution $p(\theta \mid y)$. Since the posterior is a Gaussian distribution, as derived in part (a), and the mode of a Gaussian distribution is equal to its mean, the MAP estimate is simply the posterior mean:
$$\hat{\theta}_{\text{MAP}} = \left(X^\top \Sigma^{-1} X + \Sigma_p^{-1}\right)^{-1} X^\top \Sigma^{-1} y,$$
where $X$ is the design matrix, $y$ the observations, $\Sigma$ the noise covariance, and $\Sigma_p^{-1}$ the prior precision.
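The posterior mean above can be computed directly. A minimal NumPy sketch (the dimensions, seed, and covariance values are illustrative assumptions, not part of the original problem):

```python
import numpy as np

# Gaussian linear model: y = X @ theta + noise, noise ~ N(0, Sigma),
# with a zero-mean Gaussian prior theta ~ N(0, Sigma_p).
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Sigma = 0.25 * np.eye(n)                      # noise covariance (assumed known)
y = X @ theta_true + rng.multivariate_normal(np.zeros(n), Sigma)

Sigma_p_inv = 1.0 * np.eye(d)                 # prior precision Sigma_p^{-1}
Sigma_inv = np.linalg.inv(Sigma)

# Posterior mean: since the posterior is Gaussian, its mode equals its
# mean, so solving the regularized normal equations gives the MAP estimate.
A = X.T @ Sigma_inv @ X + Sigma_p_inv
theta_map = np.linalg.solve(A, X.T @ Sigma_inv @ y)
print(theta_map)
```

Using `np.linalg.solve` on the posterior precision rather than forming its inverse explicitly is the standard, numerically safer way to evaluate this expression.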
- Comparison with Other Estimates:
- Ordinary Least Squares (OLS): The standard least squares estimate is derived by assuming i.i.d. noise (i.e., $\Sigma = \sigma^2 I$) and no prior (or a uniform, improper prior, representing no regularization). OLS yields:
$$\hat{\theta}_{\text{OLS}} = (X^\top X)^{-1} X^\top y.$$
- Weighted Least Squares (WLS): Unweighted least squares can be extended to account for heteroscedastic or correlated noise with covariance $\Sigma$. The WLS estimate (which coincides with the Maximum Likelihood Estimate, MLE, under Gaussian noise with covariance $\Sigma$) is:
$$\hat{\theta}_{\text{WLS}} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y.$$
- Difference: The MAP estimate differs from the WLS estimate exactly by the addition of the prior precision term $\Sigma_p^{-1}$ inside the inverse.
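The three estimates can be compared side by side. A sketch under assumed data (the heteroscedastic variances and prior precision are illustrative choices):

```python
import numpy as np

# With heteroscedastic noise, OLS, WLS, and MAP generally differ;
# MAP differs from WLS only by the prior precision added inside the inverse.
rng = np.random.default_rng(1)
n, d = 30, 2
X = rng.normal(size=(n, d))
noise_var = rng.uniform(0.1, 2.0, size=n)     # per-observation noise variances
Sigma_inv = np.diag(1.0 / noise_var)
y = X @ np.array([2.0, -1.0]) + rng.normal(0.0, np.sqrt(noise_var))

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
theta_wls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

Sigma_p_inv = 5.0 * np.eye(d)                 # prior precision Sigma_p^{-1}
theta_map = np.linalg.solve(X.T @ Sigma_inv @ X + Sigma_p_inv,
                            X.T @ Sigma_inv @ y)

# The MAP estimate is the WLS estimate shrunk towards zero by the prior.
print(theta_ols, theta_wls, theta_map)
```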
- Role of the New Term ($\Sigma_p^{-1}$): The matrix $\Sigma_p^{-1}$ is the precision matrix (inverse of the covariance $\Sigma_p$) of the prior distribution $\theta \sim \mathcal{N}(0, \Sigma_p)$.
- It acts as a regularization term.
- Since our prior is centered at zero ($\mu_p = 0$), the quadratic penalty $\theta^\top \Sigma_p^{-1} \theta$ mathematically penalizes parameter vectors that have large magnitudes.
- Instead of letting the parameters grow unboundedly to perfectly fit the noise in the training set, the prior precision pulls the MAP estimate towards $0$.
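This shrinkage effect is easy to see numerically. A sketch with an isotropic prior precision $\alpha I$ and unit noise covariance folded in (illustrative values, not from the original problem):

```python
import numpy as np

# As the prior precision alpha * I grows, the MAP estimate is pulled
# towards zero: its norm decreases monotonically with alpha.
rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))
y = X @ np.array([3.0, -3.0, 3.0]) + 0.1 * rng.normal(size=15)

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0]:
    # alpha = 0 recovers plain least squares; larger alpha shrinks harder.
    theta = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
    norms.append(np.linalg.norm(theta))

print(norms)
```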
- Advantage of Non-Zero Prior Precision: Yes, there are significant advantages to having a non-zero $\Sigma_p^{-1}$ (meaning a finite prior covariance $\Sigma_p$):
- Prevents Overfitting: By penalizing large parameter values, the model is less sensitive to noise in the training data, leading to better generalization on unseen data.
- Numerical Stability: In the regime $d > n$ (more features than data points), or when features are highly collinear, the matrix $X^\top \Sigma^{-1} X$ can be non-invertible (singular) or ill-conditioned. Adding a positive-definite matrix $\Sigma_p^{-1}$ ensures that $X^\top \Sigma^{-1} X + \Sigma_p^{-1}$ is strictly positive-definite and safely invertible.
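The stability point can be demonstrated directly. A sketch with assumed dimensions $d > n$ (and, for simplicity, $\Sigma = I$ so the Gram matrix is just $X^\top X$):

```python
import numpy as np

# With more features than samples (d > n), X.T @ X has rank at most n < d,
# so the plain normal equations are singular; adding a positive-definite
# prior precision restores full rank.
rng = np.random.default_rng(3)
n, d = 5, 10
X = rng.normal(size=(n, d))
gram = X.T @ X

print(np.linalg.matrix_rank(gram))            # at most n = 5: singular

Sigma_p_inv = 1e-2 * np.eye(d)                # positive-definite prior precision
regularized = gram + Sigma_p_inv
print(np.linalg.matrix_rank(regularized))     # full rank d = 10: invertible
```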