Detailed Explanation
The goal of this problem is to perform Bayesian Linear Regression. Unlike standard maximum likelihood estimation (which leads to Least Squares), the Bayesian approach treats the weight vector $\mathbf{w}$ as a random variable with a prior distribution.
1. The Model and Prior
- Likelihood: The relationship between inputs and outputs is linear (in feature space) with added Gaussian noise. This means that if we know $\mathbf{w}$, the probability of observing $y$ follows a Gaussian distribution centered at the prediction $\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$: $p(y \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\bigl(y \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}),\ \beta^{-1}\bigr)$, where $\boldsymbol{\phi}$ is the feature map and $\beta$ is the noise precision.
- Prior: Before seeing any data, we assume $\mathbf{w}$ follows a Gaussian distribution centered at $\mathbf{0}$ with covariance $\alpha^{-1}\mathbf{I}$: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0},\ \alpha^{-1}\mathbf{I})$. This encodes our belief that the weights shouldn't be too large (regularization).
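As a concrete illustration, here is a minimal NumPy sketch of this setup. The polynomial feature map `phi` and the specific values of `alpha` (prior precision) and `beta` (noise precision) are illustrative assumptions, not values fixed by the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example settings (illustrative, not from the problem statement):
alpha = 2.0   # prior precision:  p(w) = N(0, (1/alpha) * I)
beta = 25.0   # noise precision:  y = w^T phi(x) + eps,  eps ~ N(0, 1/beta)

def phi(x):
    """Example feature map: polynomial features [1, x, x^2]."""
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

# Synthetic data generated from a "true" weight vector drawn from the prior.
w_true = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=3)
x = rng.uniform(-1.0, 1.0, size=20)
Phi = phi(x)                    # design matrix, shape (N, 3)
y = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=20)
```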
2. Deriving the Posterior
We want to find $p(\mathbf{w} \mid \mathcal{D})$, which represents our updated belief about the parameters after seeing the data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$. We use Bayes' rule:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}).$$
Since both the likelihood and the prior are Gaussian, the posterior is also Gaussian. This is a property of conjugate distributions (the Gaussian is conjugate to itself for the mean).
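To make the next steps concrete, here is the product being manipulated, written out with the likelihood and prior defined above (constants independent of $\mathbf{w}$ dropped):

$$p(\mathbf{w} \mid \mathcal{D}) \;\propto\; \underbrace{\exp\!\Bigl(-\tfrac{\beta}{2}\sum_{n=1}^{N}\bigl(y_n - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^2\Bigr)}_{\text{likelihood}} \;\cdot\; \underbrace{\exp\!\Bigl(-\tfrac{\alpha}{2}\,\mathbf{w}^\top\mathbf{w}\Bigr)}_{\text{prior}}.$$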
The derivation is primarily linear-algebra manipulation inside the exponents:
- Exponentials: We multiply the exponential factors of the prior and likelihood. Since $e^{a}\,e^{b} = e^{a+b}$, this corresponds to adding the arguments inside the exponentials.
- Quadratic Form: The argument of a multivariate Gaussian exponential is a quadratic form: $-\tfrac{1}{2}(\mathbf{w} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu})$.
- Completing the Square: We expand the sum of the prior and likelihood arguments and group all terms that contain $\mathbf{w}$.
- The terms quadratic in $\mathbf{w}$ determine the Precision Matrix (the inverse covariance). We find that the posterior precision is the sum of the prior precision ($\alpha\mathbf{I}$) and the data precision ($\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$, where $\boldsymbol{\Phi}$ is the design matrix).
- The terms linear in $\mathbf{w}$ give us the Posterior Mean.
- Result: the posterior is $p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$, with:
  - Posterior Covariance: $\boldsymbol{\Sigma}_N = \bigl(\alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\bigr)^{-1}$. It shrinks as we get more data (the $\beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ term grows with $N$).
  - Posterior Mean: $\boldsymbol{\mu}_N = \beta\,\boldsymbol{\Sigma}_N\,\boldsymbol{\Phi}^\top \mathbf{y}$. It is a weighted combination of the prior mean ($\mathbf{0}$) and the data estimate.
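These formulas translate directly into a few lines of NumPy. The sketch below is self-contained; the toy design matrices and the values `alpha=2.0`, `beta=25.0` are illustrative assumptions consistent with the earlier example, and the loop also demonstrates the shrinkage claim above: the total posterior variance decreases as $N$ grows.

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Posterior over weights: N(w | mu_N, Sigma_N), where
    Sigma_N = (alpha*I + beta*Phi^T Phi)^{-1} and mu_N = beta * Sigma_N Phi^T y."""
    D = Phi.shape[1]
    S_inv = alpha * np.eye(D) + beta * Phi.T @ Phi   # posterior precision
    Sigma_N = np.linalg.inv(S_inv)                   # posterior covariance
    mu_N = beta * Sigma_N @ Phi.T @ y                # posterior mean
    return mu_N, Sigma_N

# Toy check with assumed values: uncertainty shrinks as N grows.
rng = np.random.default_rng(0)
w_true = np.array([0.3, -0.5, 0.8])                  # hypothetical true weights
for N in (5, 50, 500):
    Phi = rng.uniform(-1.0, 1.0, size=(N, 3))
    y = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(25.0), size=N)
    mu_N, Sigma_N = posterior(Phi, y, alpha=2.0, beta=25.0)
    # trace(Sigma_N) decreases with N; mu_N approaches w_true.
    print(N, np.trace(Sigma_N), mu_N.round(2))
```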
This result is fundamental to Bayesian learning. It shows how data updates our uncertainty ($\boldsymbol{\Sigma}_N$) and our best guess ($\boldsymbol{\mu}_N$) about the model parameters.