Detailed Explanation

The goal of this problem is to perform Bayesian Linear Regression. Unlike standard maximum likelihood estimation (which leads to Least Squares), the Bayesian approach treats the parameters $\theta$ as random variables with a prior distribution.

1. The Model and Prior

  • Likelihood: The relationship between inputs and outputs is linear (in feature space) with added Gaussian noise of covariance $\Sigma$. This means that if we know $\theta$, the probability of observing $y$ follows a Gaussian distribution centered at the prediction $\Phi^T \theta$.
  • Prior: Before seeing any data, we assume $\theta$ follows a Gaussian distribution centered at $0$ with covariance $\Gamma$. This encodes our belief that the weights shouldn't be too large (regularization). Both densities are written out below.
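
With $y$ denoting the stacked vector of observed targets (a notational assumption; the problem statement describes the observations individually), the two ingredients read:

$$
p(y \mid \theta) = \mathcal{N}\!\left(y \mid \Phi^T \theta,\; \Sigma\right), \qquad
p(\theta) = \mathcal{N}\!\left(\theta \mid 0,\; \Gamma\right)
$$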

2. Deriving the Posterior

We want to find $p(\theta \mid \mathcal{D})$, which represents our updated belief about the parameters after seeing the data. We use Bayes' rule:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

Since both the likelihood and the prior are Gaussian, the posterior will also be Gaussian. This is a property of conjugate distributions (the Gaussian is conjugate to itself for the mean).
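
Written out explicitly, with normalization constants dropped because they do not depend on $\theta$, the proportionality is:

$$
p(\theta \mid \mathcal{D}) \;\propto\;
\exp\!\left(-\tfrac{1}{2}\,(y - \Phi^T \theta)^T \Sigma^{-1} (y - \Phi^T \theta)\right)
\exp\!\left(-\tfrac{1}{2}\,\theta^T \Gamma^{-1} \theta\right)
$$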

The derivation primarily involves linear algebra manipulation inside the exponential function:

  1. Exponentials: We multiply the exponential functions of the prior and likelihood. Because $\exp(A)\exp(B) = \exp(A+B)$, this corresponds to adding the arguments inside the exponentials.
  2. Quadratic Form: The argument of a multivariate Gaussian exponential is a quadratic form: $-\frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)$.
  3. Completing the Square: We expand the sum of the prior and likelihood arguments and group all terms that contain $\theta$.
    • The terms scaling with $\theta^T \dots \theta$ (quadratic) determine the Precision Matrix (inverse covariance). We find that the posterior precision is the sum of the prior precision ($\Gamma^{-1}$) and the data precision ($\Phi \Sigma^{-1} \Phi^T$).
    • The terms linear in $\theta$ help us find the Posterior Mean.
  4. Result (written out below):
    • Posterior Covariance ($\hat{\Sigma}_\theta$): It shrinks as we get more data (the $\Phi \Sigma^{-1} \Phi^T$ term grows).
    • Posterior Mean ($\hat{\mu}_\theta$): It is a weighted combination of the prior mean ($0$) and the data estimate.
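
Completing the square yields the standard closed forms in this notation (the mean expression comes from matching the terms linear in $\theta$; $y$ again denotes the stacked targets):

$$
\hat{\Sigma}_\theta = \left(\Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T\right)^{-1}, \qquad
\hat{\mu}_\theta = \hat{\Sigma}_\theta \,\Phi \Sigma^{-1} y
$$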

This result is fundamental to Bayesian learning. It shows how data updates our uncertainty ($\hat{\Sigma}_\theta$) and our best guess ($\hat{\mu}_\theta$) about the model parameters.
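
To make the result concrete, here is a minimal NumPy sketch of the posterior update. The function name, the convention of storing $\Phi$ as features × samples, and the toy data are illustrative assumptions, not part of the original problem.

```python
import numpy as np

def posterior(Phi, y, Gamma, Sigma):
    """Posterior over theta for Bayesian linear regression.

    Phi   : (d, n) feature matrix (one column per observation)
    y     : (n,)   observed targets
    Gamma : (d, d) prior covariance of theta (prior mean is 0)
    Sigma : (n, n) observation noise covariance
    Returns the posterior mean and covariance of theta.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    # Posterior precision = prior precision + data precision
    precision = np.linalg.inv(Gamma) + Phi @ Sigma_inv @ Phi.T
    Sigma_hat = np.linalg.inv(precision)
    # Posterior mean: the data estimate pulled toward the prior mean (0)
    mu_hat = Sigma_hat @ Phi @ Sigma_inv @ y
    return mu_hat, Sigma_hat

# Toy usage: features [1, x] for noisy data generated from y = 1.5 - 0.7 x
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.vstack([np.ones_like(x), x])              # shape (2, 20)
y = 1.5 - 0.7 * x + 0.1 * rng.normal(size=20)
mu_hat, Sigma_hat = posterior(Phi, y, Gamma=np.eye(2), Sigma=0.01 * np.eye(20))
print(mu_hat)      # close to [1.5, -0.7]
print(Sigma_hat)   # small entries: uncertainty has shrunk
```

Adding more observations makes the data-precision term $\Phi \Sigma^{-1} \Phi^T$ larger, so `Sigma_hat` shrinks, matching the qualitative statement in the result above.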