
Answer

Prerequisites

  • Bayes' Theorem
  • Multivariate Gaussian Distribution
  • Completing the Square for Matrices

Step-by-Step Derivation

  1. Bayes' Theorem: The posterior distribution of the parameters $\theta$ given the data $\mathcal{D}$ can be found using Bayes' rule:

    $$p(\theta | \mathcal{D}) = \frac{p(y | \theta, X)\, p(\theta)}{p(y | X)} \propto p(y | \theta, X)\, p(\theta)$$

    where $y$ contains the targets and $X$ contains all features. To find the posterior, we can work with the unnormalized log-posterior:

    $$\ln p(\theta | \mathcal{D}) = \ln p(y | \theta, X) + \ln p(\theta) + \text{const}$$
  2. Likelihood and Prior: From the model equation $y = \Phi^T \theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \Sigma)$, the likelihood is $y | \theta, X \sim \mathcal{N}(\Phi^T \theta, \Sigma)$. Hence:

    $$\ln p(y | \theta, X) = -\frac{1}{2} (y - \Phi^T \theta)^T \Sigma^{-1} (y - \Phi^T \theta) + \text{const}$$

    The prior is $p(\theta) = \mathcal{N}(0, \Gamma)$, so:

    $$\ln p(\theta) = -\frac{1}{2} \theta^T \Gamma^{-1} \theta + \text{const}$$
  3. Log-Posterior: Adding the log-likelihood and log-prior:

    $$\begin{align*} \ln p(\theta | \mathcal{D}) &= -\frac{1}{2} \left( (y - \Phi^T \theta)^T \Sigma^{-1} (y - \Phi^T \theta) + \theta^T \Gamma^{-1} \theta \right) + \text{const} \\ &= -\frac{1}{2} \left( y^T \Sigma^{-1} y - y^T \Sigma^{-1} \Phi^T \theta - \theta^T \Phi \Sigma^{-1} y + \theta^T \Phi \Sigma^{-1} \Phi^T \theta + \theta^T \Gamma^{-1} \theta \right) + \text{const} \end{align*}$$

    Noting that $y^T \Sigma^{-1} y$ is constant with respect to $\theta$, and that $y^T \Sigma^{-1} \Phi^T \theta = \theta^T \Phi \Sigma^{-1} y$ (the two cross terms are transposes of each other; each is a scalar, and a scalar equals its own transpose since $\Sigma^{-1}$ is symmetric), we collect the terms quadratic and linear in $\theta$:

    $$\ln p(\theta | \mathcal{D}) = -\frac{1}{2} \left[ \theta^T (\Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T) \theta - 2 \theta^T \Phi \Sigma^{-1} y \right] + \text{const}$$
  4. Completing the Square: We want to express this in the form of a general normal distribution log-pdf:

    $$\ln \mathcal{N}(\theta | \hat{\mu}_\theta, \hat{\Sigma}_\theta) = -\frac{1}{2} (\theta - \hat{\mu}_\theta)^T \hat{\Sigma}_\theta^{-1} (\theta - \hat{\mu}_\theta) + \text{const}'$$

    Expanding this form yields:

    $$-\frac{1}{2} \left( \theta^T \hat{\Sigma}_\theta^{-1} \theta - 2 \theta^T \hat{\Sigma}_\theta^{-1} \hat{\mu}_\theta + \hat{\mu}_\theta^T \hat{\Sigma}_\theta^{-1} \hat{\mu}_\theta \right) + \text{const}'$$

    Comparing the quadratic term in $\theta$:

    $$\hat{\Sigma}_\theta^{-1} = \Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T \implies \hat{\Sigma}_\theta = (\Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T)^{-1}$$

    Comparing the linear term in $\theta$:

    $$\hat{\Sigma}_\theta^{-1} \hat{\mu}_\theta = \Phi \Sigma^{-1} y \implies \hat{\mu}_\theta = \hat{\Sigma}_\theta \Phi \Sigma^{-1} y = (\Gamma^{-1} + \Phi \Sigma^{-1} \Phi^T)^{-1} \Phi \Sigma^{-1} y$$
  5. Conclusion: Thus, the posterior is the Gaussian distribution $p(\theta | \mathcal{D}) = \mathcal{N}(\theta | \hat{\mu}_\theta, \hat{\Sigma}_\theta)$ with the required mean and covariance.
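The derivation above can be checked numerically: if the closed-form $\hat{\mu}_\theta$ and $\hat{\Sigma}_\theta$ are correct, the unnormalized log-posterior and the Gaussian log-pdf with those parameters should differ only by a $\theta$-independent constant. Here is a minimal NumPy sketch of that check; the dimensions and the randomly generated $\Phi$, $y$, $\Sigma$, $\Gamma$ are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): d parameters, n observations.
d, n = 3, 5
Phi = rng.standard_normal((d, n))   # feature matrix; model is y = Phi^T theta + eps
y = rng.standard_normal(n)

# Random symmetric positive-definite noise and prior covariances Sigma, Gamma.
A = rng.standard_normal((n, n)); Sigma = A @ A.T + n * np.eye(n)
B = rng.standard_normal((d, d)); Gamma = B @ B.T + d * np.eye(d)
Si, Gi = np.linalg.inv(Sigma), np.linalg.inv(Gamma)

# Closed-form posterior from the derivation.
Sigma_post = np.linalg.inv(Gi + Phi @ Si @ Phi.T)
mu_post = Sigma_post @ Phi @ Si @ y

def log_post_unnorm(theta):
    """Log-likelihood plus log-prior, dropping theta-independent constants."""
    r = y - Phi.T @ theta
    return -0.5 * (r @ Si @ r + theta @ Gi @ theta)

def log_gauss_unnorm(theta):
    """Log-pdf of N(mu_post, Sigma_post), dropping the normalizer."""
    e = theta - mu_post
    return -0.5 * e @ np.linalg.inv(Sigma_post) @ e

# The two should agree up to the same additive constant at every theta.
thetas = rng.standard_normal((4, d))
diffs = [log_post_unnorm(t) - log_gauss_unnorm(t) for t in thetas]
assert np.allclose(diffs, diffs[0])
```

The constant offset collected in `diffs` is exactly the $\text{const}$ absorbed throughout the derivation ($-\frac{1}{2} y^T \Sigma^{-1} y + \frac{1}{2} \hat{\mu}_\theta^T \hat{\Sigma}_\theta^{-1} \hat{\mu}_\theta$), so equal differences at several random points confirm the completed square.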