
Answer

Prerequisites

  • Linear Transformation of Gaussian Variables (Problem 1.1)
  • Marginalization / Predictive Distribution

Step-by-Step Derivation

  1. Predictive Distribution of the Latent Function $f_*$: We are asked to find the distribution of the noise-free prediction $f_* = f(x_*, \theta) = \phi(x_*)^T \theta$ given the data $\mathcal{D}$. From part (a), we know the posterior of $\theta$ is a Gaussian:

    $$p(\theta \mid \mathcal{D}) = \mathcal{N}(\theta \mid \hat{\mu}_\theta, \hat{\Sigma}_\theta)$$

    Since $f_*$ is a linear combination of the Gaussian random vector $\theta$, $f_*$ is also Gaussian. This follows from the rule that if $x \sim \mathcal{N}(\mu, \Sigma)$, then $Ax \sim \mathcal{N}(A\mu, A\Sigma A^T)$. Here, the "matrix" $A$ is the row vector $\phi(x_*)^T$, and the random variable is $\theta$.

    • Mean: $\hat{\mu}_* = \mathbb{E}[f_* \mid x_*, \mathcal{D}] = \mathbb{E}[\phi(x_*)^T \theta \mid x_*, \mathcal{D}] = \phi(x_*)^T \mathbb{E}[\theta \mid \mathcal{D}] = \phi(x_*)^T \hat{\mu}_\theta$
    • Variance: $\hat{\sigma}^2_* = \text{Var}(f_* \mid x_*, \mathcal{D}) = \text{Var}(\phi(x_*)^T \theta \mid x_*, \mathcal{D}) = \phi(x_*)^T \text{Var}(\theta \mid \mathcal{D}) \, \phi(x_*) = \phi(x_*)^T \hat{\Sigma}_\theta \phi(x_*)$

    Therefore, the predictive distribution for the latent function is:

    $$p(f_* \mid x_*, \mathcal{D}) = \mathcal{N}(f_* \mid \phi(x_*)^T \hat{\mu}_\theta, \; \phi(x_*)^T \hat{\Sigma}_\theta \phi(x_*)) = \mathcal{N}(f_* \mid \hat{\mu}_*, \hat{\sigma}^2_*)$$
  2. Predictive Distribution of the Output $y_*$: The observed target $y_*$ includes the observation noise: $y_* = f_* + \epsilon_*$, where $\epsilon_* \sim \mathcal{N}(0, \sigma^2)$. The question asks us to compute the integral:

    $$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid x_*, \theta) \, p(\theta \mid \mathcal{D}) \, d\theta$$

    Using the hint, because $y_*$ depends on $\theta$ only through the deterministic mapping $f_* = \phi(x_*)^T \theta$, we can marginalize over the scalar $f_*$ instead of the high-dimensional $\theta$:

    $$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid f_*) \, p(f_* \mid \mathcal{D}) \, df_*$$

    We know:

    • $p(y_* \mid f_*) = \mathcal{N}(y_* \mid f_*, \sigma^2)$ (from $y_* = f_* + \epsilon_*$)
    • $p(f_* \mid \mathcal{D}) = \mathcal{N}(f_* \mid \hat{\mu}_*, \hat{\sigma}^2_*)$

    This integral corresponds to adding two independent Gaussian variables: $f_* \sim \mathcal{N}(\hat{\mu}_*, \hat{\sigma}^2_*)$ and $\epsilon_* \sim \mathcal{N}(0, \sigma^2)$. The sum $y_* = f_* + \epsilon_*$ of two independent Gaussian variables is also Gaussian.

    • Mean of $y_*$: $\mathbb{E}[y_*] = \mathbb{E}[f_*] + \mathbb{E}[\epsilon_*] = \hat{\mu}_* + 0 = \hat{\mu}_*$
    • Variance of $y_*$: $\text{Var}(y_*) = \text{Var}(f_*) + \text{Var}(\epsilon_*) = \hat{\sigma}^2_* + \sigma^2$

    Therefore, the predictive distribution of $y_*$ is:

    $$p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}(y_* \mid \hat{\mu}_*, \sigma^2 + \hat{\sigma}^2_*)$$

    This completes the proof.
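The two steps above can be sketched numerically. The sketch below assumes illustrative stand-in values for the posterior parameters $\hat{\mu}_\theta$, $\hat{\Sigma}_\theta$ from part (a), the noise variance $\sigma^2$, and a feature map $\phi(x) = [1, x]$ — all of these are hypothetical choices, not values from the problem. It computes the closed-form predictive moments and checks them with a Monte Carlo simulation of $y_* = \phi(x_*)^T \theta + \epsilon_*$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over theta from part (a); illustrative stand-ins,
# not tied to any particular dataset.
mu_theta = np.array([0.5, -1.2])           # posterior mean   mu_hat_theta
Sigma_theta = np.array([[0.20, 0.05],
                        [0.05, 0.10]])     # posterior cov    Sigma_hat_theta
sigma2 = 0.25                              # noise variance   sigma^2 (assumed)

def phi(x):
    """Assumed feature map phi(x) = [1, x] (for illustration only)."""
    return np.array([1.0, x])

x_star = 2.0
phi_star = phi(x_star)

# Step 1: latent predictive moments of f_* | x_*, D.
mu_star = phi_star @ mu_theta                  # phi(x_*)^T mu_hat_theta
var_star = phi_star @ Sigma_theta @ phi_star   # phi(x_*)^T Sigma_hat_theta phi(x_*)

# Step 2: output predictive moments of y_* | x_*, D (add the noise variance).
var_y = var_star + sigma2

# Monte Carlo check: sample theta from the posterior, form y_* = phi^T theta + eps,
# and compare empirical moments against the closed-form ones.
n = 200_000
theta_samples = rng.multivariate_normal(mu_theta, Sigma_theta, size=n)
eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
y_samples = theta_samples @ phi_star + eps

print(f"closed form: mean={mu_star:.3f}, var={var_y:.3f}")
print(f"Monte Carlo: mean={y_samples.mean():.3f}, var={y_samples.var():.3f}")
```

The empirical mean and variance agree with $\hat{\mu}_*$ and $\hat{\sigma}^2_* + \sigma^2$ up to sampling error, illustrating that integrating out $\theta$ (or equivalently $f_*$) just adds the noise variance to the latent predictive variance.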