Predictive Distribution of the Latent Function $f_*$:
We are asked to find the distribution of the noise-free prediction $f_* = f(x_*, \theta) = \phi(x_*)^\top \theta$ given the data $\mathcal{D}$.
From part (a), we know the posterior of $\theta$ is Gaussian:
$$p(\theta \mid \mathcal{D}) = \mathcal{N}(\theta \mid \hat{\mu}_\theta, \hat{\Sigma}_\theta)$$
Since $f_*$ is a linear combination of the Gaussian random vector $\theta$, $f_*$ is also Gaussian. This follows from the rule that if $x \sim \mathcal{N}(\mu, \Sigma)$, then $Ax \sim \mathcal{N}(A\mu, A\Sigma A^\top)$.
Here, the "matrix" $A$ is the row vector $\phi(x_*)^\top$, and the random vector is $\theta$.
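As a quick numerical sanity check of the affine-transform rule above, the following sketch draws samples of a Gaussian vector and verifies that $Ax$ has mean $A\mu$ and variance $A\Sigma A^\top$. The dimensions and numerical values are illustrative assumptions, not taken from the problem.

```python
# Monte Carlo check: if x ~ N(mu, Sigma), then A x ~ N(A mu, A Sigma A^T).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                        # assumed mean
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])        # assumed covariance
A = np.array([[1.0, 3.0]])                        # plays the role of phi(x_*)^T

samples = rng.multivariate_normal(mu, Sigma, size=200_000)
Ax = samples @ A.T                                # transformed samples

print(Ax.mean(), (A @ mu)[0])                     # sample mean vs. A mu
print(Ax.var(ddof=1), (A @ Sigma @ A.T)[0, 0])    # sample var vs. A Sigma A^T
```

With enough samples, the empirical mean and variance match the closed-form $A\mu$ and $A\Sigma A^\top$ up to Monte Carlo error.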
- Mean:
$$\hat{\mu}_* = \mathbb{E}[f_* \mid x_*, \mathcal{D}] = \mathbb{E}[\phi(x_*)^\top \theta \mid x_*, \mathcal{D}] = \phi(x_*)^\top \mathbb{E}[\theta \mid \mathcal{D}] = \phi(x_*)^\top \hat{\mu}_\theta$$
- Variance:
$$\hat{\sigma}_*^2 = \operatorname{Var}(f_* \mid x_*, \mathcal{D}) = \operatorname{Var}(\phi(x_*)^\top \theta \mid x_*, \mathcal{D}) = \phi(x_*)^\top \operatorname{Var}(\theta \mid \mathcal{D})\, \phi(x_*) = \phi(x_*)^\top \hat{\Sigma}_\theta\, \phi(x_*)$$
Therefore, the predictive distribution for the latent function is:
$$p(f_* \mid x_*, \mathcal{D}) = \mathcal{N}\!\left(f_* \mid \phi(x_*)^\top \hat{\mu}_\theta,\; \phi(x_*)^\top \hat{\Sigma}_\theta\, \phi(x_*)\right) = \mathcal{N}(f_* \mid \hat{\mu}_*, \hat{\sigma}_*^2)$$
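The two formulas above translate directly into code. In the sketch below, the feature map $\phi$ (polynomial features) and the posterior parameters are illustrative assumptions; the predictive mean and variance are just the two quadratic forms from the derivation.

```python
# Sketch: latent predictive mean/variance from a given posterior N(mu_hat, Sigma_hat).
import numpy as np

def phi(x):
    """Assumed feature map phi(x) = [1, x, x^2]."""
    return np.array([1.0, x, x**2])

mu_hat = np.array([0.5, 1.0, -0.2])     # assumed posterior mean of theta
Sigma_hat = np.diag([0.1, 0.05, 0.01])  # assumed posterior covariance of theta

def latent_predictive(x_star):
    """Return (mu_*, sigma_*^2) = (phi^T mu_hat, phi^T Sigma_hat phi)."""
    p = phi(x_star)
    return p @ mu_hat, p @ Sigma_hat @ p

mu_star, var_star = latent_predictive(2.0)
print(mu_star, var_star)  # -> 1.7 and 0.46 for these assumed values
```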
Predictive Distribution of the Output $y_*$:
The observed target $y_*$ includes the observation noise: $y_* = f_* + \epsilon_*$, where $\epsilon_* \sim \mathcal{N}(0, \sigma^2)$.
The question asks us to compute the integral:
$$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
Using the hint, because $y_*$ depends on $\theta$ only through the deterministic mapping $f_* = \phi(x_*)^\top \theta$, we can marginalize over the scalar $f_*$ instead of the high-dimensional $\theta$:
$$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid f_*)\, p(f_* \mid x_*, \mathcal{D})\, df_*$$
We know:
- $p(y_* \mid f_*) = \mathcal{N}(y_* \mid f_*, \sigma^2)$ (from $y_* = f_* + \epsilon_*$)
- $p(f_* \mid x_*, \mathcal{D}) = \mathcal{N}(f_* \mid \hat{\mu}_*, \hat{\sigma}_*^2)$
This integral amounts to adding two independent Gaussian variables: $f_* \sim \mathcal{N}(\hat{\mu}_*, \hat{\sigma}_*^2)$ and $\epsilon_* \sim \mathcal{N}(0, \sigma^2)$.
The sum $y_* = f_* + \epsilon_*$ of two independent Gaussian variables is also Gaussian.
- Mean of $y_*$: $\mathbb{E}[y_*] = \mathbb{E}[f_*] + \mathbb{E}[\epsilon_*] = \hat{\mu}_* + 0 = \hat{\mu}_*$
- Variance of $y_*$: $\operatorname{Var}(y_*) = \operatorname{Var}(f_*) + \operatorname{Var}(\epsilon_*) = \hat{\sigma}_*^2 + \sigma^2$
Therefore, the predictive distribution of $y_*$ is:
$$p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}(y_* \mid \hat{\mu}_*, \hat{\sigma}_*^2 + \sigma^2)$$
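The whole marginalization can be checked numerically: sampling $\theta$ from the posterior, forming $y_* = \phi(x_*)^\top \theta + \epsilon_*$, and comparing the empirical moments against the closed form $\mathcal{N}(\hat{\mu}_*, \hat{\sigma}_*^2 + \sigma^2)$. All numerical values below (feature vector, posterior parameters, noise variance) are illustrative assumptions.

```python
# Monte Carlo check of the predictive distribution of y_*.
import numpy as np

rng = np.random.default_rng(1)
phi_star = np.array([1.0, 2.0, 4.0])    # assumed phi(x_*)
mu_hat = np.array([0.5, 1.0, -0.2])     # assumed posterior mean
Sigma_hat = np.diag([0.1, 0.05, 0.01])  # assumed posterior covariance
sigma2 = 0.25                           # assumed observation-noise variance

# Sample theta ~ p(theta | D), then y_* = phi(x_*)^T theta + eps.
theta = rng.multivariate_normal(mu_hat, Sigma_hat, size=300_000)
eps = rng.normal(0.0, np.sqrt(sigma2), size=300_000)
y_star = theta @ phi_star + eps

# Closed-form predictive moments.
mu_star = phi_star @ mu_hat
var_star = phi_star @ Sigma_hat @ phi_star

print(y_star.mean(), mu_star)                  # empirical vs. mu_*
print(y_star.var(ddof=1), var_star + sigma2)   # empirical vs. sigma_*^2 + sigma^2
```

The empirical mean and variance of the sampled $y_*$ agree with $\hat{\mu}_*$ and $\hat{\sigma}_*^2 + \sigma^2$ up to Monte Carlo error, confirming the closed-form result.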
This completes the proof.