

Intuition

In classical (frequentist) machine learning, making a prediction is simple: you find one "best" set of weights $\hat{\theta}$, plug in your new data point $x_*$, and spit out a single number $y_* = \phi(x_*)^T \hat{\theta}$.

However, the Bayesian framework acknowledges that we are never 100% sure what the true weights are. We have a whole distribution of possible weights (the posterior $p(\theta \mid \mathcal{D})$).

So, to make a mathematically rigorous prediction, we must ask every single possible model what it thinks the prediction should be, and then take a vote, weighted by how likely each model is. This is what the integral $\int p(y_* \mid x_*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$ does.
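This "weighted vote" can be sketched numerically. The snippet below is a minimal illustration with a conjugate Gaussian linear model; the toy data, prior variance `tau2`, and noise level `sigma2` are all assumptions made for the example. Sampling weights from the posterior and averaging their predictions (the Monte Carlo vote) agrees with the closed-form predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: y = 2x + Gaussian noise (assumed setup for illustration)
X = rng.uniform(-1, 1, size=(20, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.3, size=20)

sigma2 = 0.3 ** 2   # known observation-noise variance (aleatoric)
tau2 = 1.0          # prior variance on the weights (assumed)

# Features phi(x) = [1, x]
Phi = np.hstack([np.ones((20, 1)), X])

# Conjugate Gaussian posterior over weights: N(mu_theta, Sigma_theta)
Sigma_theta = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(2) / tau2)
mu_theta = Sigma_theta @ Phi.T @ y / sigma2

# "Ask every possible model and take a weighted vote": draw many weight
# vectors from the posterior, predict with each, then average.
phi_star = np.array([1.0, 0.5])  # test point x_* = 0.5
thetas = rng.multivariate_normal(mu_theta, Sigma_theta, size=100_000)
mc_preds = thetas @ phi_star

# Closed form: mean phi^T mu_theta, epistemic variance phi^T Sigma_theta phi
print("MC mean:", mc_preds.mean(), " closed form:", phi_star @ mu_theta)
print("MC var :", mc_preds.var(), " closed form:",
      phi_star @ Sigma_theta @ phi_star)
```

Adding the noise variance `sigma2` to the Monte Carlo variance gives the full predictive variance $\sigma^2 + \hat{\sigma}^2_*$; the integral never has to be evaluated symbolically, which is why this sampling view generalizes to models without a closed-form posterior.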

Two Types of Uncertainty

The beauty of the final formula $p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}(y_* \mid \hat{\mu}_*, \sigma^2 + \hat{\sigma}^2_*)$ is that it explicitly breaks down our uncertainty about the future into two separate chunks:

  1. Epistemic Uncertainty ($\hat{\sigma}^2_*$): This is the uncertainty we have because we lack knowledge or data.
    • Notice that $\hat{\sigma}^2_* = \phi(x_*)^T \hat{\Sigma}_\theta \phi(x_*)$ depends on $\hat{\Sigma}_\theta$ (our posterior uncertainty about the weights) and $x_*$.
    • If you ask the model to predict a point $x_*$ that is very similar to the training data, this variance will be small.
    • If you ask the model to predict a point wildly far away from any training data, the models will disagree wildly, and $\hat{\sigma}^2_*$ will skyrocket. This is the model saying, "I don't know, I haven't seen anything like this before!" As we gather more data, this uncertainty shrinks.
  2. Aleatoric Uncertainty ($\sigma^2$): This is the inherent noise in the universe. Even if we had an infinite amount of training data and knew the "true" line perfectly ($f_*$), the actual observed value $y_*$ will still bounce around that line due to random noise $\epsilon$. We can never get rid of this $\sigma^2$ variance, no matter how much data we collect.
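The distance effect on the epistemic term can be checked directly. This sketch (toy data and hyperparameters are assumptions, not from the text) evaluates $\hat{\sigma}^2_* = \phi(x_*)^T \hat{\Sigma}_\theta \phi(x_*)$ near the training inputs and far from them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Training inputs clustered near x = 0 (assumed toy setup)
X = rng.normal(0, 0.5, size=(30, 1))
y = 1.5 * X[:, 0] + rng.normal(0, 0.2, size=30)

sigma2, tau2 = 0.2 ** 2, 1.0  # noise and prior variances (assumed)
Phi = np.hstack([np.ones((30, 1)), X])

# Posterior covariance of the weights
Sigma_theta = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(2) / tau2)

def epistemic_var(x):
    """phi(x)^T Sigma_theta phi(x): variance from not knowing the weights."""
    phi = np.array([1.0, x])
    return phi @ Sigma_theta @ phi

near = epistemic_var(0.0)   # inside the training cloud
far = epistemic_var(10.0)   # far from any training data
print("near:", near, " far:", far)
# The aleatoric sigma2 is added to BOTH; only the epistemic part varies.
```

The epistemic term explodes at $x_* = 10$ while staying tiny at $x_* = 0$, whereas the aleatoric $\sigma^2$ contributes the same constant floor everywhere, exactly the decomposition described above.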