Problem 3.8(c) Explanation
Why integrate?
In Bayesian prediction, we don't just pick one "best" value of $\theta$ and predict using that (which is what we do in MLE or MAP). Instead, we consider all possible values of $\theta$, weighted by how likely they are given the data (the posterior).
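In symbols (writing $\tilde{x}$ for the next observation and $D$ for the observed data), the posterior predictive is

$$
p(\tilde{x} = 1 \mid D) = \int_0^1 p(\tilde{x} = 1 \mid \theta)\, p(\theta \mid D)\, d\theta = \int_0^1 \theta\, p(\theta \mid D)\, d\theta = \mathbb{E}[\theta \mid D].
$$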
If the posterior is sharp around 0.7, then values near 0.7 dominate this integral. If the posterior is broad (high uncertainty), the integral averages out the predictions from many different $\theta$'s.
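As a quick sanity check, here is a minimal sketch, assuming SciPy is available and a Beta posterior (which is what this Beta-Bernoulli problem produces); the counts used are made-up illustrations. It integrates $\theta\,p(\theta \mid D)$ numerically and compares the result to the closed-form posterior mean:

```python
from scipy import stats, integrate

# Hypothetical posterior Beta(a + N1, b + N0): uniform Beta(1, 1) prior,
# plus illustrative data of N1 = 7 successes and N0 = 3 failures.
a_post, b_post = 1 + 7, 1 + 3
posterior = stats.beta(a_post, b_post)

# Posterior predictive p(next = 1 | D) = integral of theta * p(theta | D) d theta
predictive, _ = integrate.quad(lambda th: th * posterior.pdf(th), 0.0, 1.0)

# Closed form: the posterior mean a_post / (a_post + b_post)
closed_form = a_post / (a_post + b_post)

print(predictive, closed_form)  # both ~0.6667
```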
Laplace Smoothing
The result is historically famous as Laplace's Rule of Succession. Imagine you see the sun rise for $N$ days in a row ($N_1 = N$, $N_0 = 0$).
- MLE says the probability of the sun rising tomorrow is $N/N = 1$ (100% certainty). This is risky; just because it happened before doesn't logically guarantee it will happen forever.
- The Bayesian estimate with a uniform prior says $p(\text{sunrise tomorrow} \mid D) = \frac{N+1}{N+2}$. It's very close to 1 for large $N$, but never exactly 1. It leaves a tiny probability for the "black swan" event.
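To see where the $\frac{N+1}{N+2}$ figure comes from: a uniform Beta(1, 1) prior combined with $N$ sunrises and 0 non-sunrises gives the posterior Beta($N + 1$, $1$), and the predictive probability is its mean,

$$
p(\text{sunrise tomorrow} \mid D) = \mathbb{E}[\theta \mid D] = \frac{N + 1}{(N + 1) + 1} = \frac{N + 1}{N + 2}.
$$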
Connection to Pseudocounts
The parameters of the Beta prior, $a$ and $b$, can be directly interpreted as pseudocounts.
- Uniform prior: Beta(1, 1).
- Do the pseudocounts really match the parameters directly? Let's check.
- The mean of Beta($a$, $b$) is $\frac{a}{a+b}$.
- The posterior is Beta($N_1 + a$, $N_0 + b$).
- Its mean is $\frac{N_1 + a}{N_1 + N_0 + a + b}$.
- This is exactly what we'd get by starting the tally at $a$ successes and $b$ failures before seeing any data.
- For the uniform prior Beta(1, 1): "virtual samples" count $= a + b = 2$.
- Virtual successes $= a = 1$.
- Virtual failures $= b = 1$.
- So yes, the uniform prior accounts for 1 extra success and 1 extra failure (a numeric sketch follows this list).
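Here is a minimal numeric sketch of that pseudocount view (the counts are made-up illustrations, not values from the problem):

```python
# Pseudocount view: a uniform Beta(1, 1) prior behaves like one "virtual"
# success and one "virtual" failure added to the observed counts.
a, b = 1, 1      # uniform prior Beta(1, 1)
N1, N0 = 9, 0    # hypothetical data: 9 successes, 0 failures

mle = N1 / (N1 + N0)                           # 1.0 -- certain, and risky
posterior_mean = (N1 + a) / (N1 + N0 + a + b)  # (9 + 1) / (9 + 0 + 2) = 10/11 ~ 0.909

print(mle, posterior_mean)
```

Note that `posterior_mean` here is exactly the Rule of Succession value $\frac{N+1}{N+2}$ with $N = 9$.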