Detailed Explanation
This section creates the bridge between Probabilistic Modeling (Bayesian regression) and Optimization-based Machine Learning (Ridge Regression).
1. The Regularization Parameter
We found that $\lambda = \frac{\sigma^2}{\tau^2}$.
- $\sigma^2$: Noise variance. High noise means we trust the data less.
- $\tau^2$: Prior variance. High variance means a "flat" or weak prior.
- Interpretation:
- If the noise $\sigma^2$ is high, $\lambda$ increases. We rely more on the prior (regularization) because the data is noisy.
- If the prior variance $\tau^2$ is high, $\lambda$ decreases. We rely less on the prior because it is uninformative.
- So, $\lambda$ balances the trade-off between fitting the data and keeping the parameters small; the sketch below checks this correspondence numerically.
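A minimal NumPy sketch, assuming a prior $w \sim \mathcal{N}(0, \tau^2 I)$ and Gaussian noise with variance $\sigma^2$ (the variable names `sigma2`, `tau2` and the data sizes are illustrative assumptions, not notation fixed by the text): the posterior mean computed directly from Bayes' rule coincides with the ridge solution that uses $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data (sizes and variances are assumed values).
n, d = 50, 3
sigma2 = 0.5   # noise variance
tau2 = 2.0     # prior variance on the weights
X = rng.normal(size=(n, d))
w_true = rng.normal(scale=np.sqrt(tau2), size=d)
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior mean written directly from Bayes' rule with prior w ~ N(0, tau2 * I):
#   Sigma_post = (X^T X / sigma2 + I / tau2)^(-1),  mean = Sigma_post X^T y / sigma2
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
posterior_mean = Sigma_post @ X.T @ y / sigma2

# Ridge closed form with lambda = sigma2 / tau2.
lam = sigma2 / tau2
ridge_w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(posterior_mean, ridge_w))  # True: the two estimators agree
```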
2. Ridge Regression / Tikhonov Regularization
The optimization problem has two parts:
- Data Fidelity: $\|y - Xw\|^2$. Try to predict the targets well.
- Regularization: $\lambda \|w\|^2$. Try to keep the weights small (close to 0).
The Bayesian derivation provides a justification for why we add the regularization term: adding $\lambda \|w\|^2$ corresponds mathematically to assuming a Gaussian prior on the weights.
An isotropic Gaussian prior yields $L_2$ regularization (Ridge); a Laplace prior yields $L_1$ regularization (Lasso). The sketch below illustrates this correspondence through the negative log-prior.
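A minimal sketch of why the priors map to these penalties, assuming an isotropic Gaussian prior with variance `tau2` and a factorized Laplace prior with scale `b` (both values are illustrative): taking the negative log of each prior density and dropping constants leaves an $L_2$ and an $L_1$ penalty, respectively.

```python
import numpy as np

# A toy weight vector; tau2 (Gaussian variance) and b (Laplace scale)
# are assumed values used only for illustration.
w = np.array([0.5, -1.2, 2.0])
tau2 = 2.0
b = 1.0

# Negative log of an isotropic Gaussian prior N(0, tau2 * I), constants dropped:
#   -log p(w) = ||w||^2 / (2 * tau2) + const   ->  an L2 (ridge) penalty
neg_log_gaussian = np.sum(w ** 2) / (2 * tau2)

# Negative log of a factorized Laplace prior with scale b, constants dropped:
#   -log p(w) = ||w||_1 / b + const            ->  an L1 (lasso) penalty
neg_log_laplace = np.sum(np.abs(w)) / b

print(neg_log_gaussian, neg_log_laplace)
```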
3. Numerical Advantage
The matrix $X^\top X$ is often singular or close to singular. Adding $\lambda I$ adds a small positive number to each diagonal element; since $X^\top X$ is positive semi-definite, this shifts every eigenvalue up by $\lambda$, so all eigenvalues become strictly positive, $X^\top X + \lambda I$ is invertible, and the solution is stable. The sketch below shows the effect on a deliberately singular design matrix.
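A minimal numerical sketch of this effect, using a duplicated column so that $X^\top X$ is exactly singular (the value of `lam` is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Duplicating a column makes X^T X exactly singular.
x = rng.normal(size=(100, 1))
X = np.hstack([x, x, rng.normal(size=(100, 1))])
y = rng.normal(size=100)

gram = X.T @ X
lam = 0.1  # illustrative regularization strength

print(np.linalg.eigvalsh(gram).min())                    # ~0: (near-)singular
print(np.linalg.eigvalsh(gram + lam * np.eye(3)).min())  # >= lam: strictly positive

# With the diagonal shift the linear system can be solved stably.
w_ridge = np.linalg.solve(gram + lam * np.eye(3), X.T @ y)
print(w_ridge)
```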