Detailed Explanation

This section creates the bridge between Probabilistic Modeling (Bayesian regression) and Optimization-based Machine Learning (Ridge Regression).

1. The Regularization Parameter $\lambda$

We found that $\lambda = \frac{\sigma^2}{\alpha}$.

  • $\sigma^2$: Noise variance. High noise means we trust the data less.
  • $\alpha$: Prior variance. High variance means a "flat" or weak prior.
  • Interpretation:
    • If the noise $\sigma^2$ is high, $\lambda$ increases. We rely more on the prior (regularization) because the data is noisy.
    • If the prior variance $\alpha$ is high, $\lambda$ decreases. We rely less on the prior because it is uninformative.
    • So, $\lambda$ balances the trade-off between fitting the data and keeping the parameters small (see the small numerical sketch below).
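
A minimal numerical sketch of this trade-off (the function name and values are illustrative, not from the source):

```python
# Sketch: how lambda = sigma^2 / alpha responds to noise and prior strength.
def ridge_lambda(noise_var, prior_var):
    """Regularization strength implied by the Bayesian derivation."""
    return noise_var / prior_var

# Noisy data, moderate prior -> large lambda: lean on the regularizer.
print(ridge_lambda(noise_var=4.0, prior_var=1.0))    # 4.0
# Clean data, broad (weak) prior -> small lambda: lean on the data.
print(ridge_lambda(noise_var=0.1, prior_var=10.0))   # 0.01
```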

2. Ridge Regression / Tikhonov Regularization

The optimization problem $\operatorname{argmin}_\theta \|y - \Phi^T \theta\|^2 + \lambda \|\theta\|^2$ has two parts (sketched in code after the list):

  1. Data Fidelity: $\|y - \Phi^T \theta\|^2$. Try to predict $y$ well.
  2. Regularization: $\lambda \|\theta\|^2$. Try to keep the weights $\theta$ small (close to 0).
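
As a rough sketch, the objective is just the sum of these two terms (the function below is an illustrative assumption, not code from the source):

```python
import numpy as np

# Sketch of the ridge objective: data fidelity + L2 penalty.
def ridge_objective(theta, Phi, y, lam):
    data_fidelity  = np.sum((y - Phi.T @ theta) ** 2)  # ||y - Phi^T theta||^2
    regularization = lam * np.sum(theta ** 2)          # lambda ||theta||^2
    return data_fidelity + regularization
```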

The Bayesian derivation provides a justification for why we add the $\lambda \|\theta\|^2$ term: it corresponds mathematically to assuming a Gaussian prior on the weights.
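
Concretely, under a Gaussian likelihood with noise variance $\sigma^2$ and an isotropic Gaussian prior with variance $\alpha$, the negative log-posterior is, up to additive constants,

$$
-\log p(\theta \mid y) \;\propto\; \frac{1}{2\sigma^2}\,\|y - \Phi^T \theta\|^2 + \frac{1}{2\alpha}\,\|\theta\|^2,
$$

and multiplying by $2\sigma^2$ (which does not change the minimizer) recovers the Ridge objective with $\lambda = \frac{\sigma^2}{\alpha}$.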

Isotropic Gaussian prior $\iff$ $L_2$ regularization (Ridge).
Laplace prior $\iff$ $L_1$ regularization (Lasso).
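
A small sketch of why this correspondence holds: the negative log-density of each prior has exactly the shape of the matching penalty (constants dropped; the values below are illustrative):

```python
import numpy as np

# Negative log-priors (up to additive constants) give the penalty shapes.
theta = np.array([-2.0, 0.5, 1.5])
alpha = 1.0  # Gaussian variance / Laplace scale, illustrative

neg_log_gaussian = np.sum(theta ** 2) / (2 * alpha)  # quadratic -> L2 (Ridge)
neg_log_laplace  = np.sum(np.abs(theta)) / alpha     # absolute  -> L1 (Lasso)

print(neg_log_gaussian, neg_log_laplace)
```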

3. Numerical Advantage

The matrix $\Phi \Phi^T$ is often singular or close to singular. Adding $\lambda I$ adds the positive number $\lambda$ to each diagonal element, which shifts every eigenvalue of $\Phi \Phi^T$ up by $\lambda$. This ensures the eigenvalues are all strictly positive, making the matrix invertible and the solution numerically stable.
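
A short numerical sketch of this effect (the random data and variable names are assumptions for illustration): duplicating a row makes $\Phi \Phi^T$ singular, and adding $\lambda I$ lifts its smallest eigenvalue to at least $\lambda$, so the regularized solution $(\Phi \Phi^T + \lambda I)^{-1} \Phi y$ can be computed stably.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
Phi = rng.normal(size=(d, n))   # feature matrix, one column per sample
Phi[-1] = Phi[0]                # duplicate a row -> Phi @ Phi.T is singular
y = rng.normal(size=n)
lam = 0.5

A = Phi @ Phi.T
print(np.linalg.eigvalsh(A).min())                    # ~ 0: (near-)singular
print(np.linalg.eigvalsh(A + lam * np.eye(d)).min())  # >= lam: invertible

# Regularized normal equations: theta = (Phi Phi^T + lam I)^{-1} Phi y
theta_ridge = np.linalg.solve(A + lam * np.eye(d), Phi @ y)
print(theta_ridge)
```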