Detailed Explanation
The core concept in this question is connecting the optimization formulation of regularization with Bayesian statistics.
MAP Estimation
In Bayesian estimation, the MAP estimate maximizes the posterior, which by Bayes' rule is proportional to the likelihood times the prior:

$$\hat{w}_{\text{MAP}} = \arg\max_w p(w \mid D) = \arg\max_w p(D \mid w)\, p(w)$$

Taking the negative logarithm turns the product into a sum:

$$\hat{w}_{\text{MAP}} = \arg\min_w \big[ -\log p(D \mid w) - \log p(w) \big]$$

Minimizing a regularized objective of the form $L(w) + \lambda \Omega(w)$ is essentially performing MAP estimation, where $L(w)$ comes from the likelihood (the $-\log p(D \mid w)$ term) and $\lambda \Omega(w)$ comes from the prior (the $-\log p(w)$ term).
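To make the correspondence concrete, here is a minimal numerical sketch (the toy data, `sigma`, `tau`, and the helper `neg_log_posterior` are illustrative assumptions): with a Gaussian likelihood and a zero-mean Gaussian prior, minimizing the negative log posterior recovers the ridge solution with $\lambda = \sigma^2 / \tau^2$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
sigma, tau = 1.0, 2.0                        # noise std and prior std (assumed)
y = X @ w_true + sigma * rng.normal(size=50)

def neg_log_posterior(w):
    # -log p(D|w) - log p(w), dropping terms that do not depend on w
    nll = np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)   # Gaussian likelihood
    nlp = np.sum(w ** 2) / (2 * tau ** 2)               # Gaussian prior
    return nll + nlp

w_map = minimize(neg_log_posterior, np.zeros(3)).x

# Ridge closed form with lambda = sigma^2 / tau^2
lam = sigma ** 2 / tau ** 2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(w_map)     # the two vectors agree up to optimizer tolerance
print(w_ridge)
```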
- L2 Regularization (Ridge) uses the penalty $\|w\|_2^2 = \sum_j w_j^2$, which corresponds to a zero-mean Gaussian prior on the weights.
- L1 Regularization (LASSO) uses the penalty $\|w\|_1 = \sum_j |w_j|$, which corresponds to a zero-mean Laplacian prior on the weights (the sketch after this list contrasts the fitted coefficients of the two).
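As a quick illustration of the practical difference (a sketch assuming scikit-learn; the synthetic data and `alpha` values are arbitrary), Lasso drives the coefficients of irrelevant features to exactly zero, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]               # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically most of the 7 noise features
```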
Why Sparsity?
Visually, if we plot the contours of the likelihood (ellipses) and the constraint region induced by the prior, the solution is where an expanding likelihood contour first touches the constraint region.
- For L2, the constraint region ($\|w\|_2^2 \le t$) is a circle (a sphere in higher dimensions). The likelihood ellipses usually touch the circle at a generic point on its boundary, rarely exactly on an axis (where a weight is 0).
- For L1, the constraint region ($\|w\|_1 \le t$) is a diamond (a cross-polytope in higher dimensions). This shape has corners on the axes, and geometrically the expanding likelihood ellipses are far more likely to hit a corner first than a flat face. At a corner, all coordinates but one are zero, leading to a sparse solution (the plotting sketch below draws both cases).
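The picture described above can be drawn directly; this is a minimal plotting sketch assuming matplotlib and a toy quadratic loss centered off-axis:

```python
import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
loss = (w1 - 1.2) ** 2 + 3 * (w2 - 0.6) ** 2   # toy elliptical loss contours

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, name in zip(axes, ["L2 ball (circle)", "L1 ball (diamond)"]):
    ax.contour(w1, w2, loss, levels=10, colors="gray")            # likelihood ellipses
    if name.startswith("L2"):
        ax.contour(w1, w2, w1 ** 2 + w2 ** 2, levels=[1.0], colors="blue")
    else:
        ax.contour(w1, w2, np.abs(w1) + np.abs(w2), levels=[1.0], colors="red")
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()   # the ellipses tend to meet the diamond at a corner on an axis
```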
Mathematically, the derivative of $|w|$ is $\mathrm{sign}(w)$, which is $+1$ or $-1$; it does not go to 0 as $|w|$ gets small. This constant force pushes small coefficients all the way to 0. The derivative of $w^2$ is $2w$, which goes to 0 as $w$ gets small, so the L2 penalty exerts a diminishing force that shrinks weights but rarely makes them exactly zero.
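The same point shows up in update rules. Below is a sketch (with toy values for `eta`, `lam`, and the weight vector) contrasting a standard proximal-gradient step for the L1 penalty, which soft-thresholds and so snaps small weights to exactly zero, with a plain gradient step for the L2 penalty, which only rescales them:

```python
import numpy as np

def l1_prox_step(w, grad, eta, lam):
    # gradient step on the data loss, then soft-thresholding from the |w| term:
    # prox_{eta*lam*|.|}(v) = sign(v) * max(|v| - eta*lam, 0)
    v = w - eta * grad
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def l2_grad_step(w, grad, eta, lam):
    # the penalty gradient 2*lam*w vanishes as w -> 0, so it only shrinks w
    return w - eta * (grad + 2 * lam * w)

w = np.array([0.05, -0.03, 1.0])
grad = np.zeros(3)   # pretend the data term is already at its minimum
print(l1_prox_step(w, grad, eta=0.1, lam=1.0))  # small weights become exactly 0; 1.0 -> 0.9
print(l2_grad_step(w, grad, eta=0.1, lam=1.0))  # every weight is just scaled by 0.8
```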