Answer
Prerequisites
- Maximum A Posteriori (MAP) Estimation
- Likelihood Function for Linear Regression
- Laplace Distribution
- Gaussian Distribution
Step-by-Step Derivation
- MAP Estimation Framework: In MAP estimation, we seek the parameters $\mathbf{w}$ that maximize the posterior distribution:

  $$p(\mathbf{w} \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{t})}$$

  Since $p(\mathbf{t})$ is independent of $\mathbf{w}$, this is equivalent to maximizing the joint probability, which is the product of the likelihood and the prior:

  $$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}}\; p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$$

  Or equivalently, minimizing the negative log-posterior:

  $$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ -\ln p(\mathbf{t} \mid \mathbf{w}) - \ln p(\mathbf{w}) \right]$$
- Likelihood Function: Assuming independent Gaussian noise with unit variance for simplicity (which yields the squared-error term without a scaling variance factor), the likelihood is:

  $$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n),\, 1\right)$$

  Therefore, the negative log-likelihood is:

  $$-\ln p(\mathbf{t} \mid \mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \right)^2 + \text{const}$$
- Prior for LASSO: To reconstruct equation (3.59), the negative log-prior must correspond to the L1 penalty:

  $$-\ln p(\mathbf{w}) = \lambda \sum_{j} |w_j| + \text{const}$$

  This means each weight follows an independent Laplace distribution with location parameter $\mu = 0$ and scale parameter $b = 1/\lambda$:

  $$p(w_j) = \frac{\lambda}{2} \exp\left(-\lambda |w_j|\right)$$

  Thus, the prior distribution assumed by LASSO is the Laplace distribution (or double exponential distribution) centered at zero.
- Plotting and Interpretation: The Gaussian prior (used in ridge regression, L2 regularization) is $p(w_j) \propto \exp\left(-w_j^2 / (2\sigma^2)\right)$. The Laplace prior (used in LASSO, L1 regularization) is $p(w_j) \propto \exp\left(-\lambda |w_j|\right)$.
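The correspondence between this posterior and the LASSO objective can be checked numerically. The sketch below uses a synthetic design matrix, weights, and penalty strength (all illustrative values, not taken from the text) and verifies that the negative log-posterior equals the familiar squared-error-plus-L1 objective up to an additive constant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative values only).
N, D = 50, 3
Phi = rng.normal(size=(N, D))                      # design matrix of basis features
t = Phi @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=N)
lam = 0.7                                          # L1 penalty strength lambda
w = rng.normal(size=D)                             # an arbitrary weight vector

# Negative log-likelihood under unit-variance Gaussian noise.
resid = t - Phi @ w
neg_log_lik = 0.5 * np.sum(resid**2) + 0.5 * N * np.log(2 * np.pi)

# Negative log-prior under independent Laplace(0, 1/lambda) weights.
neg_log_prior = lam * np.sum(np.abs(w)) - D * np.log(lam / 2)

# The negative log-posterior matches the LASSO objective up to a constant
# that does not depend on w.
lasso_objective = 0.5 * np.sum(resid**2) + lam * np.sum(np.abs(w))
constant = 0.5 * N * np.log(2 * np.pi) - D * np.log(lam / 2)
assert np.isclose(neg_log_lik + neg_log_prior, lasso_objective + constant)
```

Because the constant is independent of $\mathbf{w}$, minimizing the negative log-posterior and minimizing the LASSO objective yield the same $\mathbf{w}_{\text{MAP}}$.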
```mermaid
graph TD
    subgraph Prior Distributions
        direction LR
        G[Gaussian Prior: Bell-shaped, rounded peak at 0]
        L[Laplace Prior: Sharp peak at 0, heavy tails]
    end
```

Plot description: A Gaussian distribution has a smooth, rounded bell shape around 0. A Laplace distribution has a sharp, characteristic "tent" shape that peaks exactly at 0.
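The difference in shape at the origin can be made concrete with a one-sided finite difference on the two negative log-priors (a small sketch with illustrative parameter values $\lambda = \sigma = 1$): the Laplace penalty keeps a constant slope $\lambda$ arbitrarily close to zero, while the Gaussian penalty's slope vanishes.

```python
lam, sigma = 1.0, 1.0
eps = 1e-6

def laplace_nlp(w):
    return lam * abs(w)              # negative log Laplace prior, up to a constant

def gauss_nlp(w):
    return w**2 / (2 * sigma**2)     # negative log Gaussian prior, up to a constant

# One-sided slopes at the origin: Laplace retains a constant pull of lambda,
# while the Gaussian pull goes to zero as w -> 0.
laplace_slope = (laplace_nlp(eps) - laplace_nlp(0.0)) / eps
gauss_slope = (gauss_nlp(eps) - gauss_nlp(0.0)) / eps

print(laplace_slope)  # 1.0 (constant pull of lambda at the origin)
print(gauss_slope)    # ~5e-07 (vanishing pull at the origin)
```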
Explanation for Sparsity: The Laplace prior has a sharp peak (its density is non-differentiable) exactly at $w_j = 0$, so its negative log contributes a constant-magnitude gradient $\lambda$ pulling the weight toward zero no matter how small $|w_j|$ is. The smooth Gaussian prior, by contrast, is flat at the origin, and its pull toward zero vanishes as $w_j \to 0$. When we compute the MAP estimate, the sharp peak of the Laplace prior therefore drives a parameter exactly to zero unless the data likelihood pulls it away strongly enough. Consequently, LASSO naturally performs feature selection by setting many weights exactly to zero.
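For a single weight with an orthonormal feature (so the least-squares fit reduces to an estimate $z$), the LASSO MAP solution has the well-known soft-thresholding closed form, while the ridge (Gaussian-prior) solution only shrinks. The sketch below contrasts the two on illustrative values of $z$ and $\lambda$:

```python
import numpy as np

def lasso_map_1d(z, lam):
    """Soft-threshold minimizer of 0.5*(z - w)**2 + lam*|w| (Laplace prior)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_map_1d(z, lam):
    """Minimizer of 0.5*(z - w)**2 + 0.5*lam*w**2 (Gaussian prior)."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])  # illustrative least-squares estimates
lam = 0.6

print(lasso_map_1d(z, lam))  # the three entries with |z| <= lam become exactly 0
print(ridge_map_1d(z, lam))  # every entry shrinks, but none becomes exactly 0
```

Every estimate whose magnitude falls below $\lambda$ is set exactly to zero under the Laplace prior, which is precisely the feature-selection behavior described above.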