Answer
Prerequisite Knowledge
- MAP Estimation: Maximum A Posteriori estimation finds parameters that maximize the posterior probability: $\hat{w}_{\text{MAP}} = \arg\max_w p(w \mid \mathcal{D}) = \arg\max_w p(\mathcal{D} \mid w)\, p(w)$. This is equivalent to minimizing $-\log p(\mathcal{D} \mid w) - \log p(w)$.
- Linear Regression Likelihood: Assuming Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ in $y_i = w^\top x_i + \epsilon_i$, the negative log-likelihood is proportional to the squared error $\sum_{i=1}^{n} (y_i - w^\top x_i)^2$.
- Laplace Distribution: A probability distribution defined as $p(z \mid \mu, b) = \frac{1}{2b} \exp\!\left(-\frac{|z - \mu|}{b}\right)$.
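
For the Laplace distribution in particular, taking the negative log turns the density into an absolute-value penalty; this is the key fact used in the steps below (a one-line sketch using the notation above):

$$
-\log p(z \mid \mu, b) = \frac{|z - \mu|}{b} + \log(2b).
$$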
Step-by-step Answer
- Likelihood Term: The first term in the minimization is $\sum_{i=1}^{n} (y_i - w^\top x_i)^2$. This corresponds to the negative log-likelihood of the data, assuming the target values are generated with Gaussian noise: $y_i = w^\top x_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Specifically, $-\log p(y \mid X, w) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \text{const}$.
- Prior Term: The second term is $\lambda \|w\|_1 = \lambda \sum_{j} |w_j|$. We want this to correspond to the negative log-prior: $-\log p(w) = \lambda \sum_{j} |w_j| + \text{const}$. So, $p(w) \propto \exp(-\lambda \|w\|_1)$. This separates into independent priors for each weight: $p(w) = \prod_{j} p(w_j)$, where $p(w_j) \propto \exp(-\lambda |w_j|)$. This implies $p(w_j) = \frac{\lambda}{2} \exp(-\lambda |w_j|)$.
- Identify the Distribution: The distribution $p(w_j) = \frac{\lambda}{2} \exp(-\lambda |w_j|)$ is the Laplace distribution (or Double Exponential distribution) centered at 0 with scale parameter $b = 1/\lambda$, i.e., the scale is inversely proportional to the regularization strength $\lambda$. So, LASSO assumes a Laplacian prior on the weights (the matching of constants, including the noise variance, is written out after this list).
- Plot Comparison:
  - Gaussian Prior (L2 regularization): $p(w_j) \propto \exp\!\left(-\frac{w_j^2}{2\tau^2}\right)$. This is a bell curve, smooth at its peak at 0.
  - Laplacian Prior (L1 regularization): $p(w_j) \propto \exp\!\left(-\frac{|w_j|}{b}\right)$. This has a sharp peak at 0.
  (Ideally, insert a plot here showing a bell curve vs. a sharp peak at 0; a plotting sketch is given after this list.) The Gaussian is flat near 0, meaning it barely distinguishes between 0 and small values such as 0.001. The Laplacian is sharply peaked at 0, so it concentrates much more probability density at and immediately around 0 than at small non-zero values.
- Explanation for Sparsity: Because the Laplace prior has a sharp peak at zero (its derivative is discontinuous there), the mode of the posterior is much more likely to fall exactly at zero. In the log-domain, the penalty $\lambda |w_j|$ has a constant gradient magnitude $\lambda$ even as $w_j \to 0$. This constant "pull" can force the optimal weight to be exactly zero. In contrast, the squared penalty $\lambda w_j^2$ has gradient $2\lambda w_j$, which vanishes as $w_j \to 0$. As the weight gets small, the pull towards zero becomes negligible, so the weight rarely settles exactly at zero. (A small numerical check of this argument is sketched below.)
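
Putting the likelihood and prior terms together makes the correspondence explicit. A sketch of the matching, assuming i.i.d. Gaussian noise with variance $\sigma^2$ and an independent Laplace$(0, b)$ prior on each weight:

$$
-\log p(w \mid y, X) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \frac{1}{b} \sum_{j} |w_j| + \text{const}.
$$

Multiplying by the positive constant $2\sigma^2$ does not change the minimizer, so MAP estimation under this model is exactly the LASSO problem $\min_w \sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_1$ with $\lambda = 2\sigma^2 / b$.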
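
For the plot comparison above, a minimal matplotlib sketch that overlays the two prior densities (the specific scales $\tau = 1$ and $b = 1/\sqrt{2}$ are arbitrary choices made here so that both distributions have equal variance):

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of weight values on which to evaluate both prior densities.
w = np.linspace(-4, 4, 1000)

# Gaussian prior (L2 regularization): smooth bell curve, variance tau^2 = 1.
tau = 1.0
gaussian = np.exp(-w**2 / (2 * tau**2)) / np.sqrt(2 * np.pi * tau**2)

# Laplace prior (L1 regularization): sharp peak at 0.
# Scale b = 1/sqrt(2) so its variance (2*b^2) also equals 1.
b = 1.0 / np.sqrt(2)
laplace = np.exp(-np.abs(w) / b) / (2 * b)

plt.plot(w, gaussian, label="Gaussian prior (L2)")
plt.plot(w, laplace, label="Laplace prior (L1)")
plt.xlabel("w")
plt.ylabel("p(w)")
plt.title("Smooth bell curve vs. sharp peak at 0")
plt.legend()
plt.show()
```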
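
For the sparsity argument, a small one-dimensional numerical check (the unregularized optimum and the penalty strength below are hypothetical values chosen only for illustration): minimizing a quadratic loss plus an L1 penalty on a fine grid lands exactly at zero once the penalty's constant pull exceeds the loss gradient at zero, whereas the L2 penalty only shrinks the solution.

```python
import numpy as np

# Toy 1-D problem: loss(w) = (w - w_ls)^2, where w_ls is the unregularized
# (least-squares) optimum, plus either an L1 or an L2 penalty of strength lam.
w_ls = 0.3   # hypothetical unregularized optimum
lam = 1.0    # hypothetical regularization strength

w_grid = np.linspace(-2.0, 2.0, 400001)  # fine grid (step 1e-5) that contains 0
loss = (w_grid - w_ls) ** 2

w_l1 = w_grid[np.argmin(loss + lam * np.abs(w_grid))]  # LASSO-style penalty
w_l2 = w_grid[np.argmin(loss + lam * w_grid ** 2)]     # ridge-style penalty

# Soft-thresholding predicts w_l1 = 0 because |w_ls| = 0.3 < lam/2 = 0.5;
# the ridge solution is w_ls / (1 + lam) = 0.15, small but non-zero.
print(f"L1-penalized minimizer: {w_l1:.4f}")
print(f"L2-penalized minimizer: {w_l2:.4f}")
```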