Explanation
Ideally, we would just maximize the log-likelihood independently for each parameter $\theta_k$. However, we have a constraint: we are modeling probabilities, so the sum of all $\theta_k$ must be exactly 1. If we simply increased one $\theta_k$ to maximize the log-likelihood without penalty, we might break this rule (e.g., the parameters might sum to more than 1).
The Lagrange multiplier method introduces a new variable $\lambda$ (lambda) to enforce this "sum-to-1" rule.
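Writing $\theta_k$ for the probability of category $k$ and $N_k$ for its observed count (this notation is assumed here), the constrained problem and its Lagrangian can be stated as:

```latex
\max_{\theta}\; \sum_{k} N_k \log \theta_k
\quad \text{subject to} \quad \sum_{k} \theta_k = 1,
\qquad
\mathcal{L}(\theta, \lambda)
  = \sum_{k} N_k \log \theta_k
  + \lambda\Bigl(1 - \sum_{k} \theta_k\Bigr).
```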
- Recall the formula: the log-probability of a multinomial distribution involves terms like $N_k \log \theta_k$, where $N_k$ is the observed count of category $k$.
- The Gradient: we want to follow the slope (gradient) of this function to find the top (maximum).
- The Penalty: the term $\lambda\left(1 - \sum_k \theta_k\right)$ acts as a balancing force.
- When we take the derivative w.r.t. $\theta_k$ and set it to zero, we get $\frac{N_k}{\theta_k} - \lambda = 0$.
- This implies $\theta_k$ is proportional to $N_k$ (specifically, $\theta_k = N_k / \lambda$).
- Normalization: since all $\theta_k$ must sum to 1, and each $\theta_k$ is proportional to $N_k$, the constant of proportionality $\lambda$ must ensure the sum is 1.
- Sum of parts: $\sum_k \theta_k = \sum_k N_k / \lambda = 1$, so $\lambda = \sum_k N_k = N$, the total number of observations.
- Therefore, each $\theta_k$ is simply its observed count divided by the total count: $\hat{\theta}_k = N_k / N$.
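The derivation above can be checked numerically. The sketch below (the names `counts` and `mle` are illustrative, and the data is made up) computes the closed-form answer $\hat{\theta}_k = N_k / N$ and confirms that it scores at least as well on the log-likelihood as random points on the probability simplex:

```python
import numpy as np

# Observed counts N_k for each category (illustrative data).
counts = np.array([3.0, 7.0, 10.0])
N = counts.sum()

# Closed-form MLE: theta_k = N_k / N.
mle = counts / N

def log_likelihood(theta):
    """Multinomial log-likelihood up to an additive constant: sum_k N_k * log(theta_k)."""
    return np.sum(counts * np.log(theta))

# Sanity check: the MLE should beat (or tie) random points on the simplex.
rng = np.random.default_rng(0)
for _ in range(1000):
    candidate = rng.dirichlet(np.ones_like(counts))
    assert log_likelihood(mle) >= log_likelihood(candidate)

print(mle.tolist())  # [0.15, 0.35, 0.5]
```

This is only a spot check, not a proof; the Lagrange argument above is what guarantees the closed form is the global maximum.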
This result is intuitive: the maximum likelihood estimate (MLE) for the probability of a category is just the fraction of times that category was observed ($N_k$) out of the total number of observations ($N$).