Intuition
The objective we are trying to maximize, $\sum_j N_j \log \pi_j$, behaves like an unnormalized log-likelihood. Imagine you just observed an event happening $N_j$ times for the $j$-th outcome. To maximize the likelihood of those independent events under a categorical distribution, you want to assign higher probabilities ($\pi_j$) to components that have larger observed counts ($N_j$).
Without any restrictions, this function would shoot off to infinity because we would just make every $\pi_j$ as large as possible.
However, we are constrained by a budget: all the probabilities must sum exactly to 1 ($\sum_j \pi_j = 1$). You can think of this as having exactly 1 (or 100%) worth of probability mass to distribute among $k$ different buckets.
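To see why the budget is essential, here is a minimal sketch (the counts and the candidate distribution are hypothetical): inflating every entry of an unconstrained "probability" vector always raises the objective, so without the sum-to-1 constraint the problem is unbounded.

```python
import math

# Hypothetical observed counts N_j for k = 3 outcomes.
counts = [5, 3, 2]

def objective(pi):
    """The weighted log-probability sum: sum_j N_j * log(pi_j)."""
    return sum(n * math.log(p) for n, p in zip(counts, pi))

pi = [0.5, 0.3, 0.2]          # a valid distribution (sums to 1)
scaled = [2 * p for p in pi]  # violates the budget: sums to 2

# Doubling every entry adds log(2) * sum(counts) to the objective,
# so inflating the vector always "improves" the unconstrained problem.
assert objective(scaled) > objective(pi)
```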
How Lagrange Multipliers Help
Lagrange multipliers provide an elegant way to deal with this budget.
By setting up the Lagrangian $\mathcal{L} = \sum_j N_j \log \pi_j + \lambda\left(1 - \sum_j \pi_j\right)$, the parameter $\lambda$ acts as an internal "price" or "exchange rate" that enforces our budget.
- The derivative condition $\frac{\partial \mathcal{L}}{\partial \pi_j} = \frac{N_j}{\pi_j} - \lambda = 0$ tells us that the optimal allocation is $\pi_j = \frac{N_j}{\lambda}$.
- This means the probability we assign to bucket $j$ should be directly proportional to its count $N_j$.
- The multiplier $\lambda$ turns out to be the total normalizing constant: summing $\pi_j = N_j/\lambda$ over $j$ and imposing $\sum_j \pi_j = 1$ gives $\lambda = \sum_j N_j = N$, the total count, which ensures the $\pi_j$ sum to 1.
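The closed-form answer $\pi_j = N_j / N$ can be sanity-checked numerically. This sketch (with hypothetical counts) verifies that it is a valid distribution and that no randomly drawn alternative distribution achieves a higher log-likelihood:

```python
import math
import random

counts = [5, 3, 2]          # hypothetical counts N_j
N = sum(counts)

def log_likelihood(pi):
    """sum_j N_j * log(pi_j) for a candidate distribution pi."""
    return sum(n * math.log(p) for n, p in zip(counts, pi))

# Closed-form optimum from the Lagrangian: pi_j = N_j / N.
pi_star = [n / N for n in counts]
assert abs(sum(pi_star) - 1.0) < 1e-12

# Any other valid distribution should score no better.
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in counts]
    other = [r / sum(raw) for r in raw]
    assert log_likelihood(other) <= log_likelihood(pi_star) + 1e-12
```

The inequality holds for every candidate because the proportional allocation is the global maximizer on the simplex (a consequence of Gibbs' inequality).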
```mermaid
graph TD
A["Objective<br>(Maximize weighted log-probabilities)"] --> B{Budget Constraint}
B -->|Sum of probabilities = 1| C["Lagrange Multiplier Method<br>(Finds 'exchange rate' $\lambda$)"]
C --> D[Result: Proportional Allocation]
D --> E["$$\pi_j = \frac{N_j}{\text{Total } N}$$"]
```
Common Pitfalls
- Ignoring the multiplier: Sometimes students differentiate $\sum_j N_j \log \pi_j$ directly, getting $\frac{N_j}{\pi_j} = 0$, which has no solution since $N_j/\pi_j > 0$. You cannot optimize constrained probabilities without accounting for the sum-to-1 constraint.
- Forgetting that $\lambda$ is a shared constant: In the step where we sum $\pi_j = N_j/\lambda$ over $j$, remember that $\lambda$ has no subscript $j$. It is the identical scaling factor applied to every probability simultaneously.
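The shared-constant point can be checked directly (hypothetical counts again): at the optimum, the stationarity ratio $N_j/\pi_j$ comes out to the same value $\lambda$ for every bucket, and that common value is the total count $N$.

```python
counts = [5, 3, 2]              # hypothetical counts N_j
N = sum(counts)
pi = [n / N for n in counts]    # the proportional optimum

# Stationarity says N_j / pi_j = lambda, with ONE lambda shared by all j.
ratios = [n / p for n, p in zip(counts, pi)]

# Every ratio equals the total count N, confirming lambda = N.
assert all(abs(r - N) < 1e-9 for r in ratios)
```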