Explain

Intuition

The objective we are trying to maximize, $\sum_j N_j \log \pi_j$, behaves like an unnormalized log-likelihood. Imagine you just observed the $j$-th outcome occurring $N_j$ times. To maximize the likelihood of those independent events under a categorical distribution, you want to assign higher probabilities ($\pi_j$) to components that have larger observed counts ($N_j$).

Without any restrictions, this function would shoot off to infinity because we would just make every $\pi_j$ as large as possible.

However, we are constrained by a budget: all the probabilities must sum exactly to 1 ($\sum_j \pi_j = 1$). You can think of this as having exactly $1.0$ (or $100\%$) worth of probability mass to distribute among $K$ different buckets.
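A quick numeric sketch of why the budget matters (the counts here are made up for illustration): if we drop the sum-to-1 constraint, scaling every probability up keeps increasing the objective, so it has no finite maximum.

```python
import math

# Hypothetical counts for K = 3 outcomes (illustrative numbers, not from the text).
N = [5, 3, 2]

def objective(pi):
    """The weighted log-likelihood sum_j N_j * log(pi_j)."""
    return sum(n * math.log(p) for n, p in zip(N, pi))

# Without the sum-to-1 budget, scaling every pi_j up by a factor c keeps
# increasing the objective, so it is unbounded above.
for c in (1.0, 10.0, 100.0):
    print(f"c = {c:6.1f}  objective = {objective([c * 0.1] * 3):.3f}")
```

Each tenfold increase in the scale $c$ adds $(\sum_j N_j) \log 10$ to the objective, which is why the constraint is what makes the problem well-posed.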

How Lagrange Multipliers Help

Lagrange multipliers provide an elegant way to deal with this budget.

By setting up the Lagrangian $L = \sum_j N_j \log \pi_j - \lambda \left( \sum_j \pi_j - 1 \right)$, the parameter $\lambda$ acts as an internal "price" or "exchange rate" that enforces our budget.

  • Setting the derivative $\partial L / \partial \pi_j = N_j / \pi_j - \lambda$ to zero tells us that the optimal allocation is $\pi_j = \frac{N_j}{\lambda}$.
  • This means the probability we assign to bucket $j$ should be directly proportional to its count $N_j$.
  • The multiplier $\lambda$ physically represents the total normalizing constant (the sum of all counts, $\sum_k N_k$) needed to ensure the sum of $\pi_j$ equals 1.
```mermaid
graph TD
A["Objective<br>(Maximize weighted log-probabilities)"] --> B{Budget Constraint}
B -->|Sum of probabilities = 1| C["Lagrange Multiplier Method<br>(Finds 'exchange rate' $\lambda$)"]
C --> D[Result: Proportional Allocation]
D --> E["$$\pi_j = \frac{N_j}{\text{Total } N}$$"]
```
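Putting the derivation together, a minimal sanity check (again with hypothetical counts): compute the proportional allocation $\pi_j = N_j / \sum_k N_k$, confirm the budget is spent exactly, and confirm that shifting mass between buckets while staying feasible only lowers the objective.

```python
import math

# Hypothetical observed counts (illustrative numbers).
N = [12, 7, 1]
lam = sum(N)                   # lambda = total count, the "exchange rate"
pi = [n / lam for n in N]      # proportional allocation pi_j = N_j / lambda

assert abs(sum(pi) - 1.0) < 1e-12   # the budget is spent exactly

def objective(p):
    """The weighted log-likelihood sum_j N_j * log(pi_j)."""
    return sum(n * math.log(x) for n, x in zip(N, p))

# Shift a little mass between two buckets while keeping the sum at 1:
# the objective can only decrease, since pi is the constrained maximum.
eps = 0.01
perturbed = [pi[0] + eps, pi[1] - eps, pi[2]]
print(objective(pi) > objective(perturbed))  # True
```

The objective is strictly concave in $\pi$, so this proportional allocation is the unique constrained maximum, and any feasible perturbation loses likelihood.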

Common Pitfalls

  • Ignoring the multiplier: Sometimes students will differentiate $\sum_j N_j \log \pi_j$ directly, getting $N_j / \pi_j = 0$, which has no solution (the left-hand side is strictly positive for any valid $\pi_j$). You cannot optimize constrained probabilities without accounting for the sum-to-1 constraint.
  • Forgetting that $\lambda$ is a shared constant: In the step where we sum $\sum_j \frac{N_j}{\lambda} = 1$, remember that $\lambda$ has no $j$ subscript. It is the identical scaling factor applied to rescale every probability simultaneously.
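Both pitfalls can be checked numerically (with made-up counts): the unconstrained stationarity condition never holds, while at the constrained optimum the ratio $N_j / \pi_j$ equals the same shared $\lambda$ for every bucket.

```python
N = [5, 3, 2]

# Pitfall 1: the unconstrained condition N_j / pi_j = 0 has no solution,
# because the gradient N_j / pi_j is strictly positive for any 0 < pi_j <= 1.
grads = [n / 0.5 for n in N]          # gradient at an arbitrary pi_j = 0.5
assert all(g > 0 for g in grads)

# Pitfall 2: lambda is ONE shared constant. At the optimum pi_j = N_j / sum(N),
# the ratio N_j / pi_j is the same lambda = sum(N) for every bucket j.
total = sum(N)
pi = [n / total for n in N]
ratios = [n / p for n, p in zip(N, pi)]
print(ratios)  # every entry is lambda = sum(N) = 10
```

If $\lambda$ carried a $j$ subscript, each bucket could invent its own "price" and the budget would never be enforced; the single shared value is what ties all the probabilities together.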