
Explanation

Ideally, we would just maximize $\sum_j N_j \log \pi_j$ independently for each $\pi_j$. However, we have a constraint: we are modeling probabilities, so the sum of all $\pi_j$ must be exactly 1. If we simply increased one $\pi_j$ to maximize the log-likelihood without penalty, we might break this rule (e.g., the probabilities might sum to more than 1).

The Lagrange Multiplier method introduces a new variable $\lambda$ (lambda) to enforce this "sum-to-1" rule.
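Concretely, the constrained objective can be written as a single Lagrangian function (a standard construction; the sign convention here is chosen to match the derivative below):

```latex
\mathcal{L}(\pi, \lambda) \;=\; \sum_j N_j \log \pi_j \;+\; \lambda \Big( \sum_j \pi_j - 1 \Big)
```

Maximizing $\mathcal{L}$ over the $\pi_j$ with $\lambda$ free recovers both the stationarity condition and the constraint.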

  1. Recall the formula: The log-likelihood of a multinomial distribution involves the term $\sum_j N_j \log \pi_j$.
  2. The Gradient: We want to follow the slope (gradient) of this function to find the top (maximum).
  3. The Penalty: The term $\lambda(\sum_j \pi_j - 1)$ acts as a balancing force.
    • Taking the derivative with respect to $\pi_j$ and setting it to zero gives $N_j/\pi_j + \lambda = 0$.
    • This implies $\pi_j$ is proportional to $N_j$ (specifically, $\pi_j = -N_j/\lambda$).
  4. Normalization: Since all $\pi_j$ must sum to 1, and each $\pi_j$ is proportional to $N_j$, the constant of proportionality is fixed by the constraint.
    • Summing $\pi_j = -N_j/\lambda$ over $j$ and setting the sum to 1 gives $-\lambda = \sum_k N_k$.
    • Therefore, each $\pi_j$ is simply its observed count $N_j$ divided by the total count $\sum_k N_k$.
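The steps above can be sketched in a few lines of Python. The counts here are hypothetical; the point is that the closed-form estimate $\pi_j = N_j / \sum_k N_k$ beats any other valid probability vector on the log-likelihood:

```python
import math

def multinomial_mle(counts):
    """MLE for multinomial probabilities: pi_j = N_j / sum_k N_k."""
    total = sum(counts)
    return [n / total for n in counts]

def log_likelihood(counts, probs):
    """sum_j N_j * log(pi_j), ignoring the constant multinomial coefficient."""
    return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)

counts = [30, 50, 20]               # hypothetical observed counts N_j
pi_hat = multinomial_mle(counts)    # [0.3, 0.5, 0.2]

# Any other probability vector summing to 1 yields a lower log-likelihood:
alternative = [0.25, 0.5, 0.25]
assert log_likelihood(counts, pi_hat) > log_likelihood(counts, alternative)
```

The `if n > 0` guard simply skips categories with zero counts, whose terms contribute nothing to the sum (and would otherwise force `log(0)` if their estimated probability were zero).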

This result is intuitive: the Maximum Likelihood Estimate (MLE) for the probability of a category is just the fraction of times that category was observed ($N_j$) out of the total observations ($\sum_k N_k$).
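As a sanity check on this intuition, a brute-force scan for two categories (with hypothetical counts 7 and 3) confirms the log-likelihood $N_1 \log p + N_2 \log(1-p)$ peaks at $p = N_1/(N_1+N_2)$:

```python
import math

N1, N2 = 7, 3   # hypothetical counts for a two-category example
# Scan candidate probabilities on a fine grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: N1 * math.log(p) + N2 * math.log(1 - p))
assert abs(best_p - N1 / (N1 + N2)) < 1e-3   # peak sits at 0.7
```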