Prerequisites

  • Expectation-Maximization (EM) Algorithm: An iterative method to find maximum likelihood estimates of parameters in statistical models, particularly when the model depends on unobserved latent variables.
  • Poisson Distribution: A discrete probability distribution expressing the probability of a given number of events occurring in a fixed interval.
  • Complete-Data Log-Likelihood: The log-likelihood function constructed by aggregating both observed variables and unobserved latent variables.

Step-by-Step Derivation

Let $X = \{x_1, \dots, x_n\}$ be our set of observations, where each $x_i \in \{0, 1, 2, \dots\}$. Let $Z = \{z_1, \dots, z_n\}$ be the unobserved latent variables (cluster assignments), where $z_i \in \{1, \dots, K\}$ indicates which Poisson component generated $x_i$. The probabilities of the latent variables are defined by the mixing weights: $p(z_i = j) = \pi_j$, with $\sum_{j=1}^K \pi_j = 1$.

The probability of an observation $x_i$ given that it comes from component $j$ is:

$$p(x_i \mid z_i = j, \theta) = \frac{e^{-\lambda_j} \lambda_j^{x_i}}{x_i!}$$

The complete-data log-likelihood for one observation $(x_i, z_i)$ is:

$$\log p(x_i, z_i = j \mid \theta) = \log\big(p(z_i = j)\, p(x_i \mid z_i = j, \theta)\big) = \log \pi_j - \lambda_j + x_i \log \lambda_j - \log(x_i!)$$

Hence, the total complete-data log-likelihood for all $n$ samples, typically written using indicator functions $\mathbb{I}(z_i = j)$, is:

$$\mathcal{L}_c(\theta) = \sum_{i=1}^n \sum_{j=1}^K \mathbb{I}(z_i = j) \left( \log \pi_j - \lambda_j + x_i \log \lambda_j - \log(x_i!) \right)$$
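If the cluster assignments were actually observed, this quantity could be evaluated directly. A minimal NumPy sketch (the function name `complete_data_loglik` and its signature are our illustrative choices, not part of the derivation):

```python
import numpy as np
from math import lgamma

def complete_data_loglik(x, z, pi, lam):
    """L_c = sum_i [ log pi_{z_i} - lam_{z_i} + x_i log lam_{z_i} - log(x_i!) ]."""
    # log(x_i!) via the log-gamma function: log(x!) = lgamma(x + 1)
    log_fact = np.array([lgamma(xi + 1) for xi in x])
    return np.sum(np.log(pi[z]) - lam[z] + x * np.log(lam[z]) - log_fact)
```

In the EM setting $z$ is unobserved, which is exactly why the E-step below replaces the indicators with their expectations.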

1. The E-step (Expectation)

In the E-step, we compute the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables $Z$, given the current parameter estimates $\theta^{(t)}$.

This reduces to taking the expectation of the indicator variables $\mathbb{I}(z_i = j)$, which yields the posterior responsibilities $\gamma_{ij}$:

$$\gamma_{ij} = p(z_i = j \mid x_i, \theta^{(t)}) = \frac{p(z_i = j \mid \theta^{(t)})\, p(x_i \mid z_i = j, \theta^{(t)})}{\sum_{m=1}^K p(z_i = m \mid \theta^{(t)})\, p(x_i \mid z_i = m, \theta^{(t)})}$$

Substituting the Poisson density (the $x_i!$ factor cancels between numerator and denominator):

$$\gamma_{ij}^{(t)} = \frac{\pi_j^{(t)} \frac{e^{-\lambda_j^{(t)}} (\lambda_j^{(t)})^{x_i}}{x_i!}}{\sum_{m=1}^K \pi_m^{(t)} \frac{e^{-\lambda_m^{(t)}} (\lambda_m^{(t)})^{x_i}}{x_i!}} = \frac{\pi_j^{(t)} e^{-\lambda_j^{(t)}} (\lambda_j^{(t)})^{x_i}}{\sum_{m=1}^K \pi_m^{(t)} e^{-\lambda_m^{(t)}} (\lambda_m^{(t)})^{x_i}}$$
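The responsibility computation can be sketched in NumPy as follows (a minimal illustration; the function name `e_step` and the log-space trick are our additions, and the $x_i!$ term is omitted because it cancels in the ratio):

```python
import numpy as np

def e_step(x, pi, lam):
    """Responsibilities gamma[i, j] = p(z_i = j | x_i, theta^(t))."""
    # Work in log space: log pi_j - lam_j + x_i log lam_j, shape (n, K).
    log_w = np.log(pi) - lam + np.outer(x, np.log(lam))
    log_w -= log_w.max(axis=1, keepdims=True)   # guard against underflow
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)     # normalize over components
```

Each row of the returned matrix sums to 1, since for every $i$ the responsibilities $\gamma_{ij}$ form a distribution over the $K$ components.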

We arrive at the auxiliary function $Q(\theta, \theta^{(t)})$ to be maximized in the next step:

$$Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{j=1}^K \gamma_{ij}^{(t)} \left( \log \pi_j - \lambda_j + x_i \log \lambda_j - \log(x_i!) \right)$$

2. The M-step (Maximization)

In the M-step, we maximize $Q(\theta, \theta^{(t)})$ with respect to the parameters $\theta = (\pi, \lambda)$.

Maximizing for $\pi_j$ (mixing proportions): We maximize using a Lagrange multiplier to enforce the constraint $\sum_{j=1}^K \pi_j = 1$:

$$L(\pi, \alpha) = \sum_{i=1}^n \sum_{j=1}^K \gamma_{ij}^{(t)} \log \pi_j + \alpha \left(1 - \sum_{j=1}^K \pi_j \right)$$

Taking the derivative w.r.t. $\pi_j$ and equating it to zero:

$$\frac{\partial}{\partial \pi_j} L(\pi, \alpha) = \frac{\sum_{i=1}^n \gamma_{ij}^{(t)}}{\pi_j} - \alpha = 0 \implies \pi_j = \frac{1}{\alpha} \sum_{i=1}^n \gamma_{ij}^{(t)}$$

Summing both sides over $j$ and using $\sum_{j=1}^K \pi_j = 1$ together with $\sum_{j=1}^K \gamma_{ij}^{(t)} = 1$ gives $1 = \frac{1}{\alpha} \sum_{i=1}^n \sum_{j=1}^K \gamma_{ij}^{(t)} = \frac{n}{\alpha}$, so $\alpha = n$. Let $N_j = \sum_{i=1}^n \gamma_{ij}^{(t)}$ be the expected number of samples assigned to component $j$. The update rule is:

$$\pi_j^{(t+1)} = \frac{N_j}{n}$$

Maximizing for $\lambda_j$ (Poisson rates): We collect the terms of the $Q$ function containing $\lambda_j$ and take the derivative w.r.t. $\lambda_j$:

$$\frac{\partial Q}{\partial \lambda_j} = \sum_{i=1}^n \gamma_{ij}^{(t)} \left( -1 + \frac{x_i}{\lambda_j} \right) = 0$$

Rearranging the expression to solve for $\lambda_j$:

$$\sum_{i=1}^n \gamma_{ij}^{(t)} = \sum_{i=1}^n \gamma_{ij}^{(t)} \frac{x_i}{\lambda_j}$$

$$N_j = \frac{1}{\lambda_j} \sum_{i=1}^n \gamma_{ij}^{(t)} x_i \implies \lambda_j^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{ij}^{(t)} x_i}{N_j}$$
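Both closed-form updates can be written in a few lines of NumPy (a sketch; the function name `m_step` is our choice, and `gamma` is assumed to be the $n \times K$ responsibility matrix from the E-step):

```python
import numpy as np

def m_step(x, gamma):
    """Closed-form M-step updates for the mixing weights and Poisson rates."""
    n = x.shape[0]
    N = gamma.sum(axis=0)   # N_j = sum_i gamma_ij, expected count per component
    pi = N / n              # pi_j^(t+1) = N_j / n
    lam = gamma.T @ x / N   # lambda_j^(t+1) = responsibility-weighted mean of x
    return pi, lam
```

With hard (0/1) responsibilities this reduces to computing the sample fraction and the plain within-cluster mean for each component.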

Relation to MLE for a single Poisson

For a standard ML estimate of a single Poisson distribution (Problem 2.1), the update rule is $\lambda = \frac{1}{n} \sum_{i=1}^n x_i$ (the arithmetic mean of the data).

In the M-step here, the update rule $\lambda_j^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{ij}^{(t)} x_i}{\sum_{i=1}^n \gamma_{ij}^{(t)}}$ is a weighted ML estimate. Instead of every point contributing equally (weight 1), each observation $x_i$ contributes to component $j$ only fractionally, in proportion to its responsibility $\gamma_{ij}$. The sum of weights $N_j$ takes the place of the total sample count $n$.
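A quick numerical sanity check of this correspondence: when every responsibility equals 1 (a single component owning all the data), the weighted estimate collapses to the ordinary arithmetic mean. The data values below are arbitrary:

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0, 7.0])
gamma = np.ones(len(x))                  # full responsibility: every weight is 1
lam = (gamma * x).sum() / gamma.sum()    # weighted ML estimate
# With unit weights this is exactly x.mean(), the single-Poisson MLE.
```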