Prerequisites
- Mixture Models: Understanding how a probability density can be represented as a weighted sum of component densities.
- Exponential Distribution: The PDF $p(x \mid \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$.
- Maximum Likelihood Estimation (MLE): Finding parameter values that maximize the likelihood function.
- Expectation-Maximization (EM) Algorithm: An iterative method to find MLE or MAP estimates of parameters in statistical models, especially those with latent variables.
- Jensen's Inequality: Used in the derivation of the lower bound for the EM algorithm.
- Lagrange Multipliers: Handling constrained optimization (specifically the constraint $\sum_{j=1}^{K} \pi_j = 1$).
Step-by-Step Derivation
1. Model Setup and Complete Data Likelihood
Let the observed data be $X = \{x_1, \dots, x_n\}$.
We introduce latent variables $Z = \{z_1, \dots, z_n\}$, where $z_i \in \{1, \dots, K\}$ indicates which component generated $x_i$.
Using a 1-of-$K$ coding scheme, let $z_{ij} = 1$ if $x_i$ belongs to component $j$, and $0$ otherwise.
The complete data likelihood is:
$$P(X, Z \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{K} \left[ \pi_j \, p(x_i \mid \lambda_j) \right]^{z_{ij}}$$
The complete data log-likelihood is:
$$\ln P(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{K} z_{ij} \left( \ln \pi_j + \ln\!\left( \lambda_j e^{-\lambda_j x_i} \right) \right)$$
$$\ln P(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{K} z_{ij} \left( \ln \pi_j + \ln \lambda_j - \lambda_j x_i \right)$$
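As a sanity check on the model setup, the generative process can be simulated directly: draw the latent assignment $z_i$ from the mixing weights, then draw $x_i$ from the corresponding exponential component. A minimal NumPy sketch, with hypothetical parameter values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters (K = 2), for illustration only.
pi_true = np.array([0.3, 0.7])    # mixing weights, sum to 1
lam_true = np.array([0.5, 3.0])   # exponential rates lambda_j

n = 5000
# Latent assignments z_i ~ Categorical(pi), then x_i ~ Exp(lambda_{z_i}).
z = rng.choice(2, size=n, p=pi_true)
# NumPy parameterizes the exponential by scale = 1/lambda.
x = rng.exponential(scale=1.0 / lam_true[z])
```

The sample mean should be close to the mixture mean $\sum_j \pi_j / \lambda_j$, which is a quick check that the parameterization (rate vs. scale) is right.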
2. E-step (Expectation)
We calculate the expected value of the latent variables $z_{ij}$ given the current parameters $\theta^{(t)}$. This is the posterior probability (responsibility) that data point $x_i$ was generated by component $j$:
$$\gamma_{ij}^{(t)} = \mathbb{E}\left[ z_{ij} \mid x_i, \theta^{(t)} \right] = P(z_i = j \mid x_i, \theta^{(t)})$$
Using Bayes' theorem:
$$\gamma_{ij}^{(t)} = \frac{\pi_j^{(t)} \, p(x_i \mid \lambda_j^{(t)})}{\sum_{k=1}^{K} \pi_k^{(t)} \, p(x_i \mid \lambda_k^{(t)})}$$
Substituting the exponential density:
$$\gamma_{ij}^{(t)} = \frac{\pi_j^{(t)} \lambda_j^{(t)} e^{-\lambda_j^{(t)} x_i}}{\sum_{k=1}^{K} \pi_k^{(t)} \lambda_k^{(t)} e^{-\lambda_k^{(t)} x_i}}$$
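The E-step formula above can be sketched in NumPy as follows (the function name `e_step` is my own, not from the derivation). Working in log space before normalizing guards against underflow when $\lambda_j x_i$ is large:

```python
import numpy as np

def e_step(x, pi, lam):
    """Responsibilities gamma[i, j] = P(z_i = j | x_i, theta)."""
    # Log of the unnormalized weight: ln(pi_j) + ln(lambda_j) - lambda_j * x_i
    log_w = np.log(pi) + np.log(lam) - np.outer(x, lam)   # shape (n, K)
    log_w -= log_w.max(axis=1, keepdims=True)             # stabilize before exp
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)               # normalize over j
```

Each row of the returned matrix sums to 1, as required of a posterior over components.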
3. M-step (Maximization)
We maximize the expected complete data log-likelihood (the Q-function) with respect to the parameters $\theta = \{\pi_j, \lambda_j\}_{j=1}^{K}$.
The Q-function is:
$$Q(\theta, \theta^{(t)}) = \mathbb{E}_Z\left[ \ln P(X, Z \mid \theta) \mid X, \theta^{(t)} \right] = \sum_{i=1}^{n} \sum_{j=1}^{K} \gamma_{ij}^{(t)} \left( \ln \pi_j + \ln \lambda_j - \lambda_j x_i \right)$$
Optimization w.r.t. $\pi_j$
We need to maximize $\sum_{i=1}^{n} \sum_{j=1}^{K} \gamma_{ij}^{(t)} \ln \pi_j$ subject to the constraint $\sum_{j=1}^{K} \pi_j = 1$.
Using Lagrange multipliers, the Lagrangian is:
$$\mathcal{L}(\pi, \alpha) = \sum_{i=1}^{n} \sum_{j=1}^{K} \gamma_{ij}^{(t)} \ln \pi_j + \alpha \left( \sum_{j=1}^{K} \pi_j - 1 \right)$$
Taking the derivative with respect to $\pi_j$ and setting it to 0:
$$\frac{\partial \mathcal{L}}{\partial \pi_j} = \sum_{i=1}^{n} \frac{\gamma_{ij}^{(t)}}{\pi_j} + \alpha = 0 \implies \pi_j = -\frac{\sum_{i=1}^{n} \gamma_{ij}^{(t)}}{\alpha}$$
Summing over j:
$$\sum_{j=1}^{K} \pi_j = -\frac{\sum_{i=1}^{n} \sum_{j=1}^{K} \gamma_{ij}^{(t)}}{\alpha} \implies 1 = -\frac{n}{\alpha} \implies \alpha = -n$$
Here we use the fact that $\sum_{j=1}^{K} \gamma_{ij}^{(t)} = 1$ for each $i$ (responsibilities are posterior probabilities), so the double sum equals $n$.
Thus:
$$\pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij}^{(t)}$$
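This update is simply the column mean of the responsibility matrix; a one-line NumPy sketch (function name is illustrative):

```python
import numpy as np

def update_pi(gamma):
    """pi_j^{(t+1)} = (1/n) * sum_i gamma[i, j], i.e. the column mean."""
    return gamma.mean(axis=0)
```

Because each row of `gamma` sums to 1, the updated weights automatically satisfy the constraint $\sum_j \pi_j = 1$.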
Optimization w.r.t. $\lambda_j$
We consider the terms in $Q$ involving $\lambda_j$:
$$Q_{\lambda_j} = \sum_{i=1}^{n} \gamma_{ij}^{(t)} \left( \ln \lambda_j - \lambda_j x_i \right)$$
Taking the derivative with respect to $\lambda_j$ and setting it to 0:
$$\frac{\partial Q}{\partial \lambda_j} = \sum_{i=1}^{n} \gamma_{ij}^{(t)} \left( \frac{1}{\lambda_j} - x_i \right) = 0$$
$$\frac{1}{\lambda_j} \sum_{i=1}^{n} \gamma_{ij}^{(t)} = \sum_{i=1}^{n} \gamma_{ij}^{(t)} x_i$$
$$\lambda_j^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{ij}^{(t)}}{\sum_{i=1}^{n} \gamma_{ij}^{(t)} x_i}$$
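A short NumPy sketch of this update (again with an illustrative function name). Note that with a single component and all responsibilities equal to 1 it reduces to the familiar exponential MLE $\hat\lambda = 1/\bar{x}$:

```python
import numpy as np

def update_lam(x, gamma):
    """lambda_j^{(t+1)} = sum_i gamma[i,j] / sum_i gamma[i,j] * x_i."""
    # Numerator: effective number of points assigned to component j.
    # Denominator: responsibility-weighted sum of the data.
    return gamma.sum(axis=0) / (gamma * x[:, None]).sum(axis=0)
```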
Final EM Algorithm
- Initialize: Choose initial values for $\pi_j^{(0)}$ and $\lambda_j^{(0)}$.
- E-step: Calculate responsibilities:
$$\gamma_{ij}^{(t)} = \frac{\pi_j^{(t)} \lambda_j^{(t)} e^{-\lambda_j^{(t)} x_i}}{\sum_{k=1}^{K} \pi_k^{(t)} \lambda_k^{(t)} e^{-\lambda_k^{(t)} x_i}}$$
- M-step: Update parameters:
$$\pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij}^{(t)}, \qquad \lambda_j^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{ij}^{(t)}}{\sum_{i=1}^{n} \gamma_{ij}^{(t)} x_i}$$
- Iterate: Repeat the E-step and M-step until convergence (e.g., until the change in log-likelihood falls below a tolerance).
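Putting the pieces together, the full loop above can be sketched in NumPy as follows. The initialization strategy (uniform weights, rates perturbed around $1/\bar{x}$) and the fixed iteration count are my own choices for illustration, not prescribed by the derivation:

```python
import numpy as np

def em_exponential_mixture(x, K, n_iter=200, seed=0):
    """EM for a K-component exponential mixture, following the steps above."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                                  # uniform weights
    lam = 1.0 / (x.mean() * rng.uniform(0.5, 1.5, size=K))    # perturbed rates
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_w = np.log(pi) + np.log(lam) - np.outer(x, lam)
        log_w -= log_w.max(axis=1, keepdims=True)
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived above.
        pi = gamma.mean(axis=0)
        lam = gamma.sum(axis=0) / (gamma * x[:, None]).sum(axis=0)
    return pi, lam
```

On synthetic data with well-separated rates, the estimates typically recover the generating parameters up to label permutation (EM fixes no component ordering, and the likelihood is invariant to relabeling).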