Answer

Prerequisites

  • Expectation-Maximization (EM) Algorithm: An iterative method to find maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables.
  • Maximum Likelihood Estimation (MLE): A method of estimating the parameters of a statistical model given observations.
  • Exponential Distribution: A continuous probability distribution parameterized by a rate parameter $\lambda > 0$. Its Probability Density Function (PDF) is $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$.
  • Lagrange Multipliers: A strategy for finding the local maxima and minima of a function subject to equality constraints.

Step-by-Step Derivation

To derive the EM algorithm for the mixture of exponentials, we model the data as draws from the density $p(x \mid \theta) = \sum_{j=1}^K \pi_j \lambda_j e^{-\lambda_j x}$, where the weights $\pi_j$ sum to one. We introduce a latent variable $z_i \in \{1, \dots, K\}$ for each sample $x_i$, which indicates the component from which $x_i$ was generated. We can represent $z_i$ as a one-hot vector where $z_{ij} = 1$ if $x_i$ belongs to component $j$, and $0$ otherwise.

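Before deriving anything, it can help to see this generative process in code. A minimal NumPy sketch, with illustrative (not fitted) weights and rates:

```python
import numpy as np

# Generative sketch of a K=2 exponential mixture; the weights and rates
# below are illustrative choices, not values from the derivation.
rng = np.random.default_rng(0)
n = 20000
pi = np.array([0.3, 0.7])    # mixture weights, must sum to 1
lam = np.array([0.5, 4.0])   # rate parameters lambda_j

# Draw the latent component z_i, then x_i ~ Exp(lam[z_i]).
z = rng.choice(len(pi), size=n, p=pi)
x = rng.exponential(scale=1.0 / lam[z])  # NumPy parameterizes by scale = 1/lambda

# The sample mean should approach sum_j pi_j / lambda_j = 0.3/0.5 + 0.7/4 = 0.775
print(x.mean())
```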
The complete-data log-likelihood for the dataset $(X, Z) = \{(x_i, z_i)\}_{i=1}^n$ is:

$$
\ell_c(\theta) = \sum_{i=1}^n \log p(x_i, z_i \mid \theta) = \sum_{i=1}^n \sum_{j=1}^K I(z_i = j) \log\left(\pi_j \lambda_j e^{-\lambda_j x_i}\right)
$$

where $I(z_i = j)$ is an indicator function that equals 1 if $z_i = j$ and 0 otherwise.

1. E-Step (Expectation)

In the E-step, we compute the expected value of the latent variables $z_i$ given the data $X$ and the current parameter estimates $\theta^{(t)} = \{\pi_j^{(t)}, \lambda_j^{(t)}\}_{j=1}^K$.

We define the responsibility $\gamma_{ij}$ as the posterior probability that sample $x_i$ was generated by component $j$:

$$
\gamma_{ij} = E\left[I(z_i = j) \mid x_i, \theta^{(t)}\right] = p(z_i = j \mid x_i, \theta^{(t)})
$$

Using Bayes' theorem:

$$
\gamma_{ij} = \frac{p(z_i = j \mid \theta^{(t)})\, p(x_i \mid z_i = j, \theta^{(t)})}{\sum_{m=1}^K p(z_i = m \mid \theta^{(t)})\, p(x_i \mid z_i = m, \theta^{(t)})}
$$

Substituting the given distributions:

$$
\gamma_{ij} = \frac{\pi_j^{(t)} \lambda_j^{(t)} e^{-\lambda_j^{(t)} x_i}}{\sum_{m=1}^K \pi_m^{(t)} \lambda_m^{(t)} e^{-\lambda_m^{(t)} x_i}}
$$

We construct the expected complete-data log-likelihood (the Q-function) by replacing the indicator function with its expectation $\gamma_{ij}$:

$$
Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{j=1}^K \gamma_{ij} \left(\log \pi_j + \log \lambda_j - \lambda_j x_i\right)
$$

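The responsibility computation above can be sketched directly; the parameter values here are placeholders, not estimates:

```python
import numpy as np

# Sketch of the E-step: gamma[i, j] = posterior probability that x_i came
# from component j under the current guesses pi, lam (placeholder values).
def e_step(x, pi, lam):
    # Unnormalized joint: pi_j * lam_j * exp(-lam_j * x_i), shape (n, K)
    weighted = pi * lam * np.exp(-np.outer(x, lam))
    # Normalize each row so the responsibilities sum to 1 over components
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([0.1, 1.0, 5.0])
gamma = e_step(x, pi=np.array([0.5, 0.5]), lam=np.array([0.5, 4.0]))
print(gamma.sum(axis=1))  # each row sums to 1 by construction
```

Note how a large sample like $x_i = 5$ gets almost all of its responsibility from the slowly decaying component ($\lambda = 0.5$), while a small sample leans toward the fast one.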
2. M-Step (Maximization)

In the M-step, we maximize the Q-function with respect to the parameters $\theta = \{\pi_j, \lambda_j\}$ to find the updated parameters $\theta^{(t+1)}$. Let $N_j = \sum_{i=1}^n \gamma_{ij}$ be the effective number of samples assigned to component $j$.

Updating $\pi_j$ (Mixture Weights):

We maximize $Q$ with respect to $\pi_j$ subject to the constraint $\sum_{j=1}^K \pi_j = 1$, using a Lagrange multiplier $\alpha$ and keeping only the terms of $Q$ that depend on $\pi$:

$$
\mathcal{L}(\pi, \alpha) = \sum_{i=1}^n \sum_{j=1}^K \gamma_{ij} \log \pi_j + \alpha \left(1 - \sum_{j=1}^K \pi_j\right)
$$

Taking the partial derivative with respect to πj\pi_j and setting it to zero:

$$
\frac{\partial \mathcal{L}}{\partial \pi_j} = \frac{1}{\pi_j} \sum_{i=1}^n \gamma_{ij} - \alpha = 0 \implies \pi_j = \frac{N_j}{\alpha}
$$

Summing over all $K$ components to solve for the Lagrange multiplier $\alpha$:

$$
\sum_{j=1}^K \pi_j = \sum_{j=1}^K \frac{N_j}{\alpha} = \frac{1}{\alpha} \sum_{j=1}^K \sum_{i=1}^n \gamma_{ij} = \frac{n}{\alpha} = 1 \implies \alpha = n
$$

where the inner step uses the fact that the responsibilities for each sample sum to one: $\sum_{j=1}^K \gamma_{ij} = 1$.

Thus, the update rule for the mixture weights is:

$$
\pi_j^{(t+1)} = \frac{N_j}{n}
$$

Updating $\lambda_j$ (Component Parameters):

We take the partial derivative of the Q-function with respect to $\lambda_j$ and set it to zero:

$$
\frac{\partial Q}{\partial \lambda_j} = \sum_{i=1}^n \gamma_{ij} \left( \frac{1}{\lambda_j} - x_i \right) = 0
$$

$$
\frac{1}{\lambda_j} \sum_{i=1}^n \gamma_{ij} - \sum_{i=1}^n \gamma_{ij} x_i = 0
$$

Replacing the sum of responsibilities with NjN_j:

$$
\frac{N_j}{\lambda_j} = \sum_{i=1}^n \gamma_{ij} x_i
$$

Rearranging for λj\lambda_j, we get the update rule:

$$
\lambda_j^{(t+1)} = \frac{N_j}{\sum_{i=1}^n \gamma_{ij} x_i}
$$

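The two closed-form updates can be sketched together in a hypothetical `m_step` helper, vectorized with NumPy:

```python
import numpy as np

# Sketch of the M-step updates derived above. gamma has shape (n, K),
# x has shape (n,); returns the new weights and rates.
def m_step(x, gamma):
    N = gamma.sum(axis=0)        # effective counts N_j
    pi_new = N / len(x)          # pi_j = N_j / n
    lam_new = N / (gamma.T @ x)  # lambda_j = N_j / sum_i gamma_ij * x_i
    return pi_new, lam_new

# With hard (0/1) responsibilities, the rate update reduces to
# 1 / (mean of the points assigned to the component), i.e. the usual
# single-exponential MLE, which is a useful sanity check.
x = np.array([1.0, 2.0, 4.0])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pi_new, lam_new = m_step(x, gamma)
print(pi_new, lam_new)  # pi = [2/3, 1/3], lam = [1/1.5, 1/4]
```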
Summary of the Derived EM Algorithm

Given initial parameters $\theta^{(0)} = \{\pi_j^{(0)}, \lambda_j^{(0)}\}_{j=1}^K$, iterate the following steps until convergence:

  1. E-step: Calculate the responsibilities for $i = 1, \dots, n$ and $j = 1, \dots, K$:

$$
\gamma_{ij} = \frac{\pi_j^{(t)} \lambda_j^{(t)} e^{-\lambda_j^{(t)} x_i}}{\sum_{m=1}^K \pi_m^{(t)} \lambda_m^{(t)} e^{-\lambda_m^{(t)} x_i}}
$$

  2. M-step: Update the parameters using the responsibilities:

$$
N_j = \sum_{i=1}^n \gamma_{ij}, \qquad \pi_j^{(t+1)} = \frac{N_j}{n}, \qquad \lambda_j^{(t+1)} = \frac{N_j}{\sum_{i=1}^n \gamma_{ij} x_i}
$$
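
Putting the two steps together, a minimal end-to-end sketch (fixed iteration count and a naive initialization around the reciprocal of the sample mean; all names and values are illustrative, not a production implementation):

```python
import numpy as np

# Minimal EM loop for a mixture of exponentials, as derived above.
# The initialization and stopping rule are simplistic placeholders.
def em_exponential_mixture(x, K, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                   # uniform initial weights
    lam = rng.uniform(0.5, 2.0, K) / x.mean()  # rates spread around 1/mean
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j]
        w = pi * lam * np.exp(-np.outer(x, lam))
        gamma = w / w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        N = gamma.sum(axis=0)
        pi = N / len(x)
        lam = N / (gamma.T @ x)
    return pi, lam

# Quick check on synthetic data with true rates 0.5 and 5.0
rng = np.random.default_rng(1)
x = np.concatenate([rng.exponential(1 / 0.5, 1000),
                    rng.exponential(1 / 5.0, 1000)])
pi_hat, lam_hat = em_exponential_mixture(x, K=2)
print(np.sort(lam_hat))  # should land near the true rates 0.5 and 5.0
```

A practical stopping criterion would monitor the observed-data log-likelihood $\sum_i \log \sum_j \pi_j \lambda_j e^{-\lambda_j x_i}$, which EM never decreases, rather than running a fixed number of iterations.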