Answer

Prerequisites

  • Expectation-Maximization (EM) Algorithm: An iterative method to find maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables.
  • Maximum Likelihood Estimation (MLE): A method of estimating the parameters of a statistical model given observations.
  • Exponential Distribution: A continuous probability distribution parameterized by a rate parameter $\lambda > 0$. Its Probability Density Function (PDF) is $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$.
  • Lagrange Multipliers: A strategy for finding the local maxima and minima of a function subject to equality constraints.

Step-by-Step Derivation

To derive the EM algorithm for the mixture of exponentials, we model the data as draws from the density $p(x \mid \theta) = \sum_{j=1}^K \pi_j \lambda_j e^{-\lambda_j x}$, where the weights $\pi_j$ sum to one. We introduce a latent variable $z_i \in \{1, \dots, K\}$ for each sample $x_i$, which indicates the component from which $x_i$ was generated. We can represent $z_i$ as a one-hot vector where $z_{ij} = 1$ if $x_i$ belongs to component $j$, and $0$ otherwise.

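Before deriving anything, it can help to see this generative process in code. A minimal NumPy sketch, with illustrative (not fitted) weights and rates:

```python
import numpy as np

# Generative sketch of a K=2 exponential mixture; the weights and rates
# below are illustrative choices, not values from the derivation.
rng = np.random.default_rng(0)
n = 20000
pi = np.array([0.3, 0.7])    # mixture weights, must sum to 1
lam = np.array([0.5, 4.0])   # rate parameters lambda_j

# Draw the latent component z_i, then x_i ~ Exp(lam[z_i]).
z = rng.choice(len(pi), size=n, p=pi)
x = rng.exponential(scale=1.0 / lam[z])  # NumPy parameterizes by scale = 1/lambda

# The sample mean should approach sum_j pi_j / lambda_j = 0.3/0.5 + 0.7/4 = 0.775
print(x.mean())
```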
The complete-data log-likelihood for the dataset $(X, Z) = \{(x_i, z_i)\}_{i=1}^n$ is:

$$
\ell_c(\theta) = \sum_{i=1}^n \log p(x_i, z_i \mid \theta) = \sum_{i=1}^n \sum_{j=1}^K I(z_i = j) \log\left(\pi_j \lambda_j e^{-\lambda_j x_i}\right)
$$

where $I(z_i = j)$ is an indicator function that equals 1 if $z_i = j$ and 0 otherwise.

1. E-Step (Expectation)

In the E-step, we compute the expected value of the latent variables $z_i$ given the data $X$ and the current parameter estimates $\theta^{(t)} = \{\pi_j^{(t)}, \lambda_j^{(t)}\}_{j=1}^K$.

We define the responsibility $\gamma_{ij}$ as the posterior probability that sample $x_i$ was generated by component $j$:

$$
\gamma_{ij} = E\left[I(z_i = j) \mid x_i, \theta^{(t)}\right] = p(z_i = j \mid x_i, \theta^{(t)})
$$

Using Bayes' theorem:

$$
\gamma_{ij} = \frac{p(z_i = j \mid \theta^{(t)})\, p(x_i \mid z_i = j, \theta^{(t)})}{\sum_{m=1}^K p(z_i = m \mid \theta^{(t)})\, p(x_i \mid z_i = m, \theta^{(t)})}
$$

Substituting the given distributions:

$$
\gamma_{ij} = \frac{\pi_j^{(t)} \lambda_j^{(t)} e^{-\lambda_j^{(t)} x_i}}{\sum_{m=1}^K \pi_m^{(t)} \lambda_m^{(t)} e^{-\lambda_m^{(t)} x_i}}
$$

We construct the expected complete-data log-likelihood (the Q-function) by replacing the indicator function with its expectation $\gamma_{ij}$:

$$
Q(\theta, \theta^{(t)}) = \sum_{i=1}^n \sum_{j=1}^K \gamma_{ij} \left(\log \pi_j + \log \lambda_j - \lambda_j x_i\right)
$$

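The responsibility computation above can be sketched directly; the parameter values here are placeholders, not estimates:

```python
import numpy as np

# Sketch of the E-step: gamma[i, j] = posterior probability that x_i came
# from component j under the current guesses pi, lam (placeholder values).
def e_step(x, pi, lam):
    # Unnormalized joint: pi_j * lam_j * exp(-lam_j * x_i), shape (n, K)
    weighted = pi * lam * np.exp(-np.outer(x, lam))
    # Normalize each row so the responsibilities sum to 1 over components
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([0.1, 1.0, 5.0])
gamma = e_step(x, pi=np.array([0.5, 0.5]), lam=np.array([0.5, 4.0]))
print(gamma.sum(axis=1))  # each row sums to 1 by construction
```

Note how a large sample like $x_i = 5$ gets almost all of its responsibility from the slowly decaying component ($\lambda = 0.5$), while a small sample leans toward the fast one.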
2. M-Step (Maximization)

In the M-step, we maximize the Q-function with respect to the parameters $\theta = \{\pi_j, \lambda_j\}$ to find the updated parameters $\theta^{(t+1)}$. Let $N_j = \sum_{i=1}^n \gamma_{ij}$ be the effective number of samples assigned to component $j$.

Updating $\pi_j$ (Mixture Weights):

We maximize $Q$ with respect to $\pi_j$ subject to the constraint $\sum_{j=1}^K \pi_j = 1$, using a Lagrange multiplier $\alpha$ and keeping only the terms of $Q$ that depend on $\pi$:

$$
\mathcal{L}(\pi, \alpha) = \sum_{i=1}^n \sum_{j=1}^K \gamma_{ij} \log \pi_j + \alpha \left(1 - \sum_{j=1}^K \pi_j\right)
$$

Taking the partial derivative with respect to πj\pi_j and setting it to zero:

$$
\frac{\partial \mathcal{L}}{\partial \pi_j} = \frac{1}{\pi_j} \sum_{i=1}^n \gamma_{ij} - \alpha = 0 \implies \pi_j = \frac{N_j}{\alpha}
$$

Summing over all $K$ components to solve for the Lagrange multiplier $\alpha$:

$$
\sum_{j=1}^K \pi_j = \sum_{j=1}^K \frac{N_j}{\alpha} = \frac{1}{\alpha} \sum_{j=1}^K \sum_{i=1}^n \gamma_{ij} = \frac{n}{\alpha} = 1 \implies \alpha = n
$$

where the inner step uses the fact that the responsibilities for each sample sum to one: $\sum_{j=1}^K \gamma_{ij} = 1$.

Thus, the update rule for the mixture weights is:

$$
\pi_j^{(t+1)} = \frac{N_j}{n}
$$

Updating $\lambda_j$ (Component Parameters):

We take the partial derivative of the Q-function with respect to $\lambda_j$ and set it to zero:

$$
\frac{\partial Q}{\partial \lambda_j} = \sum_{i=1}^n \gamma_{ij} \left( \frac{1}{\lambda_j} - x_i \right) = 0
$$

$$
\frac{1}{\lambda_j} \sum_{i=1}^n \gamma_{ij} - \sum_{i=1}^n \gamma_{ij} x_i = 0
$$

Replacing the sum of responsibilities with NjN_j:

$$
\frac{N_j}{\lambda_j} = \sum_{i=1}^n \gamma_{ij} x_i
$$

Rearranging for λj\lambda_j, we get the update rule:

$$
\lambda_j^{(t+1)} = \frac{N_j}{\sum_{i=1}^n \gamma_{ij} x_i}
$$

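The two closed-form updates can be sketched together in a hypothetical `m_step` helper, vectorized with NumPy:

```python
import numpy as np

# Sketch of the M-step updates derived above. gamma has shape (n, K),
# x has shape (n,); returns the new weights and rates.
def m_step(x, gamma):
    N = gamma.sum(axis=0)        # effective counts N_j
    pi_new = N / len(x)          # pi_j = N_j / n
    lam_new = N / (gamma.T @ x)  # lambda_j = N_j / sum_i gamma_ij * x_i
    return pi_new, lam_new

# With hard (0/1) responsibilities, the rate update reduces to
# 1 / (mean of the points assigned to the component), i.e. the usual
# single-exponential MLE, which is a useful sanity check.
x = np.array([1.0, 2.0, 4.0])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pi_new, lam_new = m_step(x, gamma)
print(pi_new, lam_new)  # pi = [2/3, 1/3], lam = [1/1.5, 1/4]
```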
Summary of the Derived EM Algorithm

Given initial parameters $\theta^{(0)} = \{\pi_j^{(0)}, \lambda_j^{(0)}\}_{j=1}^K$, iterate the following steps until convergence:

  1. E-step: Calculate the responsibilities for $i = 1, \dots, n$ and $j = 1, \dots, K$:

$$
\gamma_{ij} = \frac{\pi_j^{(t)} \lambda_j^{(t)} e^{-\lambda_j^{(t)} x_i}}{\sum_{m=1}^K \pi_m^{(t)} \lambda_m^{(t)} e^{-\lambda_m^{(t)} x_i}}
$$

  2. M-step: Update the parameters using the responsibilities:

$$
N_j = \sum_{i=1}^n \gamma_{ij}, \qquad \pi_j^{(t+1)} = \frac{N_j}{n}, \qquad \lambda_j^{(t+1)} = \frac{N_j}{\sum_{i=1}^n \gamma_{ij} x_i}
$$
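
Putting the two steps together, a minimal end-to-end sketch (fixed iteration count and a naive initialization around the reciprocal of the sample mean; all names and values are illustrative, not a production implementation):

```python
import numpy as np

# Minimal EM loop for a mixture of exponentials, as derived above.
# The initialization and stopping rule are simplistic placeholders.
def em_exponential_mixture(x, K, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                   # uniform initial weights
    lam = rng.uniform(0.5, 2.0, K) / x.mean()  # rates spread around 1/mean
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j]
        w = pi * lam * np.exp(-np.outer(x, lam))
        gamma = w / w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        N = gamma.sum(axis=0)
        pi = N / len(x)
        lam = N / (gamma.T @ x)
    return pi, lam

# Quick check on synthetic data with true rates 0.5 and 5.0
rng = np.random.default_rng(1)
x = np.concatenate([rng.exponential(1 / 0.5, 1000),
                    rng.exponential(1 / 5.0, 1000)])
pi_hat, lam_hat = em_exponential_mixture(x, K=2)
print(np.sort(lam_hat))  # should land near the true rates 0.5 and 5.0
```

A practical stopping criterion would monitor the observed-data log-likelihood $\sum_i \log \sum_j \pi_j \lambda_j e^{-\lambda_j x_i}$, which EM never decreases, rather than running a fixed number of iterations.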