Answer: MLE of the mean $\mu$

Prerequisites

  • Multivariate Gaussian Distribution (PDF)
  • Maximum Likelihood Estimation (MLE)
  • Matrix Calculus

Step-by-Step Derivation

1. Write the Likelihood Function

The probability density function for a single sample $x_i \in \mathbb{R}^d$ from a multivariate Gaussian is:

$$p(x_i | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right)$$

Assuming the samples $\{x_1, \cdots, x_N\}$ are independent and identically distributed (i.i.d.), the likelihood function $L(\mu, \Sigma)$ is the product of the individual densities:

$$L(\mu, \Sigma) = \prod_{i=1}^N p(x_i | \mu, \Sigma)$$
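As a quick numerical companion (not part of the derivation), here is a minimal NumPy/SciPy sketch with illustrative parameters (`mu_true`, `Sigma`, and the sample size are arbitrary choices) that evaluates the density formula by hand, cross-checks it against `scipy.stats.multivariate_normal`, and forms the i.i.d. likelihood as a product:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d, N = 2, 5                      # N kept small so the raw product stays representable
mu_true = np.array([1.0, -2.0])  # illustrative parameters, not from the text
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma, size=N)  # rows are samples x_i

# Density of each sample, computed directly from the formula above
Sigma_inv = np.linalg.inv(Sigma)
diff = X - mu_true                                      # shape (N, d)
quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)  # (x_i - mu)^T Sigma^{-1} (x_i - mu)
norm_const = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
p_manual = np.exp(-0.5 * quad) / norm_const

# Cross-check against SciPy, then form the i.i.d. likelihood as a product
p_scipy = multivariate_normal.pdf(X, mean=mu_true, cov=Sigma)
assert np.allclose(p_manual, p_scipy)
print(f"L(mu, Sigma) = {np.prod(p_manual):.3e}")
```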

2. Formulate the Log-Likelihood

To simplify the differentiation, we take the natural logarithm of the likelihood function to get the log-likelihood $\ell(\mu, \Sigma)$:

$$\ell(\mu, \Sigma) = \log L(\mu, \Sigma) = \sum_{i=1}^N \log p(x_i | \mu, \Sigma)$$

$$\ell(\mu, \Sigma) = \sum_{i=1}^N \left( -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right)$$

Dropping the terms that do not depend on $\mu$, the objective function with respect to $\mu$ is:

$$J(\mu) = -\frac{1}{2} \sum_{i=1}^N (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$$
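To see that dropping the constant terms is harmless, a small sketch (again with arbitrary illustrative data) can confirm that $\ell(\mu, \Sigma)$ and $J(\mu)$ differ by the same $\mu$-independent constant at every test point:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d, N = 2, 100
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
X = rng.multivariate_normal([1.0, -2.0], Sigma, size=N)

def log_lik(mu):
    """Full log-likelihood ell(mu, Sigma), via SciPy."""
    return multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()

def J(mu):
    """Objective after dropping the mu-independent terms."""
    diff = X - mu
    return -0.5 * np.einsum('ij,jk,ik->', diff, Sigma_inv, diff)

# The gap ell - J is the same constant, -N/2 (d log 2pi + log|Sigma|), for any mu
for mu in (np.zeros(d), np.array([3.0, 3.0])):
    print(mu, log_lik(mu) - J(mu))
```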

3. Expand the Quadratic Term

Let's expand the term $(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$:

$$(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) = x_i^T \Sigma^{-1} x_i - x_i^T \Sigma^{-1} \mu - \mu^T \Sigma^{-1} x_i + \mu^T \Sigma^{-1} \mu$$

Since $\Sigma$ is symmetric ($\Sigma = \Sigma^T$), its inverse $\Sigma^{-1}$ is also symmetric. Because $x_i^T \Sigma^{-1} \mu$ is a scalar, it equals its own transpose, so $x_i^T \Sigma^{-1} \mu = (x_i^T \Sigma^{-1} \mu)^T = \mu^T \Sigma^{-1} x_i$, and the two cross terms combine:

$$(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) = x_i^T \Sigma^{-1} x_i - 2 \mu^T \Sigma^{-1} x_i + \mu^T \Sigma^{-1} \mu$$
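A quick numerical spot-check of this identity, using an arbitrary symmetric positive definite matrix in place of $\Sigma^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
x = rng.normal(size=d)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma_inv = A @ A.T + d * np.eye(d)  # symmetric positive definite stand-in

lhs = (x - mu) @ Sigma_inv @ (x - mu)
rhs = x @ Sigma_inv @ x - 2 * mu @ Sigma_inv @ x + mu @ Sigma_inv @ mu
assert np.isclose(lhs, rhs)  # the expansion holds because Sigma_inv is symmetric
```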

4. Compute the Derivative with Respect to $\mu$

Taking the partial derivative of $J(\mu)$ with respect to $\mu$:

$$\frac{\partial}{\partial \mu} J(\mu) = -\frac{1}{2} \sum_{i=1}^N \frac{\partial}{\partial \mu} \left( x_i^T \Sigma^{-1} x_i - 2 (\Sigma^{-1} x_i)^T \mu + \mu^T \Sigma^{-1} \mu \right)$$

Using the following standard matrix-calculus identities:

  • $\frac{\partial}{\partial \mu} x_i^T \Sigma^{-1} x_i = 0$ (constant w.r.t. $\mu$)
  • $\frac{\partial}{\partial \mu} \left( -2 (\Sigma^{-1} x_i)^T \mu \right) = -2 \Sigma^{-1} x_i$
  • $\frac{\partial}{\partial \mu} (\mu^T \Sigma^{-1} \mu) = \Sigma^{-1} \mu + (\Sigma^{-1})^T \mu = 2 \Sigma^{-1} \mu$ (since $\Sigma^{-1}$ is symmetric)

Plugging these back into the sum:

$$\frac{\partial \ell}{\partial \mu} = -\frac{1}{2} \sum_{i=1}^N \left( -2 \Sigma^{-1} x_i + 2 \Sigma^{-1} \mu \right) = \sum_{i=1}^N \Sigma^{-1} (x_i - \mu)$$
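The analytic gradient can be validated against central finite differences; the sketch below uses arbitrary test data and an arbitrary evaluation point `mu`:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 2, 50
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
X = rng.multivariate_normal([1.0, -2.0], Sigma, size=N)
mu = np.array([0.5, 0.5])  # arbitrary test point

def J(m):
    diff = X - m
    return -0.5 * np.einsum('ij,jk,ik->', diff, Sigma_inv, diff)

# Analytic gradient: sum_i Sigma^{-1} (x_i - mu)
grad_analytic = Sigma_inv @ (X - mu).sum(axis=0)

# Central finite differences along each coordinate direction
eps = 1e-6
grad_fd = np.array([(J(mu + eps * e) - J(mu - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
assert np.allclose(grad_analytic, grad_fd, rtol=1e-5)
```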

5. Set Derivative to Zero and Solve for $\hat{\mu}$

To find the maximum, set the derivative equal to the zero vector:

$$\sum_{i=1}^N \Sigma^{-1} (x_i - \hat{\mu}) = 0$$

Since $\Sigma^{-1}$ is a constant, invertible matrix, we can multiply both sides on the left by $\Sigma$:

$$\sum_{i=1}^N (x_i - \hat{\mu}) = 0$$

$$\sum_{i=1}^N x_i - N \hat{\mu} = 0 \implies N \hat{\mu} = \sum_{i=1}^N x_i$$

$$\hat{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^N x_i$$

This proves that the Maximum Likelihood Estimate for the mean is simply the sample mean. (The Hessian of $\ell$ with respect to $\mu$ is $-N \Sigma^{-1}$, which is negative definite since $\Sigma$ is positive definite, so this stationary point is indeed a maximum.)
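As a final sanity check, the following sketch (with simulated data; the perturbation directions are arbitrary) confirms that moving away from the sample mean in any direction lowers the log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
d, N = 2, 200
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal([1.0, -2.0], Sigma, size=N)

mu_hat = X.mean(axis=0)  # the closed-form MLE derived above

def log_lik(mu):
    return multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()

# Any perturbation of the sample mean should strictly lower the log-likelihood
for delta in (0.1 * np.eye(d)[0], -0.05 * np.eye(d)[1], np.array([0.02, -0.03])):
    assert log_lik(mu_hat) > log_lik(mu_hat + delta)
print("sample mean (MLE):", mu_hat)
```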