Answer.md

Prerequisite Knowledge

  1. Multivariate Gaussian Distribution PDF: The probability density function for a $d$-dimensional Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$ is:

     $$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$

  2. Likelihood Function: Assuming the samples $\{x_1, \dots, x_N\}$ are independent and identically distributed (i.i.d.), the likelihood function is:

     $$L(\mu, \Sigma) = \prod_{i=1}^N p(x_i \mid \mu, \Sigma)$$

  3. Log-Likelihood Function: It is usually easier to maximize the log-likelihood:

     $$\ell(\mu, \Sigma) = \log L(\mu, \Sigma) = \sum_{i=1}^N \log p(x_i \mid \mu, \Sigma)$$

  4. Matrix/Vector Derivatives (Given in problem):

    • $\frac{\partial}{\partial x} x^T A x = (A + A^T)x$. Since $\Sigma^{-1}$ is symmetric, $\frac{\partial}{\partial x} x^T \Sigma^{-1} x = 2\Sigma^{-1} x$.
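As a quick numerical sanity check (not part of the original problem), the quadratic-form derivative identity above can be verified against a central finite-difference approximation; the matrix `A` and point `x` here are arbitrary random values:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))  # arbitrary (non-symmetric) matrix
x = rng.standard_normal(d)

# Analytic gradient of the quadratic form x^T A x: (A + A^T) x
grad_analytic = (A + A.T) @ x

# Central finite-difference approximation of the same gradient
f = lambda v: v @ A @ v
eps = 1e-6
grad_fd = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    grad_fd[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True
```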

Step-by-Step Answer

  1. Write down the Log-Likelihood Function:

     $$\begin{aligned} \ell(\mu, \Sigma) &= \sum_{i=1}^N \left[ -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right] \\ &= -\frac{Nd}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2} \sum_{i=1}^N (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \end{aligned}$$
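To confirm the expansion is correct, a sketch below (with arbitrary synthetic data, not from the problem) checks that the collapsed closed form equals the per-sample sum of log densities:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.standard_normal((N, d))          # synthetic samples
mu = rng.standard_normal(d)
# Random symmetric positive-definite covariance
M = rng.standard_normal((d, d))
Sigma = M @ M.T + d * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)     # log|Sigma|

# Collapsed form: -(Nd/2)log(2π) - (N/2)log|Σ| - (1/2) Σ_i (x_i-μ)^T Σ^{-1} (x_i-μ)
diffs = X - mu
quad = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)
ll_closed = -0.5 * N * d * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad.sum()

# Sum of per-sample log densities, straight from the PDF
ll_sum = sum(
    -0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
    - 0.5 * (x - mu) @ Sigma_inv @ (x - mu)
    for x in X
)
print(np.isclose(ll_closed, ll_sum))  # True
```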
  2. Differentiate with respect to $\mu$: We want to maximize $\ell(\mu, \Sigma)$ with respect to $\mu$, so we can ignore terms that do not depend on $\mu$. Let $J(\mu) = -\frac{1}{2} \sum_{i=1}^N (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$.

     Let $z_i = x_i - \mu$, so $\frac{\partial z_i}{\partial \mu} = -I$. Each term has the form $z_i^T \Sigma^{-1} z_i$, and since $\Sigma$ is symmetric, $\Sigma^{-1}$ is symmetric as well, so the identity $\frac{\partial}{\partial z} z^T A z = (A + A^T)z$ gives $\frac{\partial}{\partial z_i} z_i^T \Sigma^{-1} z_i = 2\Sigma^{-1} z_i$. By the chain rule:

     $$\frac{\partial}{\partial \mu} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) = 2\Sigma^{-1}(x_i - \mu) \cdot (-1) = -2\Sigma^{-1}(x_i - \mu)$$

    Therefore:

     $$\frac{\partial \ell}{\partial \mu} = -\frac{1}{2} \sum_{i=1}^N \left[ -2\Sigma^{-1}(x_i - \mu) \right] = \sum_{i=1}^N \Sigma^{-1}(x_i - \mu)$$
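The gradient formula can also be spot-checked numerically (again on arbitrary synthetic data): the analytic expression $\Sigma^{-1} \sum_i (x_i - \mu)$ should match finite differences of the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 3
X = rng.standard_normal((N, d))          # synthetic samples
mu = rng.standard_normal(d)              # arbitrary test point
M = rng.standard_normal((d, d))
Sigma = M @ M.T + d * np.eye(d)          # random SPD covariance
Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)

def loglik(m):
    """Gaussian log-likelihood of X at mean m, fixed Sigma."""
    diffs = X - m
    quad = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)
    return -0.5 * N * d * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad.sum()

# Analytic gradient: Σ^{-1} Σ_i (x_i - μ)
grad_analytic = Sigma_inv @ (X - mu).sum(axis=0)

# Central finite differences of ℓ with respect to μ
eps = 1e-6
I = np.eye(d)
grad_fd = np.array([
    (loglik(mu + eps * I[i]) - loglik(mu - eps * I[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True
```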
  3. Set the derivative to zero and solve for μ^\hat{\mu}:

     $$\begin{aligned} \sum_{i=1}^N \Sigma^{-1}(x_i - \hat{\mu}) &= 0 \\ \Sigma^{-1} \sum_{i=1}^N (x_i - \hat{\mu}) &= 0 \end{aligned}$$

     Assuming $\Sigma$ is positive definite (hence invertible), we can multiply by $\Sigma$ on the left:

     $$\begin{aligned} \sum_{i=1}^N (x_i - \hat{\mu}) &= 0 \\ \sum_{i=1}^N x_i - \sum_{i=1}^N \hat{\mu} &= 0 \\ \sum_{i=1}^N x_i - N\hat{\mu} &= 0 \\ N\hat{\mu} &= \sum_{i=1}^N x_i \\ \hat{\mu} &= \frac{1}{N} \sum_{i=1}^N x_i \end{aligned}$$

     That is, the maximum likelihood estimate of $\mu$ is exactly the sample mean. (Since $\Sigma^{-1}$ is positive definite, $\ell$ is strictly concave in $\mu$, so this critical point is indeed the unique maximum.)
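The full result can be sketched end to end (on synthetic data, as an illustration only): by strict concavity, the sample mean should achieve at least the log-likelihood of any perturbed candidate mean.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 100, 2
X = rng.standard_normal((N, d)) + np.array([1.0, -2.0])  # synthetic samples
M = rng.standard_normal((d, d))
Sigma = M @ M.T + d * np.eye(d)          # random SPD covariance
Sigma_inv = np.linalg.inv(Sigma)
_, logdet = np.linalg.slogdet(Sigma)

def loglik(m):
    """Gaussian log-likelihood of X at mean m, fixed Sigma."""
    diffs = X - m
    quad = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)
    return -0.5 * N * d * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad.sum()

mu_hat = X.mean(axis=0)                  # the MLE derived above

# Perturbed candidates should never beat the sample mean
candidates = [mu_hat + 0.1 * rng.standard_normal(d) for _ in range(20)]
print(all(loglik(mu_hat) >= loglik(m) for m in candidates))  # True
```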