Explanation of MLE for Covariance

The Goal

We want to find the covariance matrix $\Sigma$ that maximizes the likelihood of observing the data samples $\{x_1, \dots, x_N\}$. Intuitively, $\Sigma$ describes the "spread" and "shape" of the data cloud.
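As a concrete sketch of the objective (using NumPy and synthetic data, not part of the original text), the Gaussian log-likelihood we are maximizing over $\Sigma$ can be written out directly. By construction the MLE covariance should score at least as high as any other candidate, such as the identity matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # toy data: N samples, d dimensions
mu = X.mean(axis=0)

def gaussian_log_likelihood(X, mu, Sigma):
    """Total log-likelihood of samples X under a Gaussian N(mu, Sigma)."""
    N, d = X.shape
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)  # numerically stable log|Sigma|
    # Sum of quadratic forms (x - mu)^T Sigma^{-1} (x - mu) over all samples
    quad = np.einsum('ni,ij,nj->', diff, inv, diff)
    return -0.5 * (N * d * np.log(2 * np.pi) + N * logdet + quad)

# The MLE covariance (scatter matrix / N) vs. an arbitrary candidate
S_hat = (X - mu).T @ (X - mu) / len(X)
print(gaussian_log_likelihood(X, mu, S_hat) >= gaussian_log_likelihood(X, mu, np.eye(2)))
```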

The Process

  1. Trace Trick: The quadratic form in the Gaussian PDF, $(x-\mu)^T \Sigma^{-1} (x-\mu)$, is a scalar, and the trace of a scalar is the scalar itself. A convenient property of the trace is cyclic permutation: $\text{tr}(ABC) = \text{tr}(BCA)$. Using this, we can rearrange the vectors $x-\mu$ to form the outer product $(x-\mu)(x-\mu)^T$, which has the structure of a covariance matrix. This lets us collect the sum over all data points into a single matrix $S = \sum_{n=1}^{N} (x_n-\mu)(x_n-\mu)^T$, the scatter matrix.
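A quick numeric check of this step (NumPy, illustrative only): the sum of scalar quadratic forms equals the trace of $\Sigma^{-1} S$, where $S$ is the scatter matrix built from outer products.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Left side: sum over samples of the scalar (x - mu)^T Sigma^{-1} (x - mu)
diff = X - mu
lhs = sum(d @ Sigma_inv @ d for d in diff)

# Right side: the trace trick groups the data into the scatter matrix S
S = diff.T @ diff  # sum of outer products (x - mu)(x - mu)^T
rhs = np.trace(Sigma_inv @ S)

print(np.isclose(lhs, rhs))
```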

  2. Derivative of Determinant: The log-likelihood contains $\log|\Sigma|$. The derivative of $\log(\det(X))$ with respect to $X$ is the inverse, $X^{-1}$ (for symmetric $X$). Intuitively, this term prevents the determinant (the volume of the probability density) from collapsing to zero or growing without bound relative to the exponential term.
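This identity can be verified numerically with a finite-difference check (a NumPy sketch, not part of the original derivation): perturbing each entry of a symmetric positive definite $X$ recovers the entries of $X^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = A @ A.T + 3 * np.eye(3)  # symmetric positive definite test matrix

def logdet(M):
    return np.linalg.slogdet(M)[1]

# Analytic gradient: d log|X| / dX = X^{-1} for symmetric X
grad_analytic = np.linalg.inv(X)

# Central finite differences, one matrix entry at a time
eps = 1e-6
grad_fd = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad_fd[i, j] = (logdet(X + E) - logdet(X - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))
```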

  3. Derivative of Inverse Trace: The exponential term involves $\Sigma^{-1}$. Differentiating through a matrix inverse is slightly more involved, but the identity $d(X^{-1}) = -X^{-1}\,(dX)\,X^{-1}$ simplifies it.
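The identity $d(X^{-1}) = -X^{-1}(dX)X^{-1}$ can also be sanity-checked numerically (NumPy, illustrative assumption): for a small perturbation $dX$, the predicted first-order change in the inverse matches the actual change up to second-order error.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 3)) + 3 * np.eye(3)  # well-conditioned test matrix
dX = 1e-6 * rng.normal(size=(3, 3))          # small perturbation

# First-order change predicted by d(X^{-1}) = -X^{-1} (dX) X^{-1}
predicted = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)

# Actual change in the inverse
actual = np.linalg.inv(X + dX) - np.linalg.inv(X)

print(np.allclose(predicted, actual, atol=1e-10))
```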

  4. Balancing Act: Setting the derivative to zero gives $-\frac{N}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} S \Sigma^{-1} = 0$, which represents a balance. The first term comes from the normalization constant (trying to make $\Sigma$ small to increase density), and the second term comes from the exponential "error" (trying to make $\Sigma$ large to accommodate the data spread).
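We can confirm numerically that the two terms cancel exactly at the MLE solution $\Sigma = S/N$ (a NumPy check on synthetic data, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
N = len(X)
diff = X - X.mean(axis=0)
S = diff.T @ diff  # scatter matrix

def gradient(Sigma):
    """The stationarity condition: -N/2 * Sigma^{-1} + 1/2 * Sigma^{-1} S Sigma^{-1}."""
    inv = np.linalg.inv(Sigma)
    return -0.5 * N * inv + 0.5 * inv @ S @ inv

# At Sigma = S/N the normalization term and the exponential term balance exactly
print(np.allclose(gradient(S / N), 0))
```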

  5. Result: The solution $\hat{\Sigma} = \frac{1}{N} S$ basically says the most likely covariance shape is exactly the average empirical covariance of the data points. (Note: in standard statistics we often divide by $N-1$ for an unbiased estimator, but the pure MLE divides by $N$.)
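The final estimator is a one-liner in practice. As a sketch (using NumPy; `np.cov` with `bias=True` divides by $N$, matching the MLE, while the default divides by $N-1$):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
mu_hat = X.mean(axis=0)
diff = X - mu_hat

# MLE: scatter matrix divided by N
Sigma_mle = diff.T @ diff / len(X)

# np.cov divides by N-1 by default (unbiased); bias=True reproduces the MLE
print(np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True)))
```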