Explanation of MLE for the Mean
The Goal
We want to find the value of the mean vector $\mu$ that makes the observed data samples most probable. This is the essence of Maximum Likelihood Estimation (MLE).
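Written out (using standard notation for $N$ i.i.d. samples $x_1, \dots, x_N$ from a multivariate Gaussian with known covariance $\Sigma$; the symbols are introduced here for concreteness), the goal is

$$\hat{\mu}_{\text{MLE}} = \arg\max_{\mu} \, p(x_1, \dots, x_N \mid \mu, \Sigma) = \arg\max_{\mu} \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \Sigma).$$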
The Process
- Formulate the Likelihood: We start with the probability density function (PDF) of a single data point. Since we assume the samples are independent, the joint density (the likelihood) is the product of the individual densities.
- Log-Trick: Multiplying many probabilities (each a small number < 1) is cumbersome and numerically unstable. Taking the natural logarithm turns the product into a sum. Since $\ln$ is a strictly increasing function, maximizing the log-likelihood is equivalent to maximizing the likelihood.
- Focus on $\mu$: We inspect the log-likelihood equation. To find the maximum with respect to $\mu$, we look for the "peak" of the function. Calculus tells us this peak occurs where the gradient (derivative) is zero.
- Differentiation: We differentiate the log-likelihood function with respect to the vector $\mu$. The key term involves a quadratic form $(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)$, which represents the multivariate "distance" (Mahalanobis distance) of point $x_i$ from the mean $\mu$. The derivative essentially behaves like the derivative of $(x - \mu)^2$ in 1D, which is $-2(x - \mu)$. In matrix calculus, the inverse covariance matrix $\Sigma^{-1}$ acts as a weighting factor.
- Solving: Setting the derivative to zero gives us a linear equation. We find that the optimal $\mu$ is simply the arithmetic average of all data points, $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$. This matches our intuition: the best estimate for the center of a Gaussian cloud of points is the average of those points. The derivation and a quick numerical check are sketched below the list.
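For reference, here is one compact way to write the derivation (again using the notation $x_1, \dots, x_N \in \mathbb{R}^d$ with known covariance $\Sigma$, which is an assumption of this sketch rather than something fixed by the steps above). The log-likelihood is

$$\ln L(\mu) = -\frac{N}{2}\ln\!\big((2\pi)^d \det \Sigma\big) - \frac{1}{2}\sum_{i=1}^{N} (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu),$$

and setting its gradient with respect to $\mu$ to zero gives

$$\nabla_{\mu}\ln L(\mu) = \sum_{i=1}^{N} \Sigma^{-1}(x_i - \mu) = 0 \;\Longrightarrow\; \sum_{i=1}^{N}(x_i - \mu) = 0 \;\Longrightarrow\; \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i.$$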
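As a minimal numerical sanity check (a sketch assuming NumPy; the sample data, seed, and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: N samples from a 2-D Gaussian with a known covariance.
true_mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, Sigma, size=5000)  # shape (N, d)

# MLE for the mean: the arithmetic average of the samples.
mu_hat = X.mean(axis=0)

# Gradient of the log-likelihood at mu_hat: sum_i Sigma^{-1} (x_i - mu_hat).
# It is (numerically) zero, confirming mu_hat is the stationary point.
grad = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)

print("mu_hat:", mu_hat)             # close to true_mu
print("gradient at mu_hat:", grad)   # ~ [0, 0] up to floating-point error
```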