Explanation: MLE vs Least Squares
Probabilistic Interpretation
Part (a) approached the problem geometrically: "minimize the distance". Part (b) approaches the problem probabilistically: "maximize the probability of the data".
Gaussian Noise Assumption
The crucial link is the assumption of Gaussian noise. Because the Gaussian PDF involves an exponential of a squared term ($\exp\!\left(-\frac{(y - \hat{y})^2}{2\sigma^2}\right)$), taking the log of the Gaussian PDF gives you back the squared term ($-\frac{(y - \hat{y})^2}{2\sigma^2}$, plus a constant).
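To make that link explicit, here is the standard derivation, sketched for a linear model $\hat{y}_i = w^\top x_i$ with i.i.d. Gaussian noise of variance $\sigma^2$ (the symbols $w$, $x_i$, $\sigma$ are notational choices for this sketch, not taken from the problem statement):

$$
y_i = w^\top x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
$$

$$
\log L(w) = \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right) \right]
= -\frac{N}{2}\log(2\pi\sigma^2) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - w^\top x_i)^2
$$

The first term does not depend on $w$, so maximizing $\log L(w)$ is the same as minimizing $\sum_i (y_i - w^\top x_i)^2$, which is exactly the least-squares objective from part (a).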
Why Log-Likelihood?
We almost always work with Log-Likelihood in ML because:
- Numerical Stability: Multiplying many small probabilities (e.g., $p_1 \cdot p_2 \cdots p_N$ with each $p_i \ll 1$) results in underflow (numbers too small for computers to represent). Summing logs (e.g., $\log p_1 + \log p_2 + \cdots + \log p_N$) is stable; see the numerical sketch after this list.
- Simplified Math: Calculus is easier with sums than with products.
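A minimal numerical sketch of the stability point (the probability value 1e-5 and the count of 100 observations are arbitrary illustrative choices):

```python
import math

# 100 independent observations, each with a small likelihood of 1e-5
probs = [1e-5] * 100

# Direct product underflows to 0.0: 1e-500 is far below the smallest
# positive double (~5e-324), so the likelihood is lost entirely.
likelihood = math.prod(probs)
print(likelihood)  # 0.0

# Summing logs keeps the same information in a perfectly representable number.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # approx -1151.29
```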
Conclusion
We proved that Least Squares is not a heuristic; it is the statistically optimal solution IF the noise is Gaussian. If the noise followed a different distribution (e.g., Laplace), the optimal loss function would change (e.g., to Mean Absolute Error), as sketched below.
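As a sketch of that last point, assume i.i.d. Laplace noise with scale $b$ (notation chosen here for illustration):

$$
p(\epsilon_i) = \frac{1}{2b}\exp\!\left(-\frac{|\epsilon_i|}{b}\right)
\;\Rightarrow\;
\log L(w) = -N\log(2b) \;-\; \frac{1}{b}\sum_{i=1}^{N} \left| y_i - w^\top x_i \right|
$$

Maximizing this likelihood minimizes the sum of absolute errors, i.e., the Mean Absolute Error up to a factor of $N$, in the same way that Gaussian noise led to the sum of squared errors.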