Answer

Prerequisites

  • Bayes Decision Rule (BDR) for 0-1 Loss: The optimal decision rule that minimizes the probability of misclassification is to choose the class that maximizes the posterior probability: $g(x)^* = \text{argmax}_j \, p(y=j|x)$.
  • Bayes' Theorem: $p(y=j|x) = \frac{p(x|y=j)\,p(y=j)}{p(x)}$.
  • Multivariate Gaussian Distribution: The probability density function is $p(x|y=j) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_j)^T\Sigma^{-1}(x-\mu_j)\right)$.
  • Log-Likelihood: Maximizing a function is equivalent to maximizing its logarithm, as the logarithm is a monotonically increasing function.
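
As a quick sanity check on the last two prerequisites, the sketch below (a hypothetical numpy implementation with made-up toy numbers, not part of the original derivation) evaluates the Gaussian density directly and confirms that taking the logarithm does not change which class density is larger:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(x | mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm

# Toy example: two class means, one shared covariance (made-up numbers).
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
x = np.array([1.5, 0.5])

densities = [gaussian_pdf(x, mu, Sigma) for mu in mus]
log_densities = [np.log(p) for p in densities]
# log is monotonically increasing, so both rankings agree:
assert np.argmax(densities) == np.argmax(log_densities)
```
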

Step-by-Step Derivation

  1. Start with the BDR: To minimize the 0-1 loss, we maximize the posterior probability: $g(x)^* = \text{argmax}_j \, p(y=j|x)$.

  2. Apply Bayes' Theorem: $p(y=j|x) = \frac{p(x|y=j)\,p(y=j)}{p(x)}$. Since the evidence $p(x)$ is the same for all classes $j$, it does not affect the $\text{argmax}$ operation. We can instead maximize the joint probability: $g(x)^* = \text{argmax}_j \, [p(x|y=j)\,p(y=j)]$.
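
To see concretely why the evidence can be dropped, here is a small numeric check (hypothetical likelihoods and priors, chosen for illustration) comparing the argmax over posteriors with the argmax over joints:

```python
import numpy as np

# Hypothetical per-class likelihoods p(x|y=j) and priors p(y=j) at a fixed x.
likelihoods = np.array([0.05, 0.20, 0.10])
priors = np.array([0.5, 0.2, 0.3])

joint = likelihoods * priors        # p(x|y=j) p(y=j)
evidence = joint.sum()              # p(x), identical for every class
posterior = joint / evidence        # p(y=j|x) by Bayes' theorem

# Dividing by the class-independent evidence cannot change the winner.
assert np.argmax(joint) == np.argmax(posterior)
```
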

  3. Take the Logarithm: To simplify the exponential terms in the Gaussian distribution, we take the natural logarithm of the objective function. Let this be our discriminant function $g_j(x)$:

     $$g_j(x) = \log\big(p(x|y=j)\,p(y=j)\big) = \log p(x|y=j) + \log p(y=j)$$

  4. Substitute the Gaussian Density and Prior: Given $p(x|y=j) = \mathcal{N}(x|\mu_j, \Sigma)$ and $p(y=j) = \pi_j$:

     $$g_j(x) = \log\left(\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_j)^T\Sigma^{-1}(x-\mu_j)\right)\right) + \log \pi_j$$

     $$g_j(x) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu_j)^T\Sigma^{-1}(x-\mu_j) + \log \pi_j$$

  5. Remove Class-Independent Terms: The terms $-\frac{d}{2}\log(2\pi)$ and $-\frac{1}{2}\log|\Sigma|$ are constants with respect to the class index $j$ (because all classes share the same covariance matrix $\Sigma$). We can drop them from the discriminant function:

     $$g_j(x) = -\frac{1}{2}(x-\mu_j)^T\Sigma^{-1}(x-\mu_j) + \log \pi_j$$

  6. Expand the Quadratic Term:

     $$(x-\mu_j)^T\Sigma^{-1}(x-\mu_j) = x^T\Sigma^{-1}x - x^T\Sigma^{-1}\mu_j - \mu_j^T\Sigma^{-1}x + \mu_j^T\Sigma^{-1}\mu_j$$

     Since $\Sigma$ is a covariance matrix, it is symmetric, which means its inverse $\Sigma^{-1}$ is also symmetric. Therefore the scalar $x^T\Sigma^{-1}\mu_j$ is equal to its transpose $\mu_j^T\Sigma^{-1}x$, and the two cross terms combine:

     $$(x-\mu_j)^T\Sigma^{-1}(x-\mu_j) = x^T\Sigma^{-1}x - 2\mu_j^T\Sigma^{-1}x + \mu_j^T\Sigma^{-1}\mu_j$$
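
The expansion in step 6 is easy to verify numerically. The sketch below (illustrative only; the random $\Sigma$, $x$, and $\mu_j$ are made up) builds a symmetric positive-definite covariance and checks that both sides of the identity agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive-definite Sigma (A A^T + I), arbitrary x and mu_j.
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)
Sigma_inv = np.linalg.inv(Sigma)
x = rng.normal(size=3)
mu = rng.normal(size=3)

lhs = (x - mu) @ Sigma_inv @ (x - mu)
rhs = x @ Sigma_inv @ x - 2 * mu @ Sigma_inv @ x + mu @ Sigma_inv @ mu
# Symmetry of Sigma^{-1} is what lets the two cross terms merge into -2 mu^T Sigma^{-1} x.
assert np.isclose(lhs, rhs)
```
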

  7. Substitute Back and Simplify:

     $$g_j(x) = -\frac{1}{2}\left( x^T\Sigma^{-1}x - 2\mu_j^T\Sigma^{-1}x + \mu_j^T\Sigma^{-1}\mu_j \right) + \log \pi_j$$

     $$g_j(x) = -\frac{1}{2}x^T\Sigma^{-1}x + \mu_j^T\Sigma^{-1}x - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log \pi_j$$

     The term $-\frac{1}{2}x^T\Sigma^{-1}x$ is independent of $j$, so we can drop it as well. The simplified discriminant function becomes:

     $$g_j(x) = \mu_j^T\Sigma^{-1}x - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log \pi_j$$

  8. Formulate as a Linear Function: We can rewrite this in the form $g_j(x) = w_j^T x + b_j$. Let $w_j = \Sigma^{-1}\mu_j$; then $w_j^T = (\Sigma^{-1}\mu_j)^T = \mu_j^T(\Sigma^{-1})^T = \mu_j^T\Sigma^{-1}$. Let $b_j = -\frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log \pi_j$. Substituting these into our equation yields

     $$g_j(x) = w_j^T x + b_j,$$

     which is linear in $x$. This completes the proof.
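
The final linear form translates directly into a classifier. Below is a minimal sketch (the function names `lda_params`/`classify` and the toy parameters are hypothetical, introduced only for illustration) that builds $w_j$ and $b_j$ per class and checks that the linear discriminant picks the same class as the un-simplified log joint from step 5:

```python
import numpy as np

def lda_params(mus, Sigma, priors):
    """Per-class weights w_j = Sigma^{-1} mu_j and offsets
    b_j = -1/2 mu_j^T Sigma^{-1} mu_j + log pi_j."""
    Sigma_inv = np.linalg.inv(Sigma)
    ws = [Sigma_inv @ mu for mu in mus]
    bs = [-0.5 * mu @ Sigma_inv @ mu + np.log(pi)
          for mu, pi in zip(mus, priors)]
    return ws, bs

def classify(x, ws, bs):
    """Pick the class with the largest linear discriminant w_j^T x + b_j."""
    scores = [w @ x + b for w, b in zip(ws, bs)]
    return int(np.argmax(scores))

# Toy parameters (made up): three classes sharing one covariance matrix.
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = [0.5, 0.25, 0.25]

ws, bs = lda_params(mus, Sigma, priors)

# Cross-check against the discriminant of step 5, which still contains
# the full quadratic term; the two differ only by class-independent terms.
x = np.array([2.5, 0.4])
Sigma_inv = np.linalg.inv(Sigma)
full = [-0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(pi)
        for mu, pi in zip(mus, priors)]
assert classify(x, ws, bs) == int(np.argmax(full))
```

Because the linear and quadratic discriminants differ only by the dropped $-\frac{1}{2}x^T\Sigma^{-1}x$ term (the same for every class), the assertion holds at every $x$, which is exactly what the derivation establishes.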