-
Start with the BDR:
To minimize the 0-1 loss, we maximize the posterior probability:
g∗(x) = argmax_j p(y=j∣x)
-
Apply Bayes' Theorem:
p(y=j∣x) = p(x∣y=j) p(y=j) / p(x)
Since the evidence p(x) is the same for all classes j, it does not affect the argmax operation. We can instead maximize the joint probability:
g∗(x) = argmax_j [ p(x∣y=j) p(y=j) ]
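A minimal numeric sketch of this step, using made-up likelihood and prior values for three classes: dividing by the shared evidence p(x) rescales every class score by the same constant, so the argmax over the joint equals the argmax over the posterior.

```python
import numpy as np

# Hypothetical class-conditional likelihoods p(x|y=j) and priors p(y=j)
likelihood = np.array([0.2, 0.5, 0.1])
prior = np.array([0.3, 0.3, 0.4])

joint = likelihood * prior      # p(x|y=j) p(y=j)
evidence = joint.sum()          # p(x), identical for every class j
posterior = joint / evidence    # p(y=j|x) via Bayes' theorem

# Dividing by the shared evidence does not change which class wins
assert np.argmax(posterior) == np.argmax(joint)
```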
-
Take the Logarithm:
To simplify the exponential terms in the Gaussian distribution, we take the natural logarithm of the objective function. Let this be our discriminant function gj(x):
gj(x) = log(p(x∣y=j) p(y=j)) = log p(x∣y=j) + log p(y=j)
-
Substitute the Gaussian Density and Prior:
Given p(x∣y=j)=N(x∣μj,Σ) and p(y=j)=πj:
gj(x) = log( (1 / ((2π)^(d/2) ∣Σ∣^(1/2))) exp(−(1/2)(x−μj)TΣ−1(x−μj)) ) + log πj
gj(x) = −(d/2) log(2π) − (1/2) log∣Σ∣ − (1/2)(x−μj)TΣ−1(x−μj) + log πj
-
Remove Class-Independent Terms:
The terms −(d/2) log(2π) and −(1/2) log∣Σ∣ are constants with respect to the class index j (because all classes share the same covariance matrix Σ). We can drop them from the discriminant function:
gj(x) = −(1/2)(x−μj)TΣ−1(x−μj) + log πj
-
Expand the Quadratic Term:
(x−μj)TΣ−1(x−μj)=xTΣ−1x−xTΣ−1μj−μjTΣ−1x+μjTΣ−1μj
Since Σ is a covariance matrix, it is symmetric, which means its inverse Σ−1 is also symmetric. Therefore, the scalar xTΣ−1μj is equal to its transpose μjTΣ−1x.
(x−μj)TΣ−1(x−μj)=xTΣ−1x−2μjTΣ−1x+μjTΣ−1μj
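The symmetry argument and the resulting expansion can be checked numerically. A small sketch with a randomly generated symmetric positive-definite Σ (the construction A·Aᵀ + d·I is an assumption for the demo, chosen to guarantee invertibility):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)
mu = rng.normal(size=d)

# Build a symmetric positive-definite covariance, so its inverse is symmetric
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

# The two cross terms are equal scalars: x^T Σ⁻¹ μ = μ^T Σ⁻¹ x
assert np.isclose(x @ Sigma_inv @ mu, mu @ Sigma_inv @ x)

# Expanded form matches the original quadratic
quad = (x - mu) @ Sigma_inv @ (x - mu)
expanded = x @ Sigma_inv @ x - 2 * mu @ Sigma_inv @ x + mu @ Sigma_inv @ mu
assert np.isclose(quad, expanded)
```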
-
Substitute Back and Simplify:
gj(x) = −(1/2)(xTΣ−1x − 2μjTΣ−1x + μjTΣ−1μj) + log πj
gj(x) = −(1/2)xTΣ−1x + μjTΣ−1x − (1/2)μjTΣ−1μj + log πj
The term −(1/2)xTΣ−1x is independent of j, so we can drop it as well. The simplified discriminant function becomes:
gj(x) = μjTΣ−1x − (1/2)μjTΣ−1μj + log πj
-
Formulate as a Linear Function:
We can rewrite this in the form gj(x)=wjTx+bj.
Let wj=Σ−1μj. Then wjT=(Σ−1μj)T=μjT(Σ−1)T=μjTΣ−1.
Let bj = −(1/2)μjTΣ−1μj + log πj.
Substituting these into our equation yields:
gj(x)=wjTx+bj
This completes the proof.
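The whole derivation can be sketched end to end in NumPy. The means, shared covariance, and priors below are made-up values for the demo; the check confirms that the linear form wjTx + bj differs from the full log score log N(x∣μj, Σ) + log πj only by a class-independent constant, so both pick the same class.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 2, 3
mus = rng.normal(size=(k, d))        # hypothetical class means μ_j
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)      # shared covariance Σ (SPD by construction)
Sigma_inv = np.linalg.inv(Sigma)
priors = np.array([0.2, 0.5, 0.3])   # hypothetical priors π_j

# Linear form: g_j(x) = w_j^T x + b_j with w_j = Σ⁻¹ μ_j
W = mus @ Sigma_inv                  # row j is w_j^T = μ_j^T Σ⁻¹
b = -0.5 * np.einsum('ja,ab,jb->j', mus, Sigma_inv, mus) + np.log(priors)

def g_full(x):
    """log N(x|μ_j, Σ) + log π_j, before dropping any terms."""
    diffs = x - mus                  # shape (k, d)
    quad = np.einsum('ja,ab,jb->j', diffs, Sigma_inv, diffs)
    log_norm = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return log_norm - 0.5 * quad + np.log(priors)

x = rng.normal(size=d)
g_lin = W @ x + b

# The two scores differ only by a class-independent constant
# (namely −(d/2)log(2π) − (1/2)log|Σ| − (1/2) x^T Σ⁻¹ x),
# so both select the same class
diff = g_lin - g_full(x)
assert np.allclose(diff, diff[0])
assert np.argmax(g_lin) == np.argmax(g_full(x))
```

The dropped terms reappear here as the constant offset `diff`, which is exactly why discarding them was safe for classification.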