-
Analyze the Loss Function as $q \to 0$:
The Minkowski loss is $L_q(g(x), y) = |g(x) - y|^q$.
Let's look at the limit as $q \to 0$:
- If $g(x) \neq y$, then $|g(x) - y| > 0$. Any positive number raised to the power of 0 is 1. So, $\lim_{q \to 0} |g(x) - y|^q = 1$.
- If $g(x) = y$, then $|g(x) - y| = 0$, and $0^q = 0$ for any $q > 0$. So, $\lim_{q \to 0} |g(x) - y|^q = 0$.
Therefore, as $q \to 0$, the loss function approaches the 0-1 loss function (often used in classification, but here applied to a continuous space):
$$L_0(g(x), y) = \begin{cases} 0 & \text{if } g(x) = y \\ 1 & \text{if } g(x) \neq y \end{cases}$$
Note: In a strictly continuous setting, the probability of guessing exactly $y$ is zero. A more rigorous approach considers a small tolerance $\epsilon$ around $g(x)$, i.e., the loss is 0 if $|g(x) - y| \leq \epsilon$ and 1 otherwise, and then takes the limit as $\epsilon \to 0$.
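The pointwise limit above can be checked numerically. A minimal sketch (the difference values and the choices of $q$ below are illustrative assumptions):

```python
# Sketch: |g(x) - y|^q tends to 1 for any nonzero difference and to 0
# for a zero difference as q -> 0; the values below are illustrative.
diffs = [0.0, 0.01, 0.5, 2.0, 100.0]
for q in [1.0, 0.1, 0.001]:
    print(q, [round(abs(d) ** q, 3) for d in diffs])
```

For $q = 0.001$ every nonzero difference maps to a loss within about 1% of 1, while a zero difference still maps to 0.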
-
Define the Conditional Risk with $\epsilon$-tolerance:
Let's define a loss function $L_\epsilon$:
$$L_\epsilon(g(x), y) = \begin{cases} 0 & \text{if } |g(x) - y| \leq \epsilon \\ 1 & \text{if } |g(x) - y| > \epsilon \end{cases}$$
The conditional risk is:
$$R(x) = \int_{-\infty}^{\infty} L_\epsilon(g(x), y)\, p(y \mid x)\, dy$$
$$R(x) = \int_{|g(x) - y| > \epsilon} 1 \cdot p(y \mid x)\, dy + \int_{g(x) - \epsilon}^{g(x) + \epsilon} 0 \cdot p(y \mid x)\, dy$$
$$R(x) = \int_{|g(x) - y| > \epsilon} p(y \mid x)\, dy$$
-
Minimize the Conditional Risk:
We know that the total probability is 1:
$$\int_{-\infty}^{\infty} p(y \mid x)\, dy = 1$$
$$\int_{|g(x) - y| > \epsilon} p(y \mid x)\, dy + \int_{g(x) - \epsilon}^{g(x) + \epsilon} p(y \mid x)\, dy = 1$$
So, the risk can be rewritten as:
$$R(x) = 1 - \int_{g(x) - \epsilon}^{g(x) + \epsilon} p(y \mid x)\, dy$$
To minimize $R(x)$, we must maximize the integral term:
$$\max_{g(x)} \int_{g(x) - \epsilon}^{g(x) + \epsilon} p(y \mid x)\, dy$$
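This maximization can be sketched numerically: for a fixed small $\epsilon$, the $g(x)$ that captures the most probability mass in $[g(x) - \epsilon,\, g(x) + \epsilon]$ sits at (or very near) the peak of the density. The Gaussian-mixture density, grid, and $\epsilon$ below are illustrative assumptions.

```python
import numpy as np

ys = np.linspace(-5.0, 5.0, 2001)
# Illustrative conditional density p(y|x): a two-component Gaussian mixture
p = 0.3 * np.exp(-(ys + 2.0) ** 2 / 0.5) + 0.7 * np.exp(-(ys - 1.0) ** 2 / 0.5)
dy = ys[1] - ys[0]
p /= p.sum() * dy                      # normalize so the density integrates to 1

eps = 0.1
half = int(round(eps / dy))
# Probability mass inside [c - eps, c + eps] for each candidate center c
mass = np.array([p[max(0, i - half): i + half + 1].sum() * dy
                 for i in range(len(ys))])

g_star = ys[mass.argmax()]             # minimizer of the eps-tolerance risk
mode = ys[p.argmax()]                  # peak of the density
print(g_star, mode)                    # the two coincide, up to grid resolution
```

Both land on the taller mixture component, matching the argument that minimizing the $\epsilon$-tolerance risk concentrates $g(x)$ where the density is highest.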
-
Take the Limit as $\epsilon \to 0$:
For a very small $\epsilon$, the integral can be approximated by the width of the interval times the height of the density at its center:
$$\int_{g(x) - \epsilon}^{g(x) + \epsilon} p(y \mid x)\, dy \approx 2\epsilon \cdot p(g(x) \mid x)$$
So, we want to maximize:
$$\max_{g(x)} 2\epsilon \cdot p(g(x) \mid x)$$
Since $2\epsilon$ is a positive constant, this is equivalent to maximizing the probability density function itself:
$$g^*(x) = \arg\max_{y} p(y \mid x)$$
-
Interpret the Result:
The value of $y$ that maximizes the probability density function $p(y \mid x)$ is, by definition, the mode of the distribution.
Therefore, $g^*(x) = \operatorname{mode}(y \mid x)$.
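To connect this result with the other Minkowski losses: minimizing the empirical risk $\frac{1}{n}\sum_i |g - y_i|^q$ recovers the mean for $q = 2$, the median for $q = 1$, and drifts toward the mode as $q$ shrinks. A sketch under assumed data (a right-skewed Gamma(2, 1) sample, whose mean is 2, median is about 1.68, and mode is 1):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=20_000)   # right-skewed sample
grid = np.linspace(0.01, 8.0, 400)                 # candidate values of g(x)

def empirical_minimizer(q):
    """Grid-search the g that minimizes the empirical risk mean(|g - y|^q)."""
    risks = [np.mean(np.abs(g - y) ** q) for g in grid]
    return grid[int(np.argmin(risks))]

print("q=2   ->", empirical_minimizer(2.0))   # close to the sample mean
print("q=1   ->", empirical_minimizer(1.0))   # close to the sample median
print("q=0.1 ->", empirical_minimizer(0.1))   # pulled toward the mode
```

The three minimizers are ordered mode < median < mean for this right-skewed distribution, which is exactly the behavior the $q \to 0$ limit predicts.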