
Answer

Prerequisites

  • Bayes Decision Rule (BDR)
  • 0-1 Loss Function
  • Conditional Risk
  • Definition of Mode

Step-by-Step Derivation

  1. Analyze the Loss Function as $q \rightarrow 0$: The Minkowski loss is $L_q(g(x), y) = |g(x) - y|^q$. Consider the limit as $q \rightarrow 0$:

    • If $g(x) \neq y$, then $|g(x) - y| > 0$. Any fixed positive number raised to the power $q$ tends to 1 as $q \to 0^+$, so $\lim_{q \to 0^+} |g(x) - y|^q = 1$.
    • If $g(x) = y$, then $|g(x) - y| = 0$, and $0^q = 0$ for every $q > 0$, so the loss stays at 0 throughout the limit.

    Therefore, as $q \rightarrow 0$, the Minkowski loss approaches the 0-1 loss function (familiar from classification, but here applied to a continuous target):

    $$L_0(g(x), y) = \begin{cases} 0 & \text{if } g(x) = y \\ 1 & \text{if } g(x) \neq y \end{cases}$$

    Note: In a strictly continuous setting, the probability of guessing exactly $y$ is zero. A more rigorous approach considers a small tolerance $\epsilon$ around $g(x)$ (the loss is 0 if $|g(x) - y| < \epsilon$ and 1 otherwise) and then takes the limit as $\epsilon \rightarrow 0$.

  2. Define the Conditional Risk with an $\epsilon$-tolerance: Define the loss function $L_\epsilon$:

    $$L_\epsilon(g(x), y) = \begin{cases} 0 & \text{if } |g(x) - y| \le \epsilon \\ 1 & \text{if } |g(x) - y| > \epsilon \end{cases}$$

    The conditional risk is then:

    $$R(x) = \int_{-\infty}^{\infty} L_\epsilon(g(x), y)\, p(y|x)\, dy = \int_{|g(x)-y| > \epsilon} 1 \cdot p(y|x)\, dy + \int_{g(x)-\epsilon}^{g(x)+\epsilon} 0 \cdot p(y|x)\, dy = \int_{|g(x)-y| > \epsilon} p(y|x)\, dy$$

  3. Minimize the Conditional Risk: Since the conditional density integrates to 1,

    $$\int_{-\infty}^{\infty} p(y|x)\, dy = 1 \quad\Longrightarrow\quad \int_{|g(x)-y| > \epsilon} p(y|x)\, dy + \int_{g(x)-\epsilon}^{g(x)+\epsilon} p(y|x)\, dy = 1$$

    So the risk can be rewritten as:

    $$R(x) = 1 - \int_{g(x)-\epsilon}^{g(x)+\epsilon} p(y|x)\, dy$$

    To minimize $R(x)$, we must maximize the integral term:

    $$\max_{g(x)} \int_{g(x)-\epsilon}^{g(x)+\epsilon} p(y|x)\, dy$$

  4. Take the Limit as $\epsilon \rightarrow 0$: For a very small $\epsilon$ (assuming $p(y|x)$ is continuous at $g(x)$), the integral is approximately the width of the interval times the height of the density at its center:

    $$\int_{g(x)-\epsilon}^{g(x)+\epsilon} p(y|x)\, dy \approx 2\epsilon \cdot p(g(x)|x)$$

    So we want to maximize:

    $$\max_{g(x)} 2\epsilon \cdot p(g(x)|x)$$

    Since $2\epsilon$ is a positive constant, this is equivalent to maximizing the conditional density itself:

    $$g^*(x) = \arg\max_{y} p(y|x)$$

  5. Interpret the Result: The value of $y$ that maximizes the conditional density $p(y|x)$ is, by definition, the mode of the distribution. Therefore:

    $$g^*(x) = \text{mode}(y|x)$$
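The pointwise limit in step 1 can be checked numerically: for a fixed pair with $g(x) \neq y$, the Minkowski loss climbs toward 1 as $q$ shrinks, while it stays at 0 whenever the prediction is exact. A minimal sketch (the function name is illustrative):

```python
# Numeric check of the q -> 0 behaviour of the Minkowski loss |g - y|^q.
def minkowski_loss(g, y, q):
    return abs(g - y) ** q

for q in [1.0, 0.5, 0.1, 0.01]:
    # g != y: loss approaches 1; g == y: loss is exactly 0 for every q > 0.
    print(f"q={q:5.2f}  L(2,5)={minkowski_loss(2.0, 5.0, q):.4f}  "
          f"L(5,5)={minkowski_loss(5.0, 5.0, q):.4f}")
```

At $q = 0.01$ the mismatched loss is already within about 1% of 1, matching the case analysis above.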
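The midpoint approximation in step 4, $\int_{g-\epsilon}^{g+\epsilon} p(y|x)\, dy \approx 2\epsilon \cdot p(g(x)|x)$, can also be sanity-checked. Here $p$ is a standard normal density standing in for $p(y|x)$ at a fixed $x$ (an illustrative assumption, not part of the original derivation):

```python
import math

def p(y):
    # Standard normal density, standing in for p(y|x) at a fixed x.
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

def interval_mass(g, eps, n=1000):
    # Numerically integrate p over [g - eps, g + eps] via the midpoint rule.
    h = 2 * eps / n
    return h * sum(p(g - eps + (i + 0.5) * h) for i in range(n))

g, eps = 0.4, 1e-3
exact = interval_mass(g, eps)
approx = 2 * eps * p(g)
print(f"integral: {exact:.8f}, 2*eps*p(g): {approx:.8f}")  # nearly identical
```

For a smooth density the relative error of this approximation shrinks like $\epsilon^2$, which is why the constant $2\epsilon$ can be factored out before taking the limit.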
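Steps 2 through 5 can be verified end to end: pick a skewed conditional density, grid-search the $\epsilon$-tolerance risk $R(g) = 1 - \int_{g-\epsilon}^{g+\epsilon} p(y|x)\, dy$, and confirm the minimizer sits at the mode rather than at the mean. The two-component Gaussian mixture below is an illustrative assumption (its mode is near 0 while its mean is 0.9):

```python
import math

def gauss(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Skewed stand-in for p(y|x): mode near 0, mean at 0.7*0 + 0.3*3 = 0.9.
def p(y):
    return 0.7 * gauss(y, 0.0, 0.5) + 0.3 * gauss(y, 3.0, 1.0)

def risk(g, eps=0.05, n=200):
    # R(g) = 1 - integral of p over [g - eps, g + eps] (midpoint rule).
    h = 2 * eps / n
    return 1.0 - h * sum(p(g - eps + (i + 0.5) * h) for i in range(n))

grid = [-2.0 + 0.01 * i for i in range(601)]   # candidate predictions on [-2, 4]
g_star = min(grid, key=risk)                   # minimizer of the eps-tolerance risk
mode = max(grid, key=p)                        # argmax of the density itself
print(f"risk minimizer: {g_star:.2f}, mode: {mode:.2f}")  # both land near 0
```

The risk minimizer coincides with the mode and is well away from the mean, which is what separates the $q \to 0$ limit from the $q = 2$ (mean) and $q = 1$ (median) cases.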