Answer

Prerequisite Knowledge

  • Kernel Density Estimator (KDE): The KDE is defined as
    $$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h^d}\, k\!\left(\frac{x - x_i}{h}\right).$$
    Let $\tilde{k}(u) = \frac{1}{h^d}\, k(u/h)$; then $\hat{p}(x) = \frac{1}{n} \sum_{i=1}^n \tilde{k}(x - x_i)$.

  • Expectation of a Sum (linearity): $\mathbb{E}\!\left[\sum_i Y_i\right] = \sum_i \mathbb{E}[Y_i]$.

  • Convolution: $(f * g)(x) = \int f(\mu)\, g(x - \mu)\, d\mu$.
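
The estimator above can be sketched numerically in one dimension (so $d = 1$). This is an illustrative sketch, not part of the original answer: the Gaussian kernel choice and the `kde` helper are assumptions made for the example.

```python
import numpy as np

def kde(x, samples, h):
    """1-d kernel density estimate at points x with a Gaussian kernel k.

    Implements p_hat(x) = (1/n) * sum_i k_tilde(x - x_i),
    where k_tilde(u) = k(u / h) / h.
    """
    x = np.atleast_1d(x)
    # Scaled distances (x - x_i) / h, one row per evaluation point
    u = (x[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard Gaussian kernel
    return k.mean(axis=1) / h  # average of k_tilde(x - x_i) over samples

rng = np.random.default_rng(0)
samples = rng.standard_normal(5000)  # x_i ~ N(0, 1), i.i.d.
est = kde(0.0, samples, h=0.3)[0]
# For comparison: the true N(0, 1) density at 0 is 1/sqrt(2*pi) ~= 0.3989;
# the estimate lands near it, but slightly low, due to the smoothing bias
# derived below.
```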

Step-by-Step Answer

  1. Write down the expectation of the estimator: Since the $x_i$ are independent and identically distributed (i.i.d.) samples from $p(x)$, and expectation is linear:

     $$\mathbb{E}_X[\hat{p}(x)] = \mathbb{E}\!\left[\frac{1}{n} \sum_{i=1}^n \tilde{k}(x - x_i)\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[\tilde{k}(x - x_i)]$$
  2. Simplify using identical distribution: Since all $x_i$ follow the same distribution $p(x)$, $\mathbb{E}[\tilde{k}(x - x_i)]$ is the same for every $i$, so the average collapses to a single term:

     $$\mathbb{E}_X[\hat{p}(x)] = \mathbb{E}[\tilde{k}(x - x_1)]$$
  3. Calculate the expectation: By the definition of the expectation of a function of a continuous random variable $x_1 \sim p$ (writing $\mu$ for the integration variable):

     $$\mathbb{E}[\tilde{k}(x - x_1)] = \int \tilde{k}(x - \mu)\, p(\mu)\, d\mu$$
  4. Relate to convolution: The integral $\int p(\mu)\, \tilde{k}(x - \mu)\, d\mu$ is exactly the convolution of $p$ and $\tilde{k}$, denoted $(p * \tilde{k})(x)$:

     $$\mathbb{E}_X[\hat{p}(x)] = (p * \tilde{k})(x)$$
  5. Interpretation of Bias: The expected value of the KDE is not the true density $p(x)$, but the true density convolved (smoothed) with the kernel:

     $$\text{Bias}[\hat{p}(x)] = \mathbb{E}[\hat{p}(x)] - p(x) = (p * \tilde{k})(x) - p(x)$$

     The KDE is therefore a biased estimator. The convolution "smears" the probability mass of $p(x)$, typically lowering peaks and filling in valleys. The bias depends on the bandwidth $h$: as $h \to 0$, $\tilde{k}$ approaches a Dirac delta, the convolution $(p * \tilde{k})(x)$ approaches $p(x)$ itself, and the bias vanishes (the estimator is asymptotically unbiased).
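
The identity $\mathbb{E}_X[\hat{p}(x)] = (p * \tilde{k})(x)$ can be checked numerically. The sketch below assumes $p$ is the standard normal density and $k$ is Gaussian, in which case the convolution has a closed form: $p * \tilde{k}$ is the $N(0, 1 + h^2)$ density. Averaging the KDE at a point over many independent datasets should match that convolution, and at the peak it should fall below the true density value, illustrating the bias.

```python
import numpy as np

rng = np.random.default_rng(1)
h, n, reps = 0.5, 200, 2000
x = 0.0  # evaluation point (the peak of p)

def kde_at(x, samples, h):
    """KDE value at a single point x, Gaussian kernel, bandwidth h."""
    u = (x - samples) / h
    return np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h))

# Monte Carlo estimate of E[p_hat(x)]: average the KDE over many
# independent datasets of n samples each, drawn from p = N(0, 1)
vals = [kde_at(x, rng.standard_normal(n), h) for _ in range(reps)]
mc_mean = np.mean(vals)

# Closed form: for p = N(0, 1) and a Gaussian kernel with bandwidth h,
# (p * k_tilde) is the N(0, 1 + h^2) density
conv = np.exp(-0.5 * x**2 / (1 + h**2)) / np.sqrt(2 * np.pi * (1 + h**2))

true_p = 1 / np.sqrt(2 * np.pi)  # true density p(0)
# mc_mean agrees with conv, and conv < true_p: smoothing lowers the peak,
# so the bias at x = 0 is negative
```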