Monday, February 4, 2008

Conditional means, medians and all that

Let D be a distribution over X x Y, where X is some feature space and Y is a real-valued label. It is well known that the minimizer of the squared loss is the conditional mean mean(x) = E[y], where the expectation is over the draw of y from D|x. It's your homework problem number 3. (For those of you who are overly concerned with rigor, assume that X and Y are closed subsets of R^m and R, respectively. This is clearly true for any machine representation of X and Y.)

In many situations, the mean is not sufficiently robust, and one is actually interested in the conditional median (see this book). The conditional median is formally defined by median(x) = {q in R : D(Y ≤ q | X = x) ≥ 1/2 and D (Y ≥ q | X = x) ≥ 1/2}. The median is also the solution to the problem of minimizing E|f(x)-y|, where the expectation is with respect to D.
Note that the median may not be unique when the conditional distribution has regions with zero mass.

The argument goes like this: Fix x. Consider two equal point masses (with respect to D|x) at locations y1 and y2 in R. The loss suffered by any prediction y in the interval [y1,y2] is (y-y1)+(y2-y) = (y2-y1). Any y not in [y1,y2] always induces a larger loss, so the minimizer is in [y1,y2]. Since we can break D|x and break it into equal mass pairs with y2 above and y1 below the median (by the definition of the median), the absolute error loss is minimized when f(x) = median(x). A similar argument holds for discrete random variables (special case).