Derivation of the Brier scoring rule from basic requirements


The Brier score is a way of quantifying the accuracy of probabilistic forecasts of binary events, for example the statement that there is an 80% chance that it will rain in a certain area tomorrow. The Brier score for n predictions and events is defined as

    BS = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2

where p_i is the predicted probability that event i will occur, and y_i is 1 if the event occurs and 0 otherwise.
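
To make the definition concrete, here is a minimal sketch in Python. It assumes the usual form of the Brier score, the mean of the squared differences between each predicted probability and its 0/1 outcome. The function name `brier_score` and the example forecasts are my own illustrative choices.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("need at least one prediction")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical data: four forecasts and whether each event occurred.
forecasts = [0.8, 0.3, 0.9, 0.5]
observed = [1, 0, 0, 1]
print(brier_score(forecasts, observed))
```

Lower is better: a forecaster who always predicts the outcome with certainty scores 0, and one who always predicts 0.5 scores 0.25.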

The Brier score can be viewed as a loss function with the p_i being the decisions and the y_i the outcomes, and is in fact a form of squared error loss. While the squared error loss is a natural choice to a statistician, I wondered if there might be other reasonable choices for scoring this sort of prediction, or if a small set of desirable properties of a scoring function yields the Brier score as the unique solution. It turns out that three reasonable requirements on a loss function l(y; x), where y is the outcome and x is the prediction, are enough to specify the Brier score up to a multiplicative constant:

  1. Integrity: if the decision-maker believes that an event has a probability p of occurring, then p should be the unique prediction x that minimizes the risk r(x; p), or expected loss E[l(Y; x)].
  2. Symmetry: l(0; x) = l(1; 1-x).
  3. No penalty for perfect predictions: l(1; 1) = 0 and l(0; 0) = 0 (it turns out one of these implies the other given points 1 and 2).
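
As a sanity check, the three requirements can be tested numerically for the squared-error loss l(y; x) = (y - x)^2. This is only a sketch on a grid of 101 candidate predictions, not a proof, and the helper names (`loss`, `risk`, `best_prediction`) are my own.

```python
def loss(y, x):
    """Squared-error loss for outcome y in {0, 1} and predicted probability x."""
    return (y - x) ** 2


def risk(x, p):
    """Expected loss of predicting x when the event has probability p."""
    return (1 - p) * loss(0, x) + p * loss(1, x)


# Candidate predictions 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

# Requirement 2 (symmetry): l(0; x) = l(1; 1 - x) at every grid point.
symmetric = all(abs(loss(0, x) - loss(1, 1 - x)) < 1e-12 for x in grid)

# Requirement 3 (no penalty for perfect predictions).
no_penalty = loss(1, 1) == 0 and loss(0, 0) == 0


# Requirement 1 (integrity): the grid point with the lowest risk is the belief p.
def best_prediction(p):
    return min(grid, key=lambda x: risk(x, p))


honest = all(best_prediction(p) == p for p in grid)

print(symmetric, no_penalty, honest)
```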

We start with the integrity requirement. Because the outcome is binary, the risk when the decision-maker believes that an event has probability p of occurring is r(x; p) = (1-p) l(0; x) + p l(1; x). Assuming l is twice differentiable in x, we can find the minima by finding the points x at which the first derivative with respect to x, r’(x; p) = (1-p) l_x(0; x) + p l_x(1; x), is equal to zero and the second derivative is non-negative.
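
The first- and second-order conditions can be illustrated with finite differences. The sketch below assumes the squared-error loss and a belief of p = 0.8. The step sizes and function names are arbitrary choices of mine.

```python
def squared_error(y, x):
    """Candidate loss l(y; x) = (y - x)^2."""
    return (y - x) ** 2


def risk(x, p, loss):
    """r(x; p) = (1 - p) l(0; x) + p l(1; x)."""
    return (1 - p) * loss(0, x) + p * loss(1, x)


def d_risk(x, p, loss, h=1e-5):
    """Central finite-difference estimate of r'(x; p)."""
    return (risk(x + h, p, loss) - risk(x - h, p, loss)) / (2 * h)


def d2_risk(x, p, loss, h=1e-4):
    """Central finite-difference estimate of r''(x; p)."""
    return (risk(x + h, p, loss) - 2 * risk(x, p, loss) + risk(x - h, p, loss)) / h ** 2


p = 0.8
slope_at_p = d_risk(p, p, squared_error)          # close to 0 at the honest prediction
curvature_at_p = d2_risk(p, p, squared_error)     # positive, so x = p is a minimum
slope_elsewhere = d_risk(0.6, p, squared_error)   # negative: risk still falls as x rises
print(slope_at_p, curvature_at_p, slope_elsewhere)
```

The slope vanishes only at x = p and the curvature is positive there, which is what the integrity requirement asks for.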

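By the symmetry requirement l(0; x) = l(1; 1-x), differentiating both sides with respect to x gives l_x(0; x) = -l_x(1; 1-x), where l_x denotes the derivative of l with respect to the prediction. The first-order condition r’(x; p) = (1-p) l_x(0; x) + p l_x(1; x) = 0 can therefore be rewritten as -(1-p) l_x(1; 1-x) + p l_x(1; x) = 0. Integrity requires this to hold at x = p, so we have (1-p) l_x(0; p) = -p l_x(1; p). This is satisfied by l_x(1; 1-p) = -c p and l_x(1; p) = -c (1-p), provided the proportionality factor c is taken to be a constant that does not depend on p. The second derivative of the risk is then (1-p) c + p c = c. Since the second derivative must be non-negative at a minimum, c must be non-negative. Since the minimizer is to be unique, c cannot be zero. A similar argument for the other outcome yields l_x(0; p) = c p and l_x(0; 1-p) = c (1-p). Since these hold for every p, integrating with respect to the prediction gives l(1; p) = (c/2) (1-p)^2 + k and l(0; p) = (c/2) p^2 + k, with the same constant k in both by symmetry. Absorbing the factor 1/2 into c, we have that l(y; p) = c (y - p)^2 + k for all p. The lack of penalty for perfect predictions implies that k = 0.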
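
The integration step can be checked numerically. The sketch below assumes derivatives of the form l_x(1; x) = -c (1 - x) and l_x(0; x) = c x, the signs that make the risk a minimum rather than a maximum. It uses a hypothetical constant c = 2 and integrates from the perfect prediction, where the loss is zero, out to x. The trapezoidal helper `integrate` is my own.

```python
def integrate(f, a, b, steps=1000):
    """Trapezoidal rule for the integral of f from a to b (b may be below a)."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


C = 2.0  # hypothetical choice of the constant c


def dl1(x):
    """Assumed derivative l_x(1; x) = -c (1 - x)."""
    return -C * (1 - x)


def dl0(x):
    """Assumed derivative l_x(0; x) = c x."""
    return C * x


def loss_from_derivative(y, x):
    """Recover l(y; x) by integrating l_x from the perfect prediction to x."""
    if y == 1:
        return integrate(dl1, 1.0, x)  # uses l(1; 1) = 0
    return integrate(dl0, 0.0, x)      # uses l(0; 0) = 0


for x in (0.0, 0.25, 0.5, 0.8, 1.0):
    print(x, loss_from_derivative(1, x), loss_from_derivative(0, x))
```

With c = 2 the recovered values agree with (y - x)^2, the squared-error loss with multiplicative constant 1.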

Thus we obtain the Brier score up to a multiplicative constant. Such a factor doesn’t change any meaningful properties of the loss function, so this isn’t really a problem. Why not set it to 1?

Note: additivity is a standard property of loss functions, so I didn’t include it as a separate requirement for this particular case.