Numerically stable algorithm for computing the running mean

Published on 2021-04-16.

In computational statistics a major role is played by algorithms for calculating variance and (to a less degree) mean. Since these algorithms deal with floating-point arithmetic, it is important that they are numerically stable while also avoiding overflows. In this article we'll take a look at Welford's algorithm for computing the running variance and mean with a focus on its numerical properties.

Introduction

This blog post was inspired by the following message in ##C on freenode:

<mikoto-chan> Hello
<mikoto-chan> I'm reading a book on C and there is a really fishy piece of code that I don't understand
<mikoto-chan> https://dpaste.com/G2L5DN4AG
<mikoto-chan> I need to find out how to make "Naive avg" and "Average" differ without making it overflow
<mikoto-chan> basically, I need to demonstrate that "Average" is indeed better than "Naive avg"

Which references exercise 15 in chapter 1 of A book on C.

15 (Suggested to us by Donald Knuth at Stanford University.) In the running_sum program in Section 1.6, "Flow of Control," on page 26, we first computed a sum and then divided by the number of summands to compute an average. The following program illustrates a better way to compute the average:
/* Compute a better average. */

#include<stdio.h>

int main(void)
{
int     i;
double  x;
double  avg    =  0.0; /* a better average */
double  navg;          /* a naive average */
double  sum    =  0.0;

printf("%5s%17s%17s%17s\n%5s%17s%17s%17s\n\n",
"Count", "Item", "Average", "Naive avg",
"-----", "----", "-------", "------");

for (i = 1; scanf("%lf", &x) == 1; ++i)
{
avg += (x - avg) / i;
sum += x;
navg = sum / i;
printf("%5d%17e%17e%17e\n", i, x, avg, navg);
}
return 0;
}

Run this program so that you understand its effects. Note that the better algorithm for computing the average is embodied in the line

avg += (x - avg) / i;

Explain why this algorithm does, in fact, compute the running average. Hint: Do some simple hand calculations first.

The explaination as to why this algorithm computes the running average is left as an exercise to the reader :).

Welford's algorithm for the running mean and variance

This method of computing the running mean dates back to a 1962 paper by B. P. Welford [1] and is presented in Donald Knuth’s Art of Computer Programming, Vol 2 which introduces an algorithm for computing the sample variance in one pass:

\begin{align} M_1 &= x_1, & M_k &= M_{k-1} \oplus (x_k \ominus M_{k-1}) \oslash k \\ S_1 &= 0, & S_k &= S_{k-1} \oplus (x_k \ominus M_{k-1}) \otimes (x_k \ominus M_k) \end{align} for $$2 \leq k \leq n$$, the $$k$$-th estimate of the variance is $$s^2 = S_k/(k–1)$$ and the standard deviation is $$\sigma = \sqrt{S_n / (n-1)}$$. Where $$\oplus$$, $$\ominus$$, $$\otimes$$ and $$\oslash$$ are the floating-point approximation of their respective operators.

The algorithm works by keeping an estimate of the sample mean at the $$k$$-th instance, $$M_k$$, which is used to update the auxiliary second order statistics $$S_k$$. This auxiliary value is used, in turn, to calculate the variance. At the beginning of the data monitoring process, we set $$M_1$$ to $$x_1$$ and $$S_1$$ to zero. After each new observation arrives, we update the stored statistics, as seen above.

To answer mikoto-chan's question (demonstrate that "Avarage" is better than "Naive avg"), we first need to understand in what way the proposed average is better than the naive one: their difference lies in their numerical stability and the presence of overflows.

Numerical stability

The proposed way of computing the running average is more numerically stable. In this case numerical stability refers to a potential loss of precision: in the second method, the intermediate result (sum) will tend to grow without bound, which means we will eventually lose low-end precision, which means that if the average of the numbers we are summing is much larger than the distance of each number from the average, the naive approach will lose mantissa bits. In the first method since we are looking at the relative values, the intermediate result should stay of a roughly similar magnitude from the input data meaning that it will retain precision better.

To better explain the problem consider the representation of a floating-point: a sign, mantissa and an exponent. The mantissa represents the precision, i.e. the significant digits, and there are a fixed number of them (DBL_MANT_DIG, typically 53).

NOTE:
This view is a bit simplified, for more details see IEEE 754 and this paper by Goldberg [2].

As the numbers grow larger, the exponent (between DBL_MIN_EXP, and DBL_MAX_EXP which are respectively -1021 and 1024) begins to increase, which means that the significant digits begin to move away from the binary-point.

The stability only really matters when we have lots of values that are close to each other as they lead to what is known as catastrophic cancellation [3, 4, 5] in the floating point literature.

NOTE:
In this discussion we are not considering the potential loss of precision introduced by adding two numbers that differ in magnitude by the significant digits available (e.g. 1 + 1e-12), in these cases Kahan summation algorithm can be used to compensate for these errors.

For a more thorough comparison of several one-pass and two-pass algorithms for the computation of sample means and variance in terms of performance and accuracy see this article by Ling [6].

Overflow

sum += x;
navg = sum / i;

In the above code suppose we have large numbers, then the value of their sum may exceed DBL_MAX causing overflow which is not the case in the other approach.

Conclusions

In this article we've seen a numerically stable algorithm for computing the running mean, which corresponds to Welford's one-pass algorithm for the running variance. We've finally seen how and why it's considered better than the naive method, which comes to numerical stability and overflow. For the interested reader [6, 7] contains a thorough comparison of various algorithms for computing the running mean and variance, and a list of algorithms for computing the variance.

References

[1]: Welford BP. Note on a method for calculating corrected sums of squares and products. Technometrics. 1962 Aug 1;4(3):419-20.
[2]: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
[3]: Wikipedia: Loss of significance.
[4]: Wikipedia: Catastrophic cancellation.
[5]: Computer Science: An Interdisciplinary Approach, Robert Sedgewick, Kevin Wayne, Addison-Wesley Professional; 1st edition (June 15, 2016).
[6]: Ling, Robert F. (1974). Comparison of Several Algorithms for Computing Sample Means and Variances. Journal of the American Statistical Association, Vol. 69, No. 348, 859-866.
[7]: Wikipedia: Algorithms for calculating variance.