Fast numerically stable moving average


Calculating the moving average of a time series is a pretty simple problem. Just loop over each position, and add up the elements over the window:

void simple_ma(double* in, double* out, int N, int W) {
    for (int i = 0; i < N - W + 1; i++) {
        double sum = 0;
        for (int j = 0; j < W; j++) {
            sum += in[i + j];
        }
        out[i] = sum / W;
    }
}

However, this has a time complexity of $O(NW)$, which means that for a large window size $W$ it becomes quite slow. On my machine, with $N=10^6$ and $W=1000$ the computation takes nearly a second. We can implement the summation with SIMD instructions, but that only gives a speedup of about a factor of 4 (a typical AVX2 machine can perform four double-precision additions in a single instruction).

Naive approach

There is a very simple trick to speed the calculation up and reach a time complexity of $O(N+W)$. First calculate the sum of the first $W$ elements, then add and subtract the newest and oldest elements when moving the window:

void summing_ma(double* in, double* out, int N, int W) {
    double sum = 0;
    for (int i = 0; i < W; i++) {
        sum += in[i];
    }
    out[0] = sum / W;
    for (int i = W; i < N; i++) {
        sum = sum + in[i] - in[i - W];
        out[i - W + 1] = sum / W;
    }
}

This is fast; however, it fails miserably when the numbers are of very different scales. Let's check its output for the following input, using a window size of $W=3$:

double in[] = {1, 1, 1, 1e17, 1, 1, 1, 1};

// Output
1 3.333e+16 3.333e+16 3.333e+16 0 0

From the fifth output onwards, the result is zero, a relative error of 100%. What has happened? We have just hit the main problem of floating point calculations: catastrophic cancellation.

To understand catastrophic cancellation, let's review how floating point numbers are represented. Floating point numbers can be written as $$s\times a\times 2^k,$$ where $s\in \{-1,1\}$ is the sign, $a$ is the significand and $k$ is the exponent. $a$ is normalized such that it is between 1 and 2. This way its first digit is always 1 and can be omitted in the binary representation.

Multiplying two floating point numbers is easy: multiply the significands and add the exponents: $$s_1s_2\times a_1a_2\times 2^{k_1+k_2}.$$ It doesn't matter if one of the numbers is small and the other is large; the significands can still be multiplied, with no loss of precision. That's not the case with addition, however. If one of the addends is much smaller than the other, precision is lost. An addition can be written as (for simplicity, assume both numbers are positive and $k_1>k_2$): $$a_1\times 2^{k_1}+a_2\times 2^{k_2} = (a_1+a_2 2^{k_2-k_1})\times 2^{k_1}.$$ The exponent of the result is the larger of the two, and the significands are added with the second one scaled down. If $k_1$ is much larger than $k_2$, $a_2$ is shifted down by many digits. Since we have a limited number of bits to represent the significand, only the highest bits of $a_2 2^{k_2-k_1}$ are added to $a_1$. The rest is simply cut off, resulting in lost precision.

In extreme cases, the sum might not change at all:

1e17 + 1 = 1e17

That is, adding 1 to 1e17 does not change the value. How does this relate to our buffered moving average? When adding 1e17 to the sum, the variable sum becomes 1e17. On the fifth iteration, when 1e17 leaves the window, this is what the implementation does:

(sum + 1) - 1e17 = (1e17 + 1) - 1e17 = 1e17 - 1e17 = 0

So the output for the fifth average is zero. To make matters worse, sum is now corrupted! The average of every subsequent window will also be zero, introducing large errors forever.

This is a pretty serious flaw of the algorithm. Imagine you are collecting measurements, a time series of a million timestamps, and there is an erroneous number somewhere. Maybe during preprocessing a near-zero value ends up as a very large number after a division. Our fast moving average algorithm will produce nonsensical results after the bad measurement, making all subsequent output useless. A single measurement error invalidates your million-element time series!

Fast and stable summing

The core problem with the summing_ma algorithm is that it buffers the total sum and we need subtractions to remove values that are leaving the window. We should cache the sum such that no subtractions are needed to refresh the results.

The idea is that instead of a linear sum, we calculate the sum in a different order: $$((1+2)+(3+4))$$ If we want to change the number 1 to 10, we only have to recalculate $(10+2)$ and $(10+2)+7$ (note that $3+4$ is already calculated). This can be visualized as a binary tree:

graph TD;
    A[10] --- B[3]
    A --- C[7]
    B --- E[1]
    B --- D[2]
    C --- F[3]
    C --- G[4]

The bottom row holds the input numbers, and inner nodes store the sum of their children. The sum of the input numbers can be found in the root node. After changing 1 to 10 in the input window we get:

graph TD;
    A[19] --- B[12]
    A --- C[7]
    B --- E[10]
    B --- D[2]
    C --- F[3]
    C --- G[4]

    classDef red stroke:#f21111,stroke-width:1.5px
    class A,B,E red

I've marked the updated nodes in red. Note that only one update is performed per level of the tree. The number of levels, and thus the number of updates, is $\lceil\log_2 W\rceil$. On every slide of the window one element is changed in the buffer, so calculating all the moving averages takes $O(N\log W)$ steps. This is a significant speedup over $O(NW)$. In addition, the formula involves no subtractions, just additions, so the results are much more stable.

The pseudocode of the algorithm looks like this:

  1. Initialize the tree by copying the values in the first window to the leaves and calculate the inner node values.
  2. Divide the value in the root node by $W$ and put it in the output array.
  3. Point pos to the first leaf node.
  4. Update the node at pos with the new value entering the window.
  5. Update all the ancestors of pos and calculate the new average based on the root node.
  6. Point pos to the next leaf.
  7. Repeat from Step 4 until finished.

The binary tree is almost complete, meaning every level is full except possibly the last one. This allows for a very efficient implementation that stores the tree in an array:

Each level is stored contiguously, starting with the root node. Moving up and down the tree is easy. For the node at index $k$,

  • the parent's index is $\left\lfloor\frac{k-1}{2}\right\rfloor$,
  • the left child's index is $2k+1$,
  • the right child's index is $2k+2$.

The actual implementation is the following:

void tree_ma(double* in, double* out, int N, int W) {
    // Allocate the buffer (requires <math.h>, <stdlib.h> and <string.h>)
    int d = ceil(log2(W));
    double* buffer = (double*)malloc((1 << (d + 1)) * sizeof(double));
    memset(buffer, 0, (1 << (d + 1)) * sizeof(double));

    // Initialize the buffer, first the leaf nodes
    for (int i = 0; i < W; i++) {
        buffer[(1 << d) - 1 + i] = in[i];
    }

    // Initialize the non-leaf nodes
    for (int i = (1 << d) - 2; i >= 0; i--) {
        buffer[i] = buffer[2 * i + 1] + buffer[2 * i + 2];
    }

    out[0] = buffer[0] / W; // The first average
    int pos = 0;            // Position of the oldest element in the buffer

    for (int i = W; i < N; i++) {
        // Overwrite the oldest element in the buffer
        buffer[(1 << d) - 1 + pos] = in[i];

        // Update the tree
        for (int k = (1 << d) - 1 + pos; k > 0;) {
            k = (k - 1) / 2;
            buffer[k] = buffer[2 * k + 1] + buffer[2 * k + 2];
        }

        // Save the average
        out[i - W + 1] = buffer[0] / W;

        // Step the buffer index
        pos = (pos + 1) % W;
    }

    free(buffer);
}


First, the leaves of the tree are set by copying the first W elements of the input. Then, the inner nodes are calculated backwards, which corresponds to a bottom-up traversal of the tree. Finally, the window is slid $N-W$ times. In each step, we copy the new element to the correct leaf, update its ancestors, and advance the position counter.

Execution time

The following plot shows the running time of the different algorithms. I used a time series with a fixed length of 1M elements. The $x$ axis shows the window size ($W$) and the $y$ axis the runtime. I've also included a SIMD-optimized version of the simple moving average algorithm that uses AVX2 instructions.

As you can see, the SIMD version of the simple algorithm is 4 times faster, since AVX2 allows four double-precision additions in parallel. The tree-based approach is slower at small window sizes but quickly overtakes the non-SIMD simple algorithm (around $W=30$) and the SIMD version (around $W=100$). When the window is large (>1000), it is far faster than any of the other methods. I've also included the summation-based algorithm: its runtime is essentially independent of the window size, and the small variations are probably just measurement noise.

Further extensions

The tree-wise addition fixed the catastrophic cancellation issue of the naive algorithm. But we gained even more than that. The proposed algorithm is essentially pairwise summation, whose asymptotic error is $O(\epsilon\log n)$ when summing $n$ numbers. In contrast, simple linear summation has an error of $O(\epsilon n)$.

The tree-wise summation algorithm is trivially parallelizable. If you have multiple time series, you can use SIMD instructions to calculate 4 time series at the same time, speeding up the computation even more.

The trick we exploited was that we can calculate the sum of a set of numbers recursively: first split the list in two, calculate the sums of the sublists and add the results together. Can we do the same with more complicated statistics, like standard deviation or correlation? While I haven't tried implementing them, both have formulas that calculate the value recursively. The variance of a list can be computed from two sublists' variances using Chan et al.'s formula. Schubert and Gertz provide equations for covariance calculation.
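For reference, the merge step of that variance recursion (as I recall it from Chan et al.; worth double-checking against the paper) combines two sublists $A$ and $B$ via their counts $n$, means $\bar{x}$, and sums of squared deviations $M$:

$$n_{AB} = n_A + n_B,\qquad \bar{x}_{AB} = \frac{n_A\bar{x}_A + n_B\bar{x}_B}{n_{AB}},\qquad M_{AB} = M_A + M_B + \frac{n_A n_B}{n_{AB}}\left(\bar{x}_A - \bar{x}_B\right)^2.$$

The variance is then $M/n$ (or $M/(n-1)$ for the sample variance), so a node would store the triple $(n, \bar{x}, M)$ instead of a single sum.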