A better stencil

In the previous chapters of this course, we have strived to improve the performance of the code while producing results that are as close as possible to those of the reference C++ version.

To be more precise, we aimed for results to be bitwise identical as long as the C++ code is adjusted to use a correct discrete Laplacian stencil¹ and the C++ compiler does not perform optimizations that alter the results of floating-point expressions².

But we cannot ignore forever the fact that the floating-point expressions used by the reference C++ code are not, themselves, optimal from a performance point of view. Thus, starting with this chapter, we will introduce optimizations that alter the numerical output of the program in the name of runtime performance.

In doing so, however, we will be careful not to alter the scientific validity of the simulation output: our output-altering optimizations will strive to produce results that are at least as precise as the current ones, and ideally more precise.

Diffusion gradient analysis

At this point in the course, our diffusion gradient computation looks like this:

// In exercises/src/lib.rs

let [full_u, full_v] = (win_u.rows().into_iter())
    .zip(win_v.rows())
    .zip(STENCIL_WEIGHTS)
    .flat_map(|((u_row, v_row), weights_row)| {
        (u_row.into_iter().copied())
            .zip(v_row.into_iter().copied())
            .zip(weights_row)
    })
    .fold(
        [Vector::splat(0.); 2],
        |[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
            let weight = Vector::splat(weight);
            [
                acc_u + weight * (stencil_u - u),
                acc_v + weight * (stencil_v - v),
            ]
        },
    );

In this loop, we take each (U, V) value from a 3x3 neighborhood around the output position, subtract the (U, V) value at the center of the neighborhood to quantify the difference in chemical concentration, and apply a weight to account for some neighbors being spatially closer than others.

The result of each of these individual computations tells us the propensity of a chemical species to migrate from a neighboring concentration grid cell to the active grid cell (if positive), or from the active grid cell to neighboring cells (if negative).

And by summing these per-neighbor contributions and applying a global weight that encodes how easy it is for a chemical species to migrate from a high-concentration region to a low-concentration region, we will get the net flux of chemicals towards or away from the active grid cell.
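
Stripped of the SIMD vectors and iterator plumbing, the loop above computes something like the following scalar sketch for every output cell (u_win, v_win, u_center and v_center are illustrative names, not identifiers from the exercise code):

// Hypothetical scalar equivalent of the gradient computation for one cell:
// u_win / v_win are 3x3 neighborhoods, u_center / v_center their central values.
let mut grad_u = 0.0;
let mut grad_v = 0.0;
for i in 0..3 {
    for j in 0..3 {
        let weight = STENCIL_WEIGHTS[i][j];
        // Concentration difference with each neighbor, weighted by spatial proximity
        grad_u += weight * (u_win[i][j] - u_center);
        grad_v += weight * (v_win[i][j] - v_center);
    }
}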

Overall, if we were computing with real numbers on a computer with infinite processing power, there would be nothing wrong with this mathematical formula. But we are actually computing with floating-point numbers on a computer with finite processing power, and from this perspective the current form of the computation has a few issues worth pointing out:

  • When a spatial region of the simulation reaches a steady state, the (stencil_u - u) and (stencil_v - v) expressions are subtracting numbers of similar magnitude, and from a floating-point math perspective this creates a risk of…
    • Floating-point cancelations producing low-precision results.
    • Excessively small differences falling into the subnormal range, which can cause precision and performance issues (a later chapter is dedicated to this problem).
  • By summing these cancelation outputs, which in the steady state are imprecise, small, of comparable magnitude, and on average evenly split between positive and negative signs, we introduce even more opportunities for floating-point cancelation. This may degrade the precision of the final output even further.
  • The (stencil_u - u) expression and its V counterpart return 0 at the center of the stencil. If the compiler does not manage to optimize this useless computation out (and it may not, due to IEEE-754 edge cases like NaN and inf³), it will waste CPU time.
  • From an efficiency perspective, we are subtracting u from stencil_u eight to nine times (see above), and likewise for V, when we could instead precompute the sum of the stencil weights before the simulation runs, multiply that sum by u, and perform a single subtraction, as sketched after this list. Again, the compiler is not allowed to perform this optimization on its own, as it changes the output of the program: the rounding associated with the subtraction is no longer performed in the same place(s).
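
The rearrangement mentioned in the last bullet relies on a simple algebraic identity: over real numbers, a weighted sum of differences from the central value factors into a weighted sum of stencil values minus a single product,

    Σᵢⱼ wᵢⱼ · (sᵢⱼ − u)  =  (Σᵢⱼ wᵢⱼ · sᵢⱼ) − (Σᵢⱼ wᵢⱼ) · u

where the sum of weights Σᵢⱼ wᵢⱼ only needs to be computed once, before the simulation runs. In floating-point arithmetic, however, the left-hand side rounds after each of the nine subtractions while the right-hand side rounds its single subtraction only once, so the two expressions generally do not produce bitwise identical results.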

By fixing these problems, we will get a diffusion gradient computation that…

  • Features only a single operation with a floating-point cancelation hazard (which is the minimum possible in a gradient computation) instead of multiple cascading ones, and should therefore produce a more precise and reliable floating-point output.
  • Does not perform any useless subtraction, and should therefore run faster by virtue of performing fewer floating-point computations overall.

Fixing the stencil

Our STENCIL_WEIGHTS matrix is almost a discrete Laplace operator, but with one very important difference. If you go to the associated Wikipedia page, you will find that matrices for this operator feature a central weight that is negative (unlike the neighbor weights, which are positive) and whose absolute value is equal to the sum of all neighbor weights.

So let’s see what happens if we switch to this alternate stencil weights matrix.

// Change in exercises/src/options.rs

/// Weights of the discrete convolution stencil
#[rustfmt::skip]
pub const STENCIL_WEIGHTS: [[Float; 3]; 3] = [
    [0.25, 0.5, 0.25],
    [0.5, -3.0, 0.5],  // <- New negative center weight
    [0.25, 0.5, 0.25]
];
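
To convince yourself that this matrix has the property described above, a small test along the following lines could be dropped next to the constant (this check is illustrative and not part of the exercise code). All the weights involved are exactly representable powers of two (or small sums thereof), so the equality holds exactly in floating-point as well as over real numbers.

// Illustrative sanity check: for a discrete Laplace operator, the central
// weight must be the negated sum of all neighbor weights.
#[test]
fn center_weight_balances_neighbors() {
    let neighbor_sum: Float = STENCIL_WEIGHTS
        .iter()
        .flatten()
        .copied()
        .sum::<Float>()
        - STENCIL_WEIGHTS[1][1];
    assert_eq!(STENCIL_WEIGHTS[1][1], -neighbor_sum);
}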

The way our computation is currently written, changing the central weight of the stencil to a negative value does not change the computation’s output, because the central weight is multiplied by the zero produced by (stencil_u - u) at the center and therefore does not contribute to the full_u and full_v results.

But with this alternate weights matrix, we get the desirable mathematical property that if we were operating in the set of real numbers, and not floating-point numbers, then we would get the exact same numerical result by removing the subtraction of the central u/v value as follows:

// Change in exercises/src/lib.rs

let [full_u, full_v] = (win_u.rows().into_iter())
    .zip(win_v.rows())
    .zip(STENCIL_WEIGHTS)
    .flat_map(|((u_row, v_row), weights_row)| {
        (u_row.into_iter().copied())
            .zip(v_row.into_iter().copied())
            .zip(weights_row)
    })
    .fold(
        [Vector::splat(0.); 2],
        |[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
            let weight = Vector::splat(weight);
            [
                acc_u + weight * stencil_u,  // <┬ Subtractions are gone
                acc_v + weight * stencil_v,  // <┘
            ]
        },
    );

In the set of floating-point numbers, the result will not be the same because, as discussed earlier, we are not performing floating-point rounding in the same places. But the good news is that the new result is not merely different, it is actually of higher numerical quality:

  • All multiplications of neighboring stencil values by their weights are numerically safe: they multiply a number whose magnitude is close to unity by a number whose magnitude is slightly lower. And because our weights happen to be powers of two, the result is actually exact, with no loss of floating-point precision with respect to the input stencil value (see the sketch after this list).
  • Summing the products of stencil values by weights, which are positive numbers of roughly comparable magnitude (within a factor of 2 of each other when the simulation is in a steady state), will produce a result that is not exact but of high numerical quality.
  • So the only place where this formulation still carries a cancelation risk is the point where we implicitly subtract the central U/V value, by multiplying it with a negative weight and adding the product to the accumulator. But now this only happens once, which is the minimal number of subtractions that a gradient-like computation can get away with.
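
As a standalone illustration of the power-of-two point above (not part of the exercise code), scaling by 0.25 or 0.5 only changes the exponent of a floating-point number, so it can be undone exactly:

// Illustrative check: multiplying a normal float by a power of two preserves
// every mantissa bit (barring underflow/overflow), so it round-trips exactly.
fn main() {
    let x: f32 = 0.123_456_79;
    assert_eq!((0.25 * x) * 4.0, x);
    assert_eq!((0.5 * x) * 2.0, x);
}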

Therefore, we can conclude that these two changes (negative stencil weight and removal of the subtractions) result in a program that should run faster (by virtue of performing 2 × 9 = 18 fewer computations per output data point) AND produce higher-quality results. Sometimes, even in the wild world of floating-point computations, we can have our cake and eat it too!

Exercise

Implement the optimization discussed above and measure how it affects the performance of the simulation on your CPU.


  1. The Laplacian stencil used by the reference C++ version at the time of writing puts the same weight on the diagonal neighbors as on the top/bottom/left/right neighbors. This does not correctly reflect spatial distances on an Eulerian 2D grid and results in excessive diffusion of chemical species along the diagonal axes compared to the horizontal/vertical axes. The author of the reference C++ version has been made aware of this problem, but sadly he does not want to change the stencil expression because he thinks the incorrect stencil makes the output images look better…

  2. In theory, C++ compilers are bound by the C++ standard to optimize code according to the as-if rule, which states that the observable output of a program must not be affected by program optimizations. But because this rule prevents many optimizations of floating-point expressions, many compilers treat it as more of a guideline. For example, GCC allows itself to transform a * b + c into fma(a, b, c) in its default configuration, and Intel compilers enable a “safe” subset of -ffast-math optimizations by default. In -ffast-math mode, compilers allow themselves to transform floating-point expressions in any manner that would be valid on mathematical real numbers, which means that the numerical output of a program is no longer reproducible and can change from one compiler version to the next. In contrast, Rust strictly follows the as-if rule, leaving floating-point optimizations fully in the hands of the programmer.

  3. In the world of IEEE-754 floating-point, the sum of +inf and -inf is NaN. This means that if u happens to be inf at the center position only, and has normal values elsewhere, the full_u result will be NaN if the center term is included and -inf otherwise. The compiler may not be able to prove that the U and V concentration tables do not contain infinite IEEE-754 values, because this requires whole-program analysis that goes from the point where the dataset is initialized to the point where all simulation iterations are run, and proves that no other unknown code (like, say, the HDF5 library) may modify the values of U and V between two simulation iterations. That sort of analysis is very hard for compiler optimizers, and usually fails, so instead the compiler must assume that the U and V datasets may contain any possible IEEE-754 floating-point value. When that is the case, the optimized program must perform the almost-always-useless central (stencil_u - u) computation and integrate it into the full_u accumulator, in order to preserve the as-if rule guarantee that the optimized program produces the same output as the unoptimized program under all possible runtime circumstances.
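
     For the record, this edge case is easy to reproduce in isolation (a standalone illustration, not part of the exercise code):

     // Illustrative reproduction of the IEEE-754 edge case discussed above
     fn main() {
         let inf = f32::INFINITY;
         // With the central term, the inf - inf subtraction poisons the sum with NaN...
         assert!((inf - inf).is_nan());
         // ...without it, the neighbor terms merely drive the sum towards -inf.
         assert_eq!(0.5 * (1.0 - inf), f32::NEG_INFINITY);
     }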