
FMA vs ILP

If you know about fused multiply-add (FMA) CPU instructions, you may have been thinking for a while now that our Gray-Scott computation is a good place to leverage them.

More specifically, if we look at an iteration of the update loop…

// Access the SIMD data corresponding to the center concentration
let u = win_u[STENCIL_OFFSET];
let v = win_v[STENCIL_OFFSET];

// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.rows().into_iter())
    .zip(win_v.rows())
    .zip(STENCIL_WEIGHTS)
    .flat_map(|((u_row, v_row), weights_row)| {
        (u_row.into_iter().copied())
            .zip(v_row.into_iter().copied())
            .zip(weights_row)
    })
    .fold(
        [Vector::splat(0.); 2],
        |[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
            let weight = Vector::splat(weight);
            [
                acc_u + weight * stencil_u,
                acc_v + weight * stencil_v,
            ]
        },
    );

// Compute SIMD versions of all the float constants that we use
let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
let feedrate = Vector::splat(options.feedrate);
let killrate = Vector::splat(options.killrate);
let deltat = Vector::splat(options.deltat);
let ones = Vector::splat(1.0);

// Compute the output values of u and v
let uv_square = u * v * v;
let du = diffusion_rate_u * full_u - uv_square + feedrate * (ones - u);
let dv = diffusion_rate_v * full_v + uv_square - (feedrate + killrate) * v;
*out_u = u + du * deltat;
*out_v = v + dv * deltat;

…we can see that the computation of full_u/full_v is a big sum of products, the computation of du/dv is almost a big sum of products, and the final computation of out_u/out_v is a product followed by a sum.

Given that modern hardware can execute FMA instructions at about the same rate as multiplication or addition instructions (2 instructions per clock cycle in either case for most1 modern x86 CPUs), it is tempting to replace these sequences of multiplications and additions with FMA instructions. From this change, we expect two different benefits:

  • Increased numerical output precision, since in FMA there is no rounding between the multiplication and the addition (illustrated right after this list).
  • Increased runtime performance, since we are effectively performing two floating-point operations for the price of one.
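To illustrate the first benefit, here is a minimal scalar sketch, separate from the simulation code, in which the fused and unfused versions of the same expression produce different results (the values are hand-picked so that the intermediate rounding matters):

fn main() {
    // 0.1 is not exactly representable in binary floating-point, so the
    // exact product 0.1f32 * 10.0 is slightly above 1.0...
    let (a, b, c) = (0.1f32, 10.0f32, -1.0f32);
    // ...but rounding a * b to f32 snaps it back to exactly 1.0...
    let unfused = a * b + c; // == 0.0
    // ...while mul_add() keeps the exact product and rounds only once
    let fused = a.mul_add(b, c); // == 2^-26, a small nonzero value
    assert_ne!(unfused, fused);
    println!("unfused = {unfused:e}, fused = {fused:e}");
}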

In this chapter, we will therefore attempt to do just that, and see how that works out.

FMA all the things

Enticed by the prospect of getting an easy 2x performance speedup, we might initially proceed to rewrite all the code using fused multiply-add operations…

// We must bring this trait in scope in order to use mul_add()
use std::simd::StdFloat;

// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.rows().into_iter())
    .zip(win_v.rows())
    .zip(STENCIL_WEIGHTS)
    .flat_map(|((u_row, v_row), weights_row)| {
        (u_row.into_iter().copied())
            .zip(v_row.into_iter().copied())
            .zip(weights_row)
    })
    .fold(
        [Vector::splat(0.); 2],
        |[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
            let weight = Vector::splat(weight);
            [
                // NEW: Introduced FMA here
                stencil_u.mul_add(weight, acc_u),
                stencil_v.mul_add(weight, acc_v),
            ]
        },
    );

// Compute SIMD versions of all the float constants that we use
let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
let feedrate = Vector::splat(options.feedrate);
let killrate = Vector::splat(options.killrate);
let deltat = Vector::splat(options.deltat);
let ones = Vector::splat(1.0);

// Compute the output values of u and v
// NEW: Introduced even more FMA there
let uv_square = u * v * v;
let du = diffusion_rate_u.mul_add(full_u, (ones - u).mul_add(feedrate, -uv_square));
let dv = diffusion_rate_v.mul_add(full_v, v.mul_add(-(feedrate + killrate), uv_square));
*out_u = du.mul_add(deltat, u);
*out_v = dv.mul_add(deltat, v);

…and if you benchmark it on a modern AMD CPU, you will see that it does indeed translate into a pretty nice performance improvement:

run_simulation/128x64/total16384/image1/compute
                        time:   [102.82 ms 102.91 ms 103.00 ms]
                        thrpt:  [1.3031 Gelem/s 1.3043 Gelem/s 1.3053 Gelem/s]
                 change:
                        time:   [−39.237% −39.160% −39.086%] (p = 0.00 < 0.05)
                        thrpt:  [+64.167% +64.366% +64.573%]
                        Performance has improved.

run_simulation/512x256/total1024/image1/compute
                        time:   [130.10 ms 131.11 ms 131.97 ms]
                        thrpt:  [1.0171 Gelem/s 1.0237 Gelem/s 1.0317 Gelem/s]
                 change:
                        time:   [−35.531% −34.886% −34.291%] (p = 0.00 < 0.05)
                        thrpt:  [+52.185% +53.577% +55.112%]
                        Performance has improved.

run_simulation/2048x1024/total64/image1/compute
                        time:   [149.09 ms 150.34 ms 151.01 ms]
                        thrpt:  [888.79 Melem/s 892.79 Melem/s 900.25 Melem/s]
                 change:
                        time:   [−27.119% −25.585% −24.074%] (p = 0.00 < 0.05)
                        thrpt:  [+31.707% +34.381% +37.210%]
                        Performance has improved.

run_simulation/4096x2048/total16/image1/compute
                        time:   [231.01 ms 231.91 ms 232.46 ms]
                        thrpt:  [577.38 Melem/s 578.74 Melem/s 581.00 Melem/s]
                 change:
                        time:   [−3.6146% −2.8374% −2.0884%] (p = 0.00 < 0.05)
                        thrpt:  [+2.1329% +2.9202% +3.7501%]
                        Performance has improved.

But is that the best that this hardware can do? Not quite, because there is such a thing as too much of a good thing when it comes to FMA, as we will see next.

A dependency chain problem

Recall that most of the computational work within our Gray-Scott simulation code happens in the computation of the full_u/full_v variables and the du/dv variables that depend on them.

  • full_u/full_v are each the sum of 9 float-float products
  • du/dv are mostly the sum of two float-float products (one of which involves full_u/full_v), plus a third addend that is itself a product of three floats.

When we wrote these sums, we did not try to make the additions happen in a particularly clever order, and as a result they follow a linear (((a + b) + c) + d) + ... pattern. The compiler will not disturb this pattern, because it does not allow itself to change the results of our code, so the same pattern will be present in the machine code that the compiler emits for the CPU.

As the CPU encounters this linear summation pattern, it is going to run into some trouble, because this sort of long summation pattern is latency-bound, not throughput-bound. In other words, even though the CPU is able to start computing two new additions per clock cycle, it will not be able to do so for these additions, because each of them depends on the result of the previous one. Instead, the CPU has to wait for the previous addition's result to be available before starting the next addition, which takes 3 cycles on newer x86 CPUs at the time of writing, and used to take 4 on older ones like Intel's Skylake.

So e.g. the initial full_u computation, which involves a chain of 8 additions, cannot be computed in less than 8x3 = 24 CPU cycles on a CPU where addition has a 3-cycle latency. For the sake of comparison, if we had lots of independent additions to compute, the same CPU could have computed 2x24 = 48 additions in the same amount of time!
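To make this dependency chain more tangible, here is a hypothetical scalar sketch of a short linear summation, annotated with the earliest cycle at which each addition can start on a CPU where addition has a 3-cycle latency:

// Each addition needs the result of the previous one, so they serialize
let t1 = a + b;  // starts at cycle 0, result available at cycle 3
let t2 = t1 + c; // starts at cycle 3, result available at cycle 6
let t3 = t2 + d; // starts at cycle 6, result available at cycle 9
// 3 dependent additions occupy 9 cycles of latency, during which the CPU
// could have started 2 unrelated additions on every single cycle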

While the CPU is waiting for the result of the previous addition to come out, it tries to keep itself busy by working on unrelated computations. For example, it can compute full_v in parallel with full_u. But that's still not enough, because our typical x86 CPU has 2 SIMD ALUs, and in 3-4 cycles these could compute 6-8 floating-point operations.

Other ILP opportunities that the CPU exploited in our original code include…

  • Computing the stencil weight multiplications, which are independent from each other, in parallel. This gives the CPU about 4 operations to perform per pair of additions to full_u and full_v, plus two extra multiplications for the first element of each multiply-add chain (since the compiler optimizes out the initial addition of zero).
  • Performing some of the other operations within the formula of du/dv which do not depend on full_u and full_v, such as the uv_square product.
  • Preparing the next iteration of the loop by computing new pointers, indices, etc.

…and with that, we could have expected to get a CPU pipeline utilization in the 50-75% range. Not great, not terrible. But that was the original code with multiplications and additions. Now we have introduced FMA, which worsens the situation because…

  • By bundling multiplications with additions, FMA prevents the CPU from keeping itself busy with multiplications. The CPU now has to wait for the result of the previous addition to be available, before it can compute the next multiplication-addition pair.
  • Due to its higher complexity, FMA tends to have a higher latency. Even on CPUs where addition has a latency of 3 cycles, FMA has a latency of 4 cycles.

As a result, our CPU computes floating-point operations faster, but with lower instruction-level parallelism, resulting in lower pipeline utilization. This will result in a speedup that is smaller than the 2x theoretically provided by FMA, possibly to the point of performance stagnation or even slowdown on some CPUs.

To avoid this problem, we should use FMA more sparingly and avoid long instruction dependency chains. By using a more balanced mix of FMA, multiplications and additions, we can expose more instruction-level parallelism to the CPU, and thus improve CPU pipeline utilization. And if done carefully this will result in a speedup that is greater than what can be achieved through either FMA or multiply/add pairs alone.
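For instance, summing four products as two independent half-length chains lets the CPU overlap their execution. A hypothetical scalar sketch, with made-up variable names:

// One serial chain: each mul_add() must wait for the previous result
let serial = a3.mul_add(b3, a2.mul_add(b2, a1.mul_add(b1, a0 * b0)));
// Two independent chains of half the length, merged by a final addition;
// the CPU can make progress on both chains during the same cycles
let chain1 = a2.mul_add(b2, a0 * b0);
let chain2 = a3.mul_add(b3, a1 * b1);
let parallel = chain1 + chain2;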

Adding instruction parallelism

Two-way ILP

As a first experiment, we add two-way instruction-level parallelism to the computation of each of the full_u/full_v gradients:

let gradient = |win: ArrayView2<_>| {
    let weight = |i: usize, j: usize| Vector::splat(STENCIL_WEIGHTS[i][j]);
    let mut acc1 = weight(0, 0) * win[[0, 0]];
    let mut acc2 = weight(0, 1) * win[[0, 1]];
    acc1 = weight(0, 2).mul_add(win[[0, 2]], acc1);
    acc2 = weight(1, 0).mul_add(win[[1, 0]], acc2);
    acc1 = weight(1, 1).mul_add(win[[1, 1]], acc1);
    acc2 = weight(1, 2).mul_add(win[[1, 2]], acc2);
    acc1 = weight(2, 0).mul_add(win[[2, 0]], acc1);
    acc2 = weight(2, 1).mul_add(win[[2, 1]], acc2);
    acc1 + acc2 + weight(2, 2) * win[[2, 2]]
};
let full_u = gradient(win_u);
let full_v = gradient(win_v);

To make what’s happening clearer, we are moving away from the iterator formalism for now. We have made the compiler aware of the fact that both the weight matrix and the input window are 3x3 matrices,2 so this will not result in unnecessary bounds checks.

Instead of using a single accumulator per full_u/full_v dependency chain, as we implicitly did before, we now use two accumulators, each handling every other weight * input product. Because our sum contains 9 terms, which is not divisible by 2, we put the remaining product in the final addition, where we combine the two partial sums.

By doing things this way, we approximately double the amount of instruction-level parallelism (ILP) that is available to the CPU and halve the length of the longest sum/FMA dependency chain within the diffusion gradient computation, at the expense of making less intensive use of FMA: where we used to have a single chain of 8 dependent FMAs, the longest FMA chain is now only 3 FMAs long.
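For a hypothetical back-of-the-envelope estimate, take the latencies quoted earlier (3 cycles for an addition, 4 cycles for an FMA) and assume that a multiplication also takes 3 cycles: the critical path of one gradient computation shrinks from one multiplication followed by 8 dependent FMAs, i.e. 3 + 8x4 = 35 cycles, to one multiplication followed by 3 dependent FMAs and 2 trailing additions, i.e. 3 + 3x4 + 2x3 = 21 cycles.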

And on a Zen 2 CPU, this version with more instruction-level parallelism exhibits a pretty nice speedup with respect to using FMAs as much as possible:

run_simulation/128x64/total16384/image1/compute
                        time:   [87.498 ms 87.548 ms 87.575 ms]
                        thrpt:  [1.5326 Gelem/s 1.5331 Gelem/s 1.5339 Gelem/s]
                 change:
                        time:   [−15.008% −14.903% −14.802%] (p = 0.00 < 0.05)
                        thrpt:  [+17.373% +17.514% +17.659%]
                        Performance has improved.

run_simulation/512x256/total1024/image1/compute
                        time:   [104.22 ms 104.79 ms 105.07 ms]
                        thrpt:  [1.2775 Gelem/s 1.2808 Gelem/s 1.2879 Gelem/s]
                 change:
                        time:   [−20.749% −19.887% −18.970%] (p = 0.00 < 0.05)
                        thrpt:  [+23.412% +24.824% +26.182%]
                        Performance has improved.

run_simulation/2048x1024/total64/image1/compute
                        time:   [132.04 ms 134.39 ms 136.11 ms]
                        thrpt:  [986.11 Melem/s 998.72 Melem/s 1.0165 Gelem/s]
                 change:
                        time:   [−10.965% −9.2942% −7.4509%] (p = 0.00 < 0.05)
                        thrpt:  [+8.0507% +10.247% +12.315%]
                        Performance has improved.

run_simulation/4096x2048/total16/image1/compute
                        time:   [178.55 ms 181.12 ms 184.11 ms]
                        thrpt:  [729.01 Melem/s 741.03 Melem/s 751.71 Melem/s]
                 change:
                        time:   [−22.986% −22.078% −21.042%] (p = 0.00 < 0.05)
                        thrpt:  [+26.650% +28.333% +29.847%]
                        Performance has improved.

Adding more ILP

Since increasing the instruction-level parallelism of the program seems beneficial so far, let’s not stop there and try to break down our computation into three parallel instruction streams:

let gradient = |win: ArrayView2<_>| {
    let weight = |i: usize, j: usize| Vector::splat(STENCIL_WEIGHTS[i][j]);
    let mut acc1 = weight(0, 0) * win[[0, 0]];
    let mut acc2 = weight(0, 1) * win[[0, 1]];
    let mut acc3 = weight(0, 2) * win[[0, 2]];
    acc1 = weight(1, 0).mul_add(win[[1, 0]], acc1);
    acc2 = weight(1, 1).mul_add(win[[1, 1]], acc2);
    acc3 = weight(1, 2).mul_add(win[[1, 2]], acc3);
    acc1 = weight(2, 0).mul_add(win[[2, 0]], acc1);
    acc2 = weight(2, 1).mul_add(win[[2, 1]], acc2);
    acc3 = weight(2, 2).mul_add(win[[2, 2]], acc3);
    acc1 + acc2 + acc3
};

Because the number of terms (9) is evenly divisible by the number of accumulators (3), the computation ends up looking cleaner: each accumulator processes one column of the 3x3 stencil. And we are not actually performing fewer FMAs here than in the previous solution.

So we expect this one to perform better, and indeed it does, but only slightly so. It looks like our CPU already has enough operations to keep its pipeline fully utilized, or we are running out of some other resources like CPU registers:

run_simulation/128x64/total16384/image1/compute
                        time:   [85.959 ms 86.173 ms 86.425 ms]
                        thrpt:  [1.5530 Gelem/s 1.5575 Gelem/s 1.5614 Gelem/s]
                 change:
                        time:   [−1.5432% −1.3827% −1.2560%] (p = 0.00 < 0.05)
                        thrpt:  [+1.2720% +1.4021% +1.5674%]
                        Performance has improved.

run_simulation/512x256/total1024/image1/compute
                        time:   [102.70 ms 103.18 ms 103.57 ms]
                        thrpt:  [1.2959 Gelem/s 1.3009 Gelem/s 1.3068 Gelem/s]
                 change:
                        time:   [−2.7136% −1.8784% −1.0620%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0734% +1.9144% +2.7892%]
                        Performance has improved.

run_simulation/2048x1024/total64/image1/compute
                        time:   [133.43 ms 135.55 ms 137.94 ms]
                        thrpt:  [973.01 Melem/s 990.14 Melem/s 1.0059 Gelem/s]
                 change:
                        time:   [−0.9114% +1.4255% +3.5529%] (p = 0.25 > 0.05)
                        thrpt:  [−3.4310% −1.4054% +0.9198%]
                        No change in performance detected.

run_simulation/4096x2048/total16/image1/compute
                        time:   [175.42 ms 178.03 ms 180.00 ms]
                        thrpt:  [745.64 Melem/s 753.92 Melem/s 765.10 Melem/s]
                 change:
                        time:   [−2.7993% −1.2133% +0.2464%] (p = 0.16 > 0.05)
                        thrpt:  [−0.2458% +1.2282% +2.8799%]
                        No change in performance detected.

Because of this, we are not very hopeful that four-way ILP will be helpful for this particular CPU…

let gradient = |win: ArrayView2<_>| {
    let weight = |i: usize, j: usize| Vector::splat(STENCIL_WEIGHTS[i][j]);
    let mut acc1 = weight(0, 0) * win[[0, 0]];
    let mut acc2 = weight(0, 1) * win[[0, 1]];
    let mut acc3 = weight(0, 2) * win[[0, 2]];
    let mut acc4 = weight(1, 0) * win[[1, 0]];
    acc1 = weight(1, 1).mul_add(win[[1, 1]], acc1);
    acc2 = weight(1, 2).mul_add(win[[1, 2]], acc2);
    acc3 = weight(2, 0).mul_add(win[[2, 0]], acc3);
    acc4 = weight(2, 1).mul_add(win[[2, 1]], acc4);
    (acc1 + acc2) + (acc3 + acc4) + weight(2, 2) * win[[2, 2]]
};

…and indeed, it hurts in a few configurations and does not help much in others:

run_simulation/128x64/total16384/image1/compute
                        time:   [84.362 ms 84.413 ms 84.445 ms]
                        thrpt:  [1.5894 Gelem/s 1.5900 Gelem/s 1.5910 Gelem/s]
                 change:
                        time:   [−2.3627% −2.2301% −2.0633%] (p = 0.00 < 0.05)
                        thrpt:  [+2.1068% +2.2810% +2.4199%]
                        Performance has improved.

run_simulation/512x256/total1024/image1/compute
                        time:   [100.20 ms 101.13 ms 101.67 ms]
                        thrpt:  [1.3201 Gelem/s 1.3271 Gelem/s 1.3395 Gelem/s]
                 change:
                        time:   [−2.4991% −1.7540% −0.9228%] (p = 0.00 < 0.05)
                        thrpt:  [+0.9314% +1.7853% +2.5631%]
                        Change within noise threshold.

run_simulation/2048x1024/total64/image1/compute
                        time:   [139.88 ms 140.54 ms 141.23 ms]
                        thrpt:  [950.33 Melem/s 955.01 Melem/s 959.52 Melem/s]
                 change:
                        time:   [+1.4447% +2.9856% +4.6570%] (p = 0.00 < 0.05)
                        thrpt:  [−4.4498% −2.8991% −1.4241%]
                        Performance has regressed.

run_simulation/4096x2048/total16/image1/compute
                        time:   [204.09 ms 204.51 ms 205.08 ms]
                        thrpt:  [654.47 Melem/s 656.30 Melem/s 657.65 Melem/s]
                 change:
                        time:   [+14.030% +15.216% +16.539%] (p = 0.00 < 0.05)
                        thrpt:  [−14.192% −13.206% −12.303%]
                        Performance has regressed.

Therefore, we will stick with three-way ILP on the solution branch of the course.

Higher-level code?

All these raw sequences of instructions with hardcoded array indices are not so nice to read and maintain, so we would rather move to higher-level formalisms. The three-way ILP version of the code lends itself to this because it is effectively operating over rows of data, which can be expressed using e.g. array operations…

let gradient = |win: ArrayView2<_>| {
    let weight = |i: usize, j: usize| Vector::splat(STENCIL_WEIGHTS[i][j]);
    let mut accs: [Vector<SIMD_WIDTH>; 3] =
        std::array::from_fn(|col| weight(0, col) * win[[0, col]]);
    for row in 1..3 {
        for col in 0..3 {
            accs[col] = weight(row, col).mul_add(win[[row, col]], accs[col]);
        }
    }
    accs.into_iter().sum()
};
let full_u = gradient(win_u);
let full_v = gradient(win_v);

…but because this is the hottest part of the code, the compiler is sensitive to every change we make, and even this seemingly straightforward transformation into loops results in a slight performance degradation on the Zen 2 CPU that we have used for benchmarking so far:

run_simulation/128x64/total16384/image1/compute
                        time:   [87.583 ms 87.749 ms 88.108 ms]
                        thrpt:  [1.5233 Gelem/s 1.5296 Gelem/s 1.5325 Gelem/s]
                 change:
                        time:   [+1.7104% +2.2198% +2.7237%] (p = 0.00 < 0.05)
                        thrpt:  [−2.6515% −2.1716% −1.6816%]
                        Performance has regressed.

run_simulation/512x256/total1024/image1/compute
                        time:   [107.48 ms 107.98 ms 108.17 ms]
                        thrpt:  [1.2408 Gelem/s 1.2430 Gelem/s 1.2488 Gelem/s]
                 change:
                        time:   [+2.5778% +4.0035% +5.2857%] (p = 0.00 < 0.05)
                        thrpt:  [−5.0204% −3.8494% −2.5130%]
                        Performance has regressed.

run_simulation/2048x1024/total64/image1/compute
                        time:   [142.56 ms 143.03 ms 143.47 ms]
                        thrpt:  [935.52 Melem/s 938.36 Melem/s 941.47 Melem/s]
                 change:
                        time:   [+3.1596% +4.8315% +6.6384%] (p = 0.00 < 0.05)
                        thrpt:  [−6.2251% −4.6088% −3.0628%]
                        Performance has regressed.

run_simulation/4096x2048/total16/image1/compute
                        time:   [202.14 ms 202.94 ms 203.96 ms]
                        thrpt:  [658.06 Melem/s 661.36 Melem/s 663.97 Melem/s]
                 change:
                        time:   [+13.108% +14.227% +15.429%] (p = 0.00 < 0.05)
                        thrpt:  [−13.367% −12.455% −11.589%]
                        Performance has regressed.

Since this only loses a few percent of throughput in the common case (including our reference configuration with 2 million pixels, which is roughly equivalent to a full HD image) and 10% in the worst case, the author might have tolerated this performance loss in exchange for higher maintainability in a long-lived production codebase.

But this course is all about performance, so we will stick with the best performing version on the solution branch by reverting this change.

Exercise

The astute reader will have noticed that we cannot fairly compare FMA against multiply-add sequences yet, because we have only implemented ILP optimizations in the FMA-based code, not in the original code based on non-fused multiply-add sequences.

Therefore, after implementing and trying out this optimization, you may want to try replacing all FMAs with a * b + c sequences again, and see how the performance compares on your CPU.
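As a starting point, the three-way ILP gradient might look like this with its FMAs replaced by separate multiplications and additions (a sketch following the code above, not the course's reference solution); the FMAs in the du/dv and out_u/out_v computations would need the same treatment:

let gradient = |win: ArrayView2<_>| {
    let weight = |i: usize, j: usize| Vector::splat(STENCIL_WEIGHTS[i][j]);
    let mut acc1 = weight(0, 0) * win[[0, 0]];
    let mut acc2 = weight(0, 1) * win[[0, 1]];
    let mut acc3 = weight(0, 2) * win[[0, 2]];
    // NEW: separate multiplications and additions instead of mul_add()
    acc1 = acc1 + weight(1, 0) * win[[1, 0]];
    acc2 = acc2 + weight(1, 1) * win[[1, 1]];
    acc3 = acc3 + weight(1, 2) * win[[1, 2]];
    acc1 = acc1 + weight(2, 0) * win[[2, 0]];
    acc2 = acc2 + weight(2, 1) * win[[2, 1]];
    acc3 = acc3 + weight(2, 2) * win[[2, 2]];
    acc1 + acc2 + acc3
};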

When doing so, Intel CPU owners may be surprised by the results. What you are observing actually comes from floating-point arithmetic operations being artificially slowed down by a hardware limitation, which we will take care of in the next chapter.


  1. Some recent Intel CPU cores, not yet mainstream at the time of writing, feature a dedicated adder in addition to their two main ALUs, each of which can perform FMA, MUL or ADD/SUB. These CPUs can therefore perform two FMA/MULs and one ADD/SUB per cycle, or three ADD/SUBs per cycle.

  2. STENCIL_WEIGHTS is by definition a 3x3 array, and our stencil_iter custom iterator is carefully written in such a way that the compiler can trivially tell that the output windows are 3x3 arrays too.