Analyzing machine code

Our optimization efforts at the heart of the simulation have taken us to the point where we are computing output data points at a pretty high rate compared to what we started with:

run_simulation/128x64/total16384/image1/compute
                        time:   [88.342 ms 88.415 ms 88.469 ms]
                        thrpt:  [1.5171 Gelem/s 1.5180 Gelem/s 1.5193 Gelem/s]

run_simulation/512x256/total1024/image1/compute
                        time:   [83.000 ms 84.254 ms 85.751 ms]
                        thrpt:  [1.5652 Gelem/s 1.5930 Gelem/s 1.6171 Gelem/s]

run_simulation/2048x1024/total64/image1/compute
                        time:   [135.99 ms 138.66 ms 140.84 ms]
                        thrpt:  [952.99 Melem/s 967.97 Melem/s 986.98 Melem/s]

run_simulation/4096x2048/total16/image1/compute
                        time:   [221.10 ms 221.60 ms 222.27 ms]
                        thrpt:  [603.84 Melem/s 605.68 Melem/s 607.05 Melem/s]

But there are two questions that we should be asking ourselves right now:

  • How far are we from peak hardware performance at small problem scales?
  • Why did performance improve much less for the larger problem sizes than it did for the smaller ones?

In this chapter, we will investigate the first question. And in doing so, we will also prepare the way for subsequent chapters, which will answer the second question.

Counting clock cycles

To put the small-scale numbers into the perspective of raw hardware capabilities, let’s look at some execution statistics for the 512x256 configuration, which performs best at the moment:

$ perf stat cargo bench --bench simulate -- 'run_simulation/512x256/total1024/image1/compute$'

       [ ... benchmark output ... ]

      28 934,68 msec task-clock                       #    1,379 CPUs utilized
         11 660      context-switches                 #  402,977 /sec
          1 076      cpu-migrations                   #   37,187 /sec
        199 617      page-faults                      #    6,899 K/sec
295 945 818 356      instructions                     #    2,44  insn per cycle
                                                      #    0,02  stalled cycles per insn     (80,01%)
121 371 248 616      cycles                           #    4,195 GHz                         (79,98%)
  5 299 370 710      stalled-cycles-frontend          #    4,37% frontend cycles idle        (79,98%)
 22 923 089 669      branches                         #  792,236 M/sec                       (80,03%)
    208 598 966      branch-misses                    #    0,91% of all branches             (80,00%)

The two most important pieces of information here are that…

  • We’re executing more than 2 instructions per clock cycle, which is the maximum execution rate for most kinds of CPU instructions (loads, multiplications, FMAs…). This suggests that we manage to execute instructions of different types in parallel, which is good.
  • Our average CPU clock rate when executing this benchmark is 4.2 GHz, which, when divided by the output production rate, tells us that it takes about 2.6 CPU clock cycles to produce one output data point.

We also know that this Zen 2 CPU uses AVX vectorization and that we work with single-precision floating-point operations, which means that we process 256/32 = 8 outputs per inner loop iteration. From this, we can deduce that it takes about 2.6 × 8 ≈ 21 clock cycles to produce one vector of 8 output data points, i.e. to run one inner loop iteration.
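
As a sanity check, here is that back-of-the-envelope arithmetic spelled out as a tiny Rust program, using the measured throughput and clock rate quoted above (rounded values):

fn main() {
    // Measured values from criterion and perf stat for the 512x256 benchmark
    let throughput_gelem_per_s = 1.59; // output data points per nanosecond
    let clock_ghz = 4.195; // CPU clock cycles per nanosecond
    let lanes = 256.0 / 32.0; // f32 lanes per AVX vector

    let cycles_per_point = clock_ghz / throughput_gelem_per_s;
    let cycles_per_vector = cycles_per_point * lanes;
    // Prints roughly "2.6 cycles per point, 21.1 cycles per vector"
    println!("{cycles_per_point:.1} cycles per point, {cycles_per_vector:.1} cycles per vector");
}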

But is that a lot? A little? To know this, we need to take a look at our workload.

Setting the goal

Let’s take a look at the current version of our simulation loop:

for (win_u, win_v, out_u, out_v) in stencil_iter(start, end) {
    // Access the SIMD data corresponding to the center concentration
    let u = win_u[STENCIL_OFFSET];
    let v = win_v[STENCIL_OFFSET];

    // Compute the diffusion gradient for U and V
    let gradient = |win: ArrayView2<_>| {
        let weight = |i: usize, j: usize| Vector::splat(STENCIL_WEIGHTS[i][j]);
        let mut acc1 = weight(0, 0) * win[[0, 0]];
        let mut acc2 = weight(0, 1) * win[[0, 1]];
        let mut acc3 = weight(0, 2) * win[[0, 2]];
        acc1 = weight(1, 0).mul_add(win[[1, 0]], acc1);
        acc2 = weight(1, 1).mul_add(win[[1, 1]], acc2);
        acc3 = weight(1, 2).mul_add(win[[1, 2]], acc3);
        acc1 = weight(2, 0).mul_add(win[[2, 0]], acc1);
        acc2 = weight(2, 1).mul_add(win[[2, 1]], acc2);
        acc3 = weight(2, 2).mul_add(win[[2, 2]], acc3);
        acc1 + acc2 + acc3
    };
    let full_u = gradient(win_u);
    let full_v = gradient(win_v);

    // Compute SIMD versions of all the float constants that we use
    let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
    let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
    let feedrate = Vector::splat(options.feedrate);
    let killrate = Vector::splat(options.killrate);
    let deltat = Vector::splat(options.deltat);
    let ones = Vector::splat(1.0);

    // Flush overly small values to zero
    fn flush_to_zero<const WIDTH: usize>(x: Simd<f32, WIDTH>) -> Simd<f32, WIDTH>
    where
        LaneCount<WIDTH>: SupportedLaneCount,
    {
        const THRESHOLD: f32 = 1.0e-15;
        x.abs()
            .simd_lt(Simd::splat(THRESHOLD))
            .select(Simd::splat(0.0), x)
    }

    // Compute the output values of u and v
    let uv_square = u * v * v;
    let du = diffusion_rate_u.mul_add(full_u, (ones - u).mul_add(feedrate, -uv_square));
    let dv = diffusion_rate_v.mul_add(full_v, v.mul_add(-(feedrate + killrate), uv_square));
    *out_u = flush_to_zero(du.mul_add(deltat, u));
    *out_v = flush_to_zero(dv.mul_add(deltat, v));
}

If we think about the minimal amount of SIMD processing work that our CPU is doing for each output computation, i.e. the work that is dependent on a particular sequence of inputs and cannot be deduplicated by the compiler across loop iterations, we see that…

  • We load 9 input vectors for each species, 9x2 = 18 overall.
    • Of these 9x2 inputs, 6x2 are shared with the previous computation due to our sliding window pattern. For example, the central value of a particular input window is the left neighbor of the next input window. But for register pressure reasons that will be covered later in this chapter, the compiler is unlikely to be able to leverage this property in order to avoid memory loads on the vast majority of x86 CPUs.
  • We compute 3 multiplications for the diffusion gradient of each species and 2 more in the uv_square computation, so 3x2 + 2 = 8 multiplications overall.
  • For each species we compute 6 FMAs within the diffusion gradient, 2 more in du/dv, and 1 more for the final u/v output, so 9x2 = 18 FMAs overall.
  • We compute two additions for the diffusion gradient of each species and one subtraction for the (ones - u) part of du. Therefore we expect a total of 5 runtime ADD/SUB operations. We don’t expect more additions/subtractions/negations, even though the code would suggest that there should be more of them happening, because…
    • The -(feedrate + killrate) computation should be automatically extracted from the loop by the compiler because it operates on loop-invariant inputs and always produces the same result.
    • The negative -uv_square addend in the FMA of du can be handled by a special form of the x86 FMA instruction called FMSUB.
  • For each u/v output, we perform a flush-to-zero operation that involves an absolute value, a SIMD comparison and a SIMD masked selection. That’s at least 3x2 = 6 more SIMD operations.
  • Finally we store 1 output vector for each species, so we have 2 stores overall.

So we know that, as a bare minimum, assuming the compiler did everything right, we need to perform 57 SIMD operations per inner loop iteration, i.e. for each vector of 8 output data points. And if we account for the fact that x86 is a complex instruction set where most arithmetic instructions can also load an input from memory (though none of them can store their result to memory), we reach the conclusion that these operations cannot be encoded into fewer than 57 - 18 = 39 x86 instructions.
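
For the record, here is the same tally written out as a small sanity-check program; the numbers simply restate the bullet points above:

fn main() {
    let loads = 9 * 2; // 9 input vectors per species
    let muls = 3 * 2 + 2; // gradient accumulator initialization + uv_square
    let fmas = 9 * 2; // 6 per gradient + 2 in du/dv + 1 per final output
    let adds_subs = 2 * 2 + 1; // gradient reductions + (ones - u)
    let flushes = 3 * 2; // abs + compare + select, per species
    let stores = 2; // one output vector per species

    let simd_ops = loads + muls + fmas + adds_subs + flushes + stores;
    // Loads can be folded into the memory operands of arithmetic instructions
    let min_x86_instructions = simd_ops - loads;
    assert_eq!((simd_ops, min_x86_instructions), (57, 39));
}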

There has to be more to the actual compiler-generated code for the inner loop, however, because our simulation has many constant inputs:

  • The STENCIL_WEIGHTS matrix contains three distinct values: 0.25, 0.5 and -3.0.
  • DIFFUSION_RATE_U and DIFFUSION_RATE_V are two other distinct constants, 0.1 and 0.05.
  • We have three runtime-defined parameters whose value is unknown to the compiler, feedrate, killrate and deltat. deltat and feedrate are used as-is, while killrate is used as part of the -(feedrate + killrate) expression.
  • We need the constant 1.0 for the (ones - u) computation.
  • We need THRESHOLD and the constant 0.0 for the flush_to_zero computation.

That’s 11 constants, which is a lot for the majority of x86 CPUs that do not implement AVX-512 but only AVX. Indeed, the AVX instruction set only features 16 architectural vector registers and its vector operations only support inputs from memory and these vector registers. Which means that if we keep all constant state listed above resident in registers, we only have 5 registers left for the computation, and that is not a lot when we aim to implement the computation in a manner that exhibits a high degree of instruction-level parallelism.

As a result, compilers will usually not try to keep all of these constant values resident in registers. They will only keep some of them, store the other ones in memory at the start of update(), and reload them on each loop iteration. This register spilling, as it is called, means that our inner loop will have more memory loads than it strictly needs by the above definition, in exchange for performing other computations more efficiently using the CPU registers that this frees up.

Finally, so far we have only focused on the computation of one output data point, but that computation appears within a loop. And managing that loop, in turn, will require more instructions. The minimal number of CPU instructions required to manage a simple 1D loop is a loop counter increment, a comparison that checks if the counter has reached its final value, and a conditional branch that takes us back to the beginning of the loop if we’re not done yet.

We could follow this simple logic in our stencil_iter() iterator implementation by tracking a single offset from the start of each input and output dataset, which happens to be the same for all of them. And indeed, an earlier version of stencil_iter() did just that. But due to arcane limitations of x86 memory operands, it is surprisingly faster on x86 CPUs to have a loop that is based on 4 pointers (two input pointers, two output pointers) that each get incremented on each loop iteration, because this speeds up the indexing of the actual input data later on, in the diffusion gradient computation. Which is why stencil_iter() does that instead, as sketched below.
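
Here is a much-simplified, hypothetical sketch of what such a pointer-based loop looks like (illustrative names and a placeholder computation only; the real stencil_iter() additionally handles 2D windowing and the safety invariants of the underlying unsafe code):

// Hypothetical, much-simplified sketch of the pointer-based iteration scheme:
// all four pointers get bumped on every iteration, so that later address
// computations only need to add a row/column offset to one of them.
unsafe fn pointer_based_loop(
    mut in_u: *const f32,
    mut in_v: *const f32,
    mut out_u: *mut f32,
    mut out_v: *mut f32,
    len: usize,
) {
    let end_u = out_u.add(len);
    while out_u < end_u {
        // ... the actual simulation update would read around in_u/in_v here ...
        *out_u = *in_u;
        *out_v = *in_v;
        in_u = in_u.add(1);
        in_v = in_v.add(1);
        out_u = out_u.add(1);
        out_v = out_v.add(1);
    }
}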

When all these things are taken into account, it is actually quite impressive that according to perf stat, our program only executes an average of 21 * 2.44 = 51.24 instructions per output data vector. That is only about 6 CPU instructions more than the minimum of 39 + 4 + 2 = 45 instructions required to perform the basic computation and manage the loop pointers. So we can be quite confident that the compiler produced rather reasonable machine code for our computation.

But confidence is not enough when we try to reach peak performance, so let’s take a deeper look at the hot CPU code path with perf record.

Analyzing execution

Collecting data with perf

We can have perf analyze where CPU cycles are spent using perf record and perf report:

perf record -- cargo bench --bench simulate -- 'run_simulation/512x256/total1024/image1/compute$' --profile-time=100  \
&& perf report

[Figure: Profile of the microbenchmark]

As you can see, 94% of our CPU time is spent in the AVX + FMA version of the update() function, as we would expect. The remaining time is spent in other tasks like copying back top values to the bottom and vice-versa (1.68%), and nothing else stands out in the CPU profile.

We can then display annotated assembly for the update function by pressing the a key; the hottest instructions will be highlighted, which lets us easily locate the inner loop:

[Figure: Annotated assembly of the microbenchmark]

What are we looking at here? Recall that in a nutshell, perf record works by periodically stopping the program and querying which instruction the CPU is currently processing. From this information, perf annotate computes a histogram of the assembly instructions where CPU time is most often spent. And what we are looking at here is this histogram, with color coding added to highlight the CPU instructions that the program was most often executing when samples were taken, which are therefore the most likely to be a performance bottleneck.

Now, we should be careful about over-interpreting this data, because “which instruction the CPU is currently processing” is a trick question on a modern superscalar, out-of-order and deeply pipelined CPU. perf and the underlying CPU performance monitoring infrastructure try to avoid producing a biased answer by using various techniques, such as randomly attributing a particular sample to any of the instructions that were actively executing at sampling time, but these techniques are not foolproof.

To give two examples…

  • Consecutive pairs of comparison and conditional jump x86 instructions are actually processed as a single operation by the backend of all current x86 CPUs. Thus, an imbalance in the sample counts attributed to the comparison and to the jump is to be expected, as the CPU artificially spreads the cycle count between these two instructions.
  • When an x86 instruction has a memory operand, it is conversely split into two operations at the backend level: one memory load and one operation that works from registers. Therefore, it is normal if the measured overhead of such an instruction is higher than that of an instruction that only has register operands, because the memory operand does more work.

But overall, we can still get significant mileage out of these numbers by following the general intuition that if some instructions get a high percentage of samples in a certain region of assembly, there are likely hot instructions around this location in the program binary.

So in the example above, vmulps might not actually be the hottest CPU instruction; the lea above it or the vfmadd231ps below it might be instead.


In any case, without getting to the level of fine-grained analysis where the above considerations are a problem, we can rather easily figure out what control flow path is taken by the inner loop of the simulation by following the trail of high sample frequencies:

  • We know that every inner loop iteration starts at location 168:, as there is negligible sample coverage before that.
  • We know that the jb 2aa conditional branch right after location 286: is usually taken, because the assembly after it has very little sample coverage. Which suggests that this assembly belongs to the outer loop over lines of output, not the inner loop over storage cells within a line. Inner loop control flow must continue at location 2aa:.
  • We know that the je 160 conditional branch after location 2aa: is rarely if ever taken, because there is no sample coverage for the associated assembly at target location 160:.
  • We know that the jne 2d0 conditional branch coming right after that is rarely if ever taken, because there is no sample coverage at the associated 2d0: location.
  • And then the hot code path reaches a jmp 168 jump instruction that takes us back to the beginning of the hot code path, starting the next inner loop iteration.

Now, there are a couple of worrying things in this control flow which suggest that our unsafe iterator implementation could be improved a little to hint the compiler into generating better code:

  • The two never-taken branches suggest that the code performs a check for a condition that should always be true, especially as they repeatedly test the value of the r8 register which never changes in the inner loop. In other words, the compiler did not manage to remove or deduplicate a redundant check.
  • To optimize L1i cache locality, compilers try to make the hottest code path follow a straight line, and to make the exceptional path jump away from this straight line. But here outer loop iterations are on the straight line path, and inner loop iterations jump around it. This will not incur a big performance penalty on modern CPUs, but it does indicate that the compiler did not understand the relative likelihood of the inner and outer loop branches. Which in turn can negatively affect generated code if the compiler takes a gamble based on expected branching frequency that turns out to be wrong.

But so far that’s just a couple of suboptimal logic instructions in what is otherwise a sea of SIMD operations, so fixing these little flaws seems unlikely to bring a sizeable performance benefit, and we will therefore save that work for a rainy day.


Anyway, given the above control flow analysis, we can extract the assembly that actually gets executed by the inner loop, which boils down to this:

 1971 │168:   mov          r10,r15
 2006 │       shl          r10,0x5
 1885 │       lea          rdx,[r14+r10*1]
 2049 │       mov          rbx,rbp
 3900 │       shl          rbx,0x5
 1969 │       vmovaps      ymm11,YMMWORD PTR [rbx+rdx*1]
 2213 │       lea          r9,[r12+r10*1]
 2126 │       shl          r15,0x6
 2115 │       lea          rsi,[r14+r15*1]
 4263 │       vmulps       ymm13,ymm3,YMMWORD PTR [r14+r15*1]
 2135 │       vfmadd231ps  ymm13,ymm4,YMMWORD PTR [rbx+rsi*1]
 2140 │       vmulps       ymm14,ymm4,YMMWORD PTR [r14+r10*1]
 2123 │       vmulps       ymm15,ymm3,YMMWORD PTR [r14]
 2053 │       vfmadd231ps  ymm15,ymm4,YMMWORD PTR [r14+rbx*1]
 4185 │       vmovaps      ymm12,YMMWORD PTR [rbx+r9*1]
 1972 │       shl          rbp,0x6
 2498 │       vfmadd231ps  ymm15,ymm3,YMMWORD PTR [r14+rbp*1]
 5187 │       vfmadd231ps  ymm14,ymm11,ymm5
 5192 │       vfmadd231ps  ymm14,ymm4,YMMWORD PTR [rbp+rdx*1+0x0]
 2617 │       vaddps       ymm14,ymm15,ymm14
 5800 │       vfmadd231ps  ymm13,ymm3,YMMWORD PTR [rbp+rsi*1+0x0]
 5820 │       vaddps       ymm13,ymm14,ymm13
 6008 │       lea          rdx,[r12+r15*1]
11731 │       vmulps       ymm14,ymm3,YMMWORD PTR [r12+r15*1]
 6031 │       vfmadd231ps  ymm14,ymm4,YMMWORD PTR [rbx+rdx*1]
 5851 │       vmulps       ymm15,ymm4,YMMWORD PTR [r12+r10*1]
11699 │       vmulps       ymm6,ymm3,YMMWORD PTR [r12]
11724 │       vfmadd231ps  ymm6,ymm4,YMMWORD PTR [r12+rbx*1]
 5816 │       vfmadd231ps  ymm6,ymm3,YMMWORD PTR [r12+rbp*1]
 3904 │       vfmadd231ps  ymm15,ymm12,ymm5
 7633 │       vfmadd231ps  ymm15,ymm4,YMMWORD PTR [rbp+r9*1+0x0]
 7676 │       vaddps       ymm6,ymm15,ymm6
 3823 │       vfmadd231ps  ymm14,ymm3,YMMWORD PTR [rbp+rdx*1+0x0]
 2438 │       vaddps       ymm6,ymm14,ymm6
 4872 │       vmulps       ymm14,ymm11,ymm12
 4801 │       vmulps       ymm14,ymm12,ymm14
 2489 │       vbroadcastss ymm15,DWORD PTR [rip+0xfffffffffff716bd]        # 2b218 <core::num::dec2flt::table::POWER_OF_FIVE_128+0x2b48>
 2296 │       vsubps       ymm15,ymm15,ymm11
 4531 │       vfmsub213ps  ymm15,ymm0,ymm14
 4419 │       vfmadd231ps  ymm15,ymm7,ymm13
 2331 │       vfmadd231ps  ymm14,ymm12,ymm2
 2303 │       vfmadd231ps  ymm14,ymm8,ymm6
 4674 │       vfmadd213ps  ymm15,ymm1,ymm11
 4421 │       vandps       ymm6,ymm15,ymm9
 2429 │       vcmpnltps    ymm6,ymm6,ymm10
 2471 │       vandps       ymm6,ymm15,ymm6
 4661 │       vmovaps      YMMWORD PTR [r13+0x0],ymm6
 4453 │       vfmadd213ps  ymm14,ymm1,ymm12
 2228 │       vandps       ymm6,ymm14,ymm9
 2358 │       vcmpnltps    ymm6,ymm6,ymm10
 2278 │       vandps       ymm6,ymm14,ymm6
 2391 │       vmovaps      YMMWORD PTR [rcx],ymm6
 4560 │       add          r13,0x20
 2164 │       add          r12,0x20
 2025 │       add          r14,0x20
 3874 │       add          rcx,0x20
 3888 │286:   cmp          rcx,rdi
      │     ↓ jb           2aa  # Skipped outer loop code path below this...
 2134 │2aa:   test         r8,r8  # ...to get back on the main path
   43 │     ↑ je           160
 1970 │       cmp          r8,0x2
   40 │     ↓ jne          2d0
 1946 │       mov          r15,QWORD PTR [rsp+0x8]
 1970 │       mov          rbp,QWORD PTR [rsp+0x10]
 3792 │     ↑ jmp          168

…and now that we have it, let’s see what useful information we can get out of it.

Analyzing SIMD instructions

If we look at the actual mix of generated assembly instructions compared to our theoretical predictions, we see that…

  • As predicted, we have 8 SIMD multiplications.
    • 6 of these have a memory operand (easily recognizable as a YMMWORD PTR assembly argument), which means that they load input data, and that identifies them as initializing the accumulator of one of the ILP streams within the stencil computation.
    • The other 2 multiplications, which take only register arguments, must by way of elimination come from the uv_square computation.
  • As predicted, we have 18 FMA operations.
    • Only 10 of them have memory operands, instead of the 12 that we could have expected from the stencil computation. The reason is that the central u and v values are used in multiple places within the computation, so they are instead loaded by two dedicated vmovaps instructions to allow reuse. vmovaps can be used for either loading or storing data; the loading flavor is recognizable by the fact that the target register comes as the first argument and the source memory location as the second one.
    • As also predicted, we have a single VFMSUB, corresponding to the “addition” of -uv_square in the computation of du. Every other instruction is a VFMADD.
  • As predicted, we have 4 additions and 1 subtraction (the latter of which comes from the (ones - u) term of the simulation).
  • As hoped, our flush-to-zero logic translates into three SIMD instructions:
    • A vandps which computes the absolute value of our inputs by performing a logical AND with a precomputed bit mask where all bits are set except for the IEEE-754 sign bit. Note that this means our computation now depends on one more constant value, further increasing SIMD register pressure…
    • A vcmpnltps which produces a vector of fake floating-point numbers whose binary representation is set to all-one bits where the first input is not less than the second input, and all-zero bits otherwise.
    • A vandps which performs a logical AND between this mask vector and the original input, thus returning zero if the first input is less than the second input, and the original input value otherwise. (A scalar re-enactment of this three-instruction dance follows after this list.)
  • And finally, we have a pair of vmovaps that store back the simulation output to memory, which are recognizable by the fact that their YMMWORD PTR annotated memory operand comes first.
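
To make the flush-to-zero bit trickery easier to follow, here is a scalar re-enactment of that three-instruction sequence on a single f32. This is purely illustrative: the real code operates on whole SIMD vectors and keeps the mask and threshold constants in vector registers.

fn flush_to_zero_scalar(x: f32) -> f32 {
    const THRESHOLD: f32 = 1.0e-15;
    // vandps #1: clear the IEEE-754 sign bit to compute the absolute value
    let abs = f32::from_bits(x.to_bits() & !(1u32 << 31));
    // vcmpnltps: all-one bits if |x| is not less than THRESHOLD, all-zero bits otherwise
    let mask = if abs >= THRESHOLD { u32::MAX } else { 0 };
    // vandps #2: keep x where the mask is set, yield +0.0 elsewhere
    f32::from_bits(x.to_bits() & mask)
}

fn main() {
    assert_eq!(flush_to_zero_scalar(1.0e-20), 0.0);
    assert_eq!(flush_to_zero_scalar(-0.25), -0.25);
}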

This only leaves one SIMD instruction of unidentified purpose, a vbroadcastss, which loads a scalar value from memory and puts a copy of it in each lane of a SIMD register. Its most likely purpose is to load one of our constants from memory, which the compiler could not keep in a SIMD register without sacrificing too much instruction-level parallelism, as discussed above.

So far, it looks like the compiler is doing a pretty great job with the SIMD code that we gave it!

Analyzing non-SIMD logic

If we now take a look at the surrounding non-SIMD logic, we first find the basic inner loop iteration logic discussed earlier, based on 4 pointer increments instead of 4 base pointers sharing one single global offset because, curiously enough, this is faster on x86…

4560 │       add          r13,0x20
2164 │       add          r12,0x20
2025 │       add          r14,0x20
3874 │       add          rcx,0x20
3888 │286:   cmp          rcx,rdi
     │     ↓ jb           2aa

…then the aforementioned pair of useless tests that should be investigated someday…

2134 │2aa:   test         r8,r8
  43 │     ↑ je           160
1970 │       cmp          r8,0x2
  40 │     ↓ jne          2d0

…and then a pair of integer loads from main memory, just before we go back to the top of the loop.

1946 │       mov          r15,QWORD PTR [rsp+0x8]
1970 │       mov          rbp,QWORD PTR [rsp+0x10]
3792 │     ↑ jmp          168

Loads from the stack most frequently originate from the compiler running out of registers and spilling some values to the stack. But we will check how these values are used later to make sure, because such loads can also come from problems with the compiler’s alias analysis, in which case the compiler feels forced to reload the same values from memory over and over again even though they haven’t changed, because it fails to prove that they haven’t changed.

If we now look at the top of the loop, we first notice 4 lea instructions, which abuse the CPU’s addressing units to perform integer computations. This is something to be vigilant about in x86 assembly, because these operations can trigger the slow path of the CPU’s addressing unit if they use overly complicated address calculations. But here they are simply being used to compute additions, which is as fast as an integer ADD, if unnecessarily fancy.

1885 │       lea          rdx,[r14+r10*1]
             # ... other instructions ...
2213 │       lea          r9,[r12+r10*1]
             # ... other instructions ...
2115 │       lea          rsi,[r14+r15*1]
             # ... other instructions ...
6008 │       lea          rdx,[r12+r15*1]

By looking around, we can see that these integer additions are used to work around a limitation of x86 addressing modes:

  • The position of an input within a 3x3 window is given by the sum of three values, namely the pointer to the top-left corner of the input window (which we maintain as part of the loop logic), the offset associated with the target row, and the offset associated with the target column.
  • But an x86 memory operand can only compute the sum of two values (a base register plus a scaled index register), so we cannot compute the sum of these three values inside of a single memory operand (see the sketch after this list).
  • However, we can deduplicate some computations because we are fetching multiple values from the same row/column of data in each data stream.
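
In Rust terms, this workaround boils down to something like the following hypothetical helper (illustrative names only, not actual code from the project):

// Hypothetical illustration: an x86 memory operand can only add two values
// (a base register plus a scaled index register), so the three-way sum below
// cannot be folded into a single memory operand. Precomputing column_top once
// per column, which is what the lea instructions do, leaves only a two-way sum
// for each actual data access.
unsafe fn load_window_element(
    window_top_left: *const f32,
    col_offset_bytes: usize,
    row_offset_bytes: usize,
) -> f32 {
    // lea: one explicit addition yields a pointer to the top of a column
    let column_top = window_top_left.byte_add(col_offset_bytes);
    // The remaining two-way sum fits in the load's memory operand
    // (in the real code this loads a full 32-byte SIMD vector, not a single f32)
    *column_top.byte_add(row_offset_bytes)
}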

The full proof that this is happening is that…

  • Of the 4 pointers r12, r13, r14 and rcx maintained by the core loop logic, we know that r13 and rcx are the output pointers because they are only used when storing data and (in the case of rcx) when checking if the loop is over. This means that r12 and r14 are the pointers to the top-left corner of our U and V input windows (we don’t know which is which).
  • In the vmulps operations, which we know from the source code to manipulate data from the first row of each 3x3 window of inputs, we see that the two input pointers r12 and r14 are used as-is and also summed with two values, r10 and r15, that stay the same across input addresses associated with the U and V windows. This tells us that these values contain the offsets of the central and right columns of data with respect to the top-left corner of an input window.
  • Our lea operations are identical to the nontrivial address calculations used for the vmulps operations. This tells us that the output of each lea represents the memory address of the value at the top of the central or right column of an input window. By combining those with the r12 and r14 pointers to the top of the left column, we get a total of 6 pointers that each represent the top of a column within the input window for one chemical species.
  • Each result from an lea is later summed with two other addends, stored in rbx and rbp, as part of x86 memory operands, to produce the input addresses of two FMA operations. We know that FMAs are used to add values from consecutive rows within a column of inputs, which tells us that these two quantities are row offsets. And indeed the same addends are each used in 6 different FMA address calculations, which makes sense because our U and V input data have the same row offset and we are summing 3 values within each row of each input.

There are also a couple of register-register move instructions. We will explain their purpose later. For now it suffices to say that these instructions are also almost free to process on the CPU backend side. Their only cost is on the CPU frontend side, which in inner loops like this is essentially bypassed by the faster micro-operation cache. So in such a small number, they are highly unlikely to have any measurable impact on our program’s runtime performance.

 1971 │168:   mov          r10,r15
              # ... other instructions ...
 2049 │       mov          rbx,rbp

And finally, we have some bitshift instructions, which perform the equivalent of multiplying the number stored inside of a register by 32 or 64, but faster…

 2006 │       shl          r10,0x5
              # ... other instructions ...
 3900 │       shl          rbx,0x5
              # ... other instructions ...
 2126 │       shl          r15,0x6
              # ... other instructions ...
 1972 │       shl          rbp,0x6

…but those are a little more worrying to see around because they come from those ugly memory loads that we mentioned earlier.

For example, if we take the last integer load from the loop…

 1970 │       mov          rbp,QWORD PTR [rsp+0x10]

…a copy of it is multiplied by 32…

 2049 │       mov          rbx,rbp
             # ... other instructions ...
 3900 │       shl          rbx,0x5

…while the original value is multiplied by 64…

 1972 │       shl          rbp,0x6

…and this gives us the rbx and rbp row offsets that are later used as the center and bottom row offsets of all input address calculations (see above).

Similarly, if we take previous integer load at the end of the loop…

 1946 │       mov          r15,QWORD PTR [rsp+0x8]

…a copy of it is multiplied by 32…

 1971 │168:   mov          r10,r15
              # ... other instructions ...
 2006 │       shl          r10,0x5

…while the original value is multiplied by 64…

 2126 │       shl          r15,0x6

…and this gives us the r10 and r15 column offsets that are later used as the center and right column offsets of all input address calculations (see above).

The reason why it is worrying to see this code is that the compiler should not be computing these offsets on every loop iteration. It should compute them once and reuse them across all iterations.

If the compiler does not manage to do this, it means that it failed to prove that the row/column stride of the input ArrayViews does not change between loop iterations. And as a result, the generated code has to keep doing the following on every iteration, rather than just once up front as sketched after this list…

  • Load the array row/column stride, which ndarray counts as a number of elements.
  • Multiply it by the element size, which is 32 bytes for AVX SIMD vectors, to get the offset of each row/column of the 3x3 input window in bytes.
  • Multiply the same stride by 64, which is twice the element size, to get the offset between the first and last row/column in bytes.
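
What we would like instead is for these conversions to be hoisted out of the hot loop and computed only once per update() call, along the lines of this hypothetical sketch (illustrative names only, not what the compiler or the project literally emits):

// Hypothetical sketch of the desired code shape: convert the ndarray stride
// (counted in elements) into byte offsets once, before entering the hot loop.
fn row_offsets_in_bytes(row_stride_in_elems: usize) -> (usize, usize) {
    const ELEM_SIZE: usize = 32; // one AVX vector of f32 is 32 bytes
    let center_row = row_stride_in_elems * ELEM_SIZE; // the "shl by 5", done once
    let bottom_row = 2 * row_stride_in_elems * ELEM_SIZE; // the "shl by 6", done once
    (center_row, bottom_row)
}

fn main() {
    // e.g. for a row of 64 SIMD vectors, the offsets are 2 KiB and 4 KiB
    assert_eq!(row_offsets_in_bytes(64), (2048, 4096));
}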

Isn’t there a way we could do better?

Fixing the problem

All of the above actually happens due to a tiny instruction at the top of the assembly snippet, which is not exercised a lot by the CPU and therefore not highlighted by perf annotate:

    2 │     → call grayscott_exercises::stencil_iter

Even though this code is not called often, the presence of this instruction means that the Rust compiler did not inline the constructor of our custom iterator into the update() function, and therefore had to perform some optimization passes without knowing what this function does.

As a result, the compiler’s optimization passes did not know the initial state of the iterator as precisely as they could have, and because of this they had to make some pessimistic assumptions, resulting in the poor code generation discussed above.

The author of this course tried to warn the compiler about the danger of not inlining this iterator code with the #[inline] attribute, which increases the compiler’s tendency to inline a particular function…

#[inline]  // <- This should have helped
fn stencil_iter<'data, const SIMD_WIDTH: usize>(
    // [ ... ]

…but alas, the compiler did not take this gentle hint. So we are going to need to switch to a less gentle hint instead:

#[inline(always)]  // <- This is guaranteed to help
fn stencil_iter<'data, const SIMD_WIDTH: usize>(
    // [ ... ]

With this single-line change, we get much better generated code, which contains exactly 6 non-SIMD instructions on the inner loop path: 4 pointer additions and one end-of-loop comparison + branch duo. That’s the way it should be.

Exercise

Implement this optimization and measure its performance impact on your machine. You should not expect a huge benefit because we are not bottlenecked by these integer operations, but it should still be a nice little performance boost worth taking.