# Subnormal issues
As you may have noticed during the last chapter, some optimizations benefit performance to a hardware-dependent degree. This is particularly true of one optimization that we have not carried out yet, which provides a tremendous performance gain on Intel hardware and a much more modest benefit on other hardware: subnormal number flushing.
## A subnormal math primer
The IEEE-754 floating-point number standard specifies that binary floating point
formats such as binary32 (also known as `f32` or `float`) and binary64 (also
known as `f64` or `double`) are encoded as bitfields composed of three parts:
- A single sign bit which states whether the number is positive or negative.[^1]
- A set of exponent bits which provide rough magnitude information (the smaller the exponent, the smaller the number).
- A set of fraction bits which provide the most significant digits of the number.
To be precise, a binary IEEE-754 number has binary representation
`seeeeeffffffffff` where…

- `s` denotes the sign bit
- `e` denotes an exponent bit
- `f` denotes a fraction bit
…and it will usually be interpreted as \( (-1)^s \times 2^{eeeee-BIAS}
\times 1.ffffffffff \) where BIAS is a constant chosen such that \( 2^0 = 1
\) falls roughly in the middle of the representable exponent range. Numbers
that follow this rule are called normal numbers.
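As a standalone illustration of this layout (the `decompose` helper below is made up for this example, not part of the simulation code), we can pull these bitfields out of an `f32` using its raw bit pattern. Note that the real binary32 format has 8 exponent bits and 23 fraction bits, more than the schematic `seeeeeffffffffff` above:

```rust
/// Hypothetical helper: split an f32 into its (sign, exponent, fraction)
/// bitfields, using the 1/8/23 bit layout of the binary32 format
fn decompose(x: f32) -> (u32, u32, u32) {
    let bits = x.to_bits();
    let sign = bits >> 31; // 1 sign bit
    let exponent = (bits >> 23) & 0xff; // 8 exponent bits
    let fraction = bits & 0x7f_ffff; // 23 fraction bits
    (sign, exponent, fraction)
}

fn main() {
    // 1.0 is a normal number: its biased exponent equals BIAS = 127 and its
    // fraction is zero, the implicit leading 1 carries the whole value
    assert_eq!(decompose(1.0), (0, 127, 0));
    // -2.0 only differs by its sign bit and an exponent one above the bias
    assert_eq!(decompose(-2.0), (1, 128, 0));
}
```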
This binary representation has one benefit: the resulting number has one more significant binary digit than the number of digits that are actually stored in CPU registers and other storage devices. But there is a price to pay for this cleverness, which is that one very important and widely used mathematical number cannot be encoded, namely zero.
Zero is therefore handled using a special case. When `eeeee` is `00000`, the
rule changes, and the remaining bits are now interpreted as \( (-1)^s \times
2^{-BIAS} \times 0.ffffffffff \), with a leading zero in the last term of the
product.[^2]
Numbers that follow this alternate rule are called subnormal numbers, and handling a large set of tiny numbers in this way, instead of only special-casing zero, makes floating-point numbers behave a little more nicely from a mathematical perspective (e.g. they follow the Sterbenz lemma like mathematical real numbers do, instead of violating it).
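A quick standalone sketch of that nicer behavior (unrelated to the simulation code): thanks to this gradual underflow, subtracting two nearby tiny numbers yields an exact subnormal result instead of being rounded to zero:

```rust
fn main() {
    // The smallest positive normal f32 is 2^-126…
    let min_normal = f32::MIN_POSITIVE;
    assert!(min_normal.is_normal());
    // …and halving it lands in the subnormal range instead of rounding to zero
    let tiny = min_normal / 2.0;
    assert!(tiny.is_subnormal());
    // Gradual underflow at work: for these values x != y guarantees x - y != 0,
    // because the exact difference 2^-127 is representable as a subnormal
    let x = 1.5 * f32::MIN_POSITIVE;
    let y = f32::MIN_POSITIVE;
    assert_eq!(x - y, 0.5 * f32::MIN_POSITIVE);
}
```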
But sadly, hardware manufacturers hate this weird trick, and they will often pessimize operations on subnormal numbers to some manufacturer-dependent degree. Intel stands out from the competition by providing subnormal multiplication and FMA implementations whose peak throughput is no less than ~100x smaller than that of the equivalent operations on normal numbers.
And the reason why we care about all this is that any numerical solution of a differential equation that features spatial diffusion or exponential time decay (both of which exist in our Gray-Scott system) may eventually, and durably, end up manipulating numbers that fall into the subnormal range. This means that, as currently written, our Gray-Scott reaction simulation has highly suboptimal runtime performance on Intel CPUs.
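To see how quickly exponential decay can drive values into this range, here is a standalone sketch (the halving loop is a stand-in for our actual update rule, not taken from the simulation code):

```rust
fn main() {
    // A stand-in for exponential time decay: halve a unity-magnitude value
    // once per simulated time step
    let mut concentration = 1.0f32;
    let mut steps = 0u32;
    while !concentration.is_subnormal() {
        concentration *= 0.5;
        steps += 1;
    }
    // After only 127 halvings, the value enters the subnormal range, and every
    // subsequent operation on it pays the hardware's subnormal penalty
    assert_eq!(steps, 127);
}
```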
## An easy workaround?
No competent hardware manufacturer would introduce a performance trap of such an
epic magnitude without also providing an ugly way for developers to work around
it. In the case of Intel’s x86 ISA, this ugly workaround comes in the form of
two flags in the `MXCSR` CPU control register:
- `FTZ`, which stands for Flush To Zero, forces any arithmetic operation that
  should output a subnormal number to output zero instead.
- `DAZ`, which stands for Denormals Are Zero, forces arithmetic operations to
  treat any subnormal input as if it were zero.
Together, these flags effectively disable subnormal arithmetic support, which eliminates the aforementioned overhead and brings CPU arithmetic performance back to vendor-marketing-material levels when subnormal numbers get involved… at the expense of making some intermediary results much less precise, possibly to the point of turning the final result into meaningless garbage for some categories of floating-point computations. Thankfully, our Gray-Scott reaction simulation does not belong to those.
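To make the precision hazard concrete, here is a small software model of what `FTZ`-style flushing does to results (the `ftz` helper is hypothetical, real flushing happens inside the CPU's arithmetic units):

```rust
/// Hypothetical software model of FTZ: flush subnormal results to zero
fn ftz(x: f32) -> f32 {
    if x.is_subnormal() { 0.0 } else { x }
}

fn main() {
    let x = 1.5 * f32::MIN_POSITIVE;
    let y = f32::MIN_POSITIVE;
    // x and y are distinct numbers…
    assert_ne!(x, y);
    // …yet with flushing enabled their difference becomes zero, which is the
    // kind of precision loss that can snowball in unlucky computations
    assert_eq!(ftz(x - y), 0.0);
    // Normal results pass through unchanged
    assert_eq!(ftz(1.0), 1.0);
}
```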
But the fun does not end there. All popular optimizing compilers like to assume
that CPU ALUs are configured in the default way, an assumption that is also
encoded in the C ABI of all popular operating systems. This means that in all
popular AoT- and JIT-compiled languages, manipulating CPU state that affects
ALU semantics, like `MXCSR`, is Undefined Behavior. It may have arbitrarily bad
consequences that affect the entire application, and even if you get lucky and
it does end up working today, it still has a chance to break if you ever update
your compiler or any other library of your software stack.[^3]
A way to disable this assumption may be provided, as in the case of C/C++ where
the `-ffast-math` option of GCC and clang has this effect among many
others[^4]… but that “among many others” part is actually problematic. To get
permission to manipulate `MXCSR` from the compiler, you must also enable a lot
of other numerically unstable compiler optimizations, each of which may turn
your application’s numerical output into meaningless garbage today… or provide
the illusion of working fine on the test cases that you have today, then break
your numerical stability the next time you update your compiler or execute your
code in an unusual input configuration.
Also, even when you do enable `-ffast-math` on your side, you must still be
very careful not to call into any third-party library that uses the C ABI
without the `-ffast-math` flag while your `MXCSR` CPU register is in a
non-default state. Otherwise the Undefined Behavior demon will be unleashed
again and Arbitrarily Bad Things may happen.
Facing this mess, the authors of the Rust compiler decided to punt on this
issue for an indefinite amount of time, until an Obviously Better Solution is
found. This means that at the moment, there is no way to enable the `FTZ` and
`DAZ` flags of the `MXCSR` register in Rust without triggering Undefined
Behavior at the LLVM and C ABI layer.
Overall, this means that as of today, we have two bad options when facing a subnormal math performance problem in Rust:

1. Rewrite your computation as an assembly snippet that sets up `MXCSR`
   appropriately at the beginning and resets it to its initial state at the end.
2. Accept the UB hazard in a scope that is as narrow as possible, then make
   blood sacrifices to the optimizer gods every day in the hope that `rustc`’s
   LLVM backend, which does not exploit this particular flavor of UB that much
   today, will not ever end up altering the deal any further.[^5]
To be honest, in production code, you shouldn’t use either of these options as they both massively degrade the maintainability and portability of your code. Instead you should encourage your local computing cluster to stop purchasing Intel CPUs until they finally fix this longstanding hardware bug. If every other major CPU manufacturer managed to have a subnormal math overhead that’s tiny compared to Intel’s (more like 2x in the very worst case), Intel can do it too.
But just for the fun of it, and also for the sake of fair comparison with the
execution times of other Gray-Scott school teachers and compilers[^6] who will
carelessly enable `-ffast-math`-like constructs without even telling you about
the underlying dangers, we are going to embrace the Undefined Behavior dark side
in this chapter and show you what the second option looks like.
## Do not try this in production
We can set up an RAII guard that disables denormal math for the duration of a particular program scope with the following code:
```rust
/// **DANGER:** This abstraction invokes UB and is exposed for the insane fun
/// of it only. Do not ever try this in a production program.
///
/// On x86_64, it momentarily disables subnormal math using forbidden MXCSR
/// operations until the current scope is exited. On other CPUs it does nothing.
struct DenormalsGuard {
    old_mxcsr: u32,
}

impl DenormalsGuard {
    /// Set up the forbidden denormals-disabling magic
    fn new() -> Self {
        #[expect(
            deprecated,
            reason = "Deprecated due to UB, which is exactly what our foolish selves are looking for here"
        )]
        #[cfg(target_arch = "x86_64")]
        unsafe {
            use std::arch::x86_64;
            let old_mxcsr = x86_64::_mm_getcsr();
            // Set FTZ flag (0x8000) and DAZ flag (0x0040)
            x86_64::_mm_setcsr(old_mxcsr | 0x8040);
            Self { old_mxcsr }
        }
        #[cfg(not(target_arch = "x86_64"))]
        Self { old_mxcsr: 0 }
    }
}

impl Drop for DenormalsGuard {
    fn drop(&mut self) {
        #[expect(
            deprecated,
            reason = "Deprecated due to UB, which is exactly what our foolish selves are looking for here"
        )]
        #[cfg(target_arch = "x86_64")]
        unsafe {
            std::arch::x86_64::_mm_setcsr(self.old_mxcsr);
        }
    }
}
```
And we can vainly try to minimize the LLVM optimizer’s anger by constructing
this guard in a scope that is as narrow as possible, without spending an
unacceptable amount of time getting and setting the value of the `MXCSR`
register. The start of the `update()` function looks like the right place here:
```rust
let _disable_denormals = DenormalsGuard::new();
```
…and that will greatly improve our performance on Intel hardware and slightly improve it on AMD hardware… at the unacceptable future maintainability cost of adding UB to our program, which is what Rust vainly aimed to save us from (just count how many times we are explicitly disabling a language UB-safety feature in the code above).
## Isn’t there another way?
Oh yes there are. Plenty of ways, in fact. But unfortunately none of them are
as general-purpose as the `MXCSR` approach, so you will need to spend a bit
more time thinking about what specific problem you are trying to solve and see
if any of these approaches apply:
- On current AMD CPUs, multiplication is affected by subnormals but FMA isn’t.
  It is therefore possible to eliminate the overhead of subnormal math, at the
  expense of a slight increase in instruction execution latency, by replacing
  every performance-critical use of `a * b` with `a.mul_add(b, 0.0)`. There are
  just two problems with this strategy:
  - Unless some IEEE-754 edge case that the author forgot about makes it
    illegal, the compiler is likely to notice that these two operations produce
    the same result and “helpfully” optimize the latter into the former someday.
  - Most importantly, the overhead of subnormal multiplication is much smaller,
    and arguably acceptable, on AMD hardware. Intel hardware is the real issue
    here, and unfortunately their FMA units are just as subnormals-impaired as
    the MUL ones.
- Some audio- and image-processing codebases add a tiny amount of random noise
  to their intermediary results and outputs, whose magnitude is chosen to keep
  said results out of the subnormal range. This trick may have acceptable
  overhead if you are doing lots of computations between two points where
  subnormals may appear, enough to amortize the costs of RNG and addition, but
  it does come at the expense of some precision loss.
- Our program could also detect subnormal (or soon-to-be-subnormal) values and
  manually turn them into zeros. This may sound expensive if you think about it
  naively…
  - Taking the absolute value of every element in the current SIMD vector
  - Checking if it’s below some safety margin above the subnormal range
  - …if so, conditionally assigning it to zero

  …but thankfully there is a more SIMD-friendly way to do this with masking:
  ```rust
  use std::simd::prelude::*;

  /// Flush overly small values to zero
  fn flush_to_zero<const WIDTH: usize>(x: Simd<f32, WIDTH>) -> Simd<f32, WIDTH>
  where
      LaneCount<WIDTH>: SupportedLaneCount,
  {
      const THRESHOLD: f32 = 10.0 * f32::MIN_POSITIVE;
      x.abs()
          .simd_lt(Simd::splat(THRESHOLD))
          .select(Simd::splat(0.0), x)
  }
  ```
- Or, you know, Intel could just fix their broken hardware someday, like pretty
  much everyone else in the CPU industry did ages ago… But good luck convincing
  them to do that.
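The noise-injection trick from the list above can be sketched as follows (all names and the constant offset here are made up for this example; real codebases typically use random or alternating-sign noise rather than a constant, which avoids biasing results in one direction):

```rust
// Hypothetical noise floor, chosen well above f32's subnormal range (~1.2e-38)
const ANTI_DENORMAL: f32 = 1.0e-20;

// One step of an exponential decay that would otherwise eventually produce
// subnormal values, with a tiny offset that keeps results in the normal range
fn decay_step(x: f32) -> f32 {
    x * 0.5 + ANTI_DENORMAL
}

fn main() {
    let mut x = 1.0f32;
    for _ in 0..1000 {
        x = decay_step(x);
    }
    // The value settles near 2 * ANTI_DENORMAL instead of going subnormal,
    // at the expense of a small absolute error on every step
    assert!(x.is_normal());
    assert!(x >= ANTI_DENORMAL);
}
```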
## Exercises
If you have the misfortune of running your code on an Intel CPU, challenge the
optimizer gods by adding a `DenormalsGuard` to your program, and assess how
that affects runtime performance.
Then try to see if you can get similar performance in a UB-free and
hardware-agnostic manner through cautious use of the `flush_to_zero` primitive
defined above, bearing in mind that…

- The most obvious place to do such flushing is where new values of `u` and `v`
  are assigned, as those values will later get reused as inputs to the next
  computation and you know that they should have a magnitude around unity.
- Because of nonlinear terms like `uv_square` in the computation, the example
  threshold given above may not be adequate. You may want to fine-tune a better
  threshold, through binary search across orders of magnitude if necessary.
- If you tune the threshold too high, you may get an over-eager flush to zero
  resulting in visible output artifacts. So you will want to check the output
  images from time to time when tuning the threshold up.
[^1]: …to which we owe the greatness of `-0.0`.

[^2]: …which is known as the number’s mantissa.

[^3]: To provide an example of a way in which this particular flavor of
    Undefined Behavior could break, consider what would happen if LLVM or a
    library that you are using added an optimization that momentarily modifies
    the value of the `MXCSR` register, under the assumption that it initially
    has the expected state specified by the C ABI, then restores that register
    back to its expected value.

[^4]: Technically, there are also `#pragma STDC` directives for that, but good
    luck finding a compiler where these directives actually do anything instead
    of being silently ignored.

[^5]: This sentence is not technically accurate, as the whole point of
    Undefined Behavior is that there never was a deal to begin with, but the
    meme placement opportunity was too good to resist.

[^6]: Whom the author will not name and shame as the nice person that he is,
    except for the Intel compiler, whose numerically unstable optimizations are
    exceptionally egregious.