Gray-Scott introduction
We are now ready to introduce the final boss computation of this course: the
Gray-Scott reaction simulation. In this chapter, you will be taken through a
rapid tour of the pre-made setup that is provided to you for the purpose of
input initialization, kernel benchmarking, and output HDF5 production. We will
then conclude this tour by showing how one would implement a simple, unoptimized
version of the simulation using ndarray.
Along the way, you will also get a quick glimpse of how Rust’s structs and methods work. We could not explore this area of the language within the course’s short format, but in a nutshell, structs and methods can be used to implement encapsulated objects, much like C++ classes, and that is what we do here.
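As a minimal, hypothetical taste of this pattern (unrelated to the simulation code that follows), a struct bundles data together and an `impl` block attaches methods to it, much like a C++ class with member functions:

```rust
/// A simple 2D point, standing in for any encapsulated data
struct Point {
    x: f64,
    y: f64,
}

impl Point {
    /// Associated function, comparable to a C++ constructor
    fn new(x: f64, y: f64) -> Self {
        Self { x, y }
    }

    /// Method taking the object by shared reference,
    /// like a const member function in C++
    fn norm(&self) -> f64 {
        (self.x * self.x + self.y * self.y).sqrt()
    }
}

fn main() {
    let p = Point::new(3.0, 4.0);
    println!("{}", p.norm()); // prints "5"
}
```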
Input initialization
In the Gray-Scott school that this course originates from, the reference C++ implementation of the Gray-Scott simulation would hardcode the initial input state of the simulation. For the sake of keeping things simple and comparable, we will do the same:
use ndarray::Array2;
/// Computation precision
///
/// It is a good idea to make this easy to change in your programs, especially
/// if you care about hardware portability or have ill-specified output
/// precision requirements.
///
/// Notice the "pub" keyword for exposing this to the outside world. All types,
/// functions... are private to the current code module by default in Rust.
pub type Float = f32;
/// Storage for the concentrations of the U and V chemical species
pub struct UV {
pub u: Array2<Float>,
pub v: Array2<Float>,
}
/// impl blocks like this let us add methods to types in Rust
impl UV {
/// Set up the hardcoded chemical species concentration
///
/// Notice the `Self` syntax which allows you to refer to the type for which
/// the method is implemented.
fn new(num_rows: usize, num_cols: usize) -> Self {
let shape = [num_rows, num_cols];
let pattern = |row, col| {
(row >= (7 * num_rows / 16).saturating_sub(4)
&& row < (8 * num_rows / 16).saturating_sub(4)
&& col >= 7 * num_cols / 16
&& col < 8 * num_cols / 16) as u8 as Float
};
let u = Array2::from_shape_fn(shape, |(row, col)| 1.0 - pattern(row, col));
let v = Array2::from_shape_fn(shape, |(row, col)| pattern(row, col));
Self { u, v }
}
/// Set up an all-zeroes chemical species concentration
///
/// This can be faster than `new()`, especially on operating systems like
/// Linux where all allocated memory is guaranteed to be initially zeroed
/// out for security reasons.
fn zeroes(num_rows: usize, num_cols: usize) -> Self {
let shape = [num_rows, num_cols];
let u = Array2::zeros(shape);
let v = Array2::zeros(shape);
Self { u, v }
}
/// Get the number of rows and columns of the simulation domain
///
/// Notice the `&self` syntax for borrowing the object on which the method
/// is being called by shared reference.
pub fn shape(&self) -> [usize; 2] {
let shape = self.u.shape();
[shape[0], shape[1]]
}
}
Double buffering
The Gray-Scott reaction is simulated by updating the concentrations of the U and V chemical species many times. This is done by reading the old concentrations of the chemical species from one array, and writing the new concentrations to another array.
We could create a new pair of concentration arrays every time we do this, but that would require one memory allocation per simulation step, which can be expensive. Instead, it is more efficient to use the double buffering pattern: we keep two versions of the concentration arrays, and on every step of the simulation, we read from one of the array slots and write to the other. Then we flip the roles of the two slots for the next simulation step.
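As a minimal, std-only illustration of this pattern (using plain `Vec`s and a hypothetical “double every value” step instead of the real simulation types), flipping between the two buffers can look like this:

```rust
/// Run a few double-buffered steps and return the final state
fn run_steps(num_steps: usize) -> Vec<f32> {
    // Two buffers and a flag tracking which one currently holds the source data
    let mut buffers = [vec![1.0f32; 4], vec![0.0f32; 4]];
    let mut src_is_1 = false;
    for _ in 0..num_steps {
        // Borrow both buffers at once, then pick the source and destination
        let [buf_0, buf_1] = &mut buffers;
        let (src, dst) = if src_is_1 { (&*buf_1, buf_0) } else { (&*buf_0, buf_1) };
        // Hypothetical "simulation step": double every value of the source
        dst.clear();
        dst.extend(src.iter().map(|x| x * 2.0));
        // Flip the role of the two buffers for the next step
        src_is_1 = !src_is_1;
    }
    // After the flip, the "source" slot holds the last-written state
    buffers[src_is_1 as usize].clone()
}

fn main() {
    // Starting from all-ones and doubling three times yields all-eights
    println!("{:?}", run_steps(3)); // prints "[8.0, 8.0, 8.0, 8.0]"
}
```

Only two allocations are ever made (plus the `clone()` for returning the result), no matter how many steps are run.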
We can translate this pattern into a simple encapsulated object…
/// Double-buffered chemical species concentration storage
pub struct Concentrations {
buffers: [UV; 2],
src_is_1: bool,
}
impl Concentrations {
/// Set up the simulation state
pub fn new(num_rows: usize, num_cols: usize) -> Self {
Self {
buffers: [UV::new(num_rows, num_cols), UV::zeroes(num_rows, num_cols)],
src_is_1: false,
}
}
/// Get the number of rows and columns of the simulation domain
pub fn shape(&self) -> [usize; 2] {
self.buffers[0].shape()
}
/// Read out the current species concentrations
pub fn current(&self) -> &UV {
&self.buffers[self.src_is_1 as usize]
}
/// Run a simulation step
///
/// The user callback function `step` will be called with two input UVs:
/// one containing the initial species concentration at the start of the
/// simulation step, and one to receive the final species concentration that
/// the simulation step is in charge of generating.
///
/// Notice the `&mut self` syntax for borrowing the object on which
/// the method is being called by mutable reference.
pub fn update(&mut self, step: impl FnOnce(&UV, &mut UV)) {
let [uv_0, uv_1] = &mut self.buffers;
if self.src_is_1 {
step(uv_1, uv_0);
} else {
step(uv_0, uv_1);
}
self.src_is_1 = !self.src_is_1;
}
}
…and then this object can be used to run the simulation like this:
// Set up the concentrations buffer
let mut concentrations = Concentrations::new(num_rows, num_cols);
// ... other initialization work ...
// Main simulation loop
let mut running = true;
while running {
// Update the concentrations of the U and V chemical species
concentrations.update(|start, end| {
// TODO: Derive new "end" concentration from "start" concentration
end.u.assign(&start.u);
end.v.assign(&start.v);
});
// ... other per-step action, e.g. decide whether to keep running,
// write the concentrations to disk from time to time ...
running = false;
}
// Get the final concentrations at the end of the simulation
let result = concentrations.current();
println!("u is {:#?}", result.u);
println!("v is {:#?}", result.v);
HDF5 output
The reference C++ simulation lets you write down the concentration of the V chemical species to an HDF5 file every N computation steps. This can be used to check that the simulation works properly, or to turn the evolving concentration “pictures” into a video for visualization purposes.
Following its example, we will use the hdf5 Rust crate, or more precisely its
hdf5-metno fork¹, to write data to HDF5 too,
using the same file layout conventions for interoperability. Here too, we will
use an encapsulated object design to keep things easy to use correctly:
use crate::data::{Float, UV};
use hdf5::{Dataset, File};
/// Mechanism to write down results to an HDF5 file
pub struct HDF5Writer {
/// HDF5 file handle
file: File,
/// HDF5 dataset
dataset: Dataset,
/// Number of images that were written so far
position: usize,
}
impl HDF5Writer {
/// Create or truncate the file
///
/// The file will be dimensioned to store a certain amount of V species
/// concentration arrays.
///
/// The `Result` return type indicates that this method can fail and the
/// associated I/O errors must be handled somehow.
pub fn create(file_name: &str, shape: [usize; 2], num_images: usize) -> hdf5::Result<Self> {
// The ? syntax lets us propagate errors from an inner function call to
// the caller, when we cannot handle them ourselves.
let file = File::create(file_name)?;
let [rows, cols] = shape;
let dataset = file
.new_dataset::<Float>()
.chunk([1, rows, cols])
.shape([num_images, rows, cols])
.create("matrix")?;
Ok(Self {
file,
dataset,
position: 0,
})
}
/// Write a new V species concentration table to the file
pub fn write(&mut self, result: &UV) -> hdf5::Result<()> {
self.dataset
.write_slice(&result.v, (self.position, .., ..))?;
self.position += 1;
Ok(())
}
/// Flush remaining data to the underlying storage medium and close the file
///
/// This should automatically happen on Drop, but doing it manually allows
/// you to catch and handle I/O errors properly.
pub fn close(self) -> hdf5::Result<()> {
self.file.close()
}
}
After adding this feature, our simulation code skeleton now looks like this:
// Set up the concentrations buffer
let mut concentrations = Concentrations::new(num_rows, num_cols);
// Set up HDF5 I/O
let mut hdf5 = HDF5Writer::create(file_name, concentrations.shape(), num_output_steps)?;
// Produce the requested amount of concentration tables
for _ in 0..num_output_steps {
// Run a number of simulation steps
for _ in 0..compute_steps_per_output_step {
// Update the concentrations of the U and V chemical species
concentrations.update(|start, end| {
// TODO: Derive new "end" concentration from "start" concentration
end.u.assign(&start.u);
end.v.assign(&start.v);
});
}
// Write down the current simulation output
hdf5.write(concentrations.current())?;
}
// Close the HDF5 file
hdf5.close()?;
Generic runner
Design
Right now, our simulation’s update function is a stub that simply copies the input concentrations to the output concentrations without actually changing them. At some point, we are going to need to change it and compute the real updated chemical species concentrations there.
However, we also know from experience with performance optimization that we will tweak this part of the code a lot. It would be nice if we could do this in a laser-focused function that is only concerned with computational work, so that we do not need to briefly remind ourselves how the buffer allocation and I/O works every time we change the number crunching logic.
Furthermore, another thing that experience teaches us is that we are going to want to measure our performance to see what positive or negative impact our code changes have. And unfortunately, HDF5 I/O is going to get in the way there:
- We can’t just keep writing new simulation steps to HDF5 storage until enough time has elapsed to compute an accurate estimate of the iteration time. On fast but small storage like tmpfs, we could run out of storage space before we get there.
- Unlike computational tasks, which are executed directly by the CPU, I/O requests go through the operating system kernel and one or more hardware controllers, which all attempt to process them in a clever way. As a result of this cleverness, multiple runs of an I/O-bound application may exhibit very different performance even when the application makes identical I/O requests. This “intrinsic” I/O performance variability makes it a lot harder to assess the “extrinsic” performance impact of our application-side changes.
- Storage devices have a huge performance range (for our sequential I/O pattern, throughput can go from tens of MB/s for distributed filesystems over a slow network, to tens of GB/s for RAIDs of state-of-the-art NVMe SSDs or tmpfs). During application development, we may not know in advance what kind of storage device we will eventually use, so it is useful to measure and tune our compute performance separately, postponing I/O optimizations until the specifics of the production I/O infrastructure are known.
Knowing this, we will actually want to write two versions of our simulation:
- A complete simulation program that writes down its output to HDF5
- A microbenchmark that performs simulation steps but does not store the results, which we will use to study and optimize our computational performance in isolation.
…but to ease maintenance, these programs should share as much code as possible.
Implementation
All concerns above can be addressed by breaking down our simulation logic into three chunks:
- The update function, which knows how to perform a single simulation step. Given initial (U, V) concentrations and a previously allocated (U, V) output buffer, this function computes updated concentrations and stores them in the output buffer. This is where most of our performance optimization work will happen later on.
- The image processing callback, which is invoked when a concentration image is produced. This is the part that differs between the microbenchmark and the full simulation: in the microbenchmark it does nothing², while in the full simulation it stores results to an HDF5 file.
- The simulation runner, which puts this all together by allocating concentration buffers and running the simulation for the user-requested number of steps, periodically producing output images. It delegates computational work to the aforementioned update function, and storage I/O (if any) to the aforementioned image processing callback.
We thus get the full simulation and microbenchmark to share the following common runner and update function, parametrized by a user-specified output image processing callback:
use crate::data::{Concentrations, Float, UV};
/// Simulation runner parameters
pub struct RunnerOptions {
/// Number of rows in the concentration table
pub num_rows: usize,
/// Number of columns in the concentration table
pub num_cols: usize,
/// Number of output images to be produced
pub num_output_images: usize,
/// Number of computation steps before producing an image
pub steps_per_image: usize,
/// Options controlling simulation updates
pub update: UpdateOptions,
}
/// Result type encoding possible simulation errors
///
/// In the CPU version of the simulation, HDF5 I/O is the only source of errors,
/// so we will hijack HDF5's Result type. But we do so via a typedef so that we
/// can later switch to a more general result type, when we get to GPU computing
/// which has to deal with _many_ other sources of error.
pub type Result<T> = hdf5::Result<T>;
/// Simulation runner, with a user-specified output processing function
pub fn run_simulation(
options: &RunnerOptions,
// Notice that we cannot use FnOnce here because the output processing
// callback will be called multiple times.
mut process_uv: impl FnMut(&UV) -> Result<()>,
) -> Result<()> {
// Set up chemical concentrations storage
let mut concentrations = Concentrations::new(options.num_rows, options.num_cols);
// Produce the requested amount of concentration tables
for _ in 0..options.num_output_images {
// Perform a number of simulation steps
for _ in 0..options.steps_per_image {
concentrations.update(|current, next| update(&options.update, current, next));
}
// Process a concentration table
process_uv(concentrations.current())?;
}
Ok(())
}
/// Simulation update parameters
pub struct UpdateOptions {
/// Speed at which U is added to the simulation and U, V and P are removed
pub feedrate: Float,
/// Speed at which V turns into P
pub killrate: Float,
/// Simulated time interval on each simulation step
pub deltat: Float,
}
/// Perform a simulation step
fn update(_options: &UpdateOptions, _current: &UV, _next: &mut UV) {
    // Here we replaced the previous copy with a todo!() to make sure that
    // until the proper simulation logic has been written down, the program
    // crashes instead of silently doing something wrong.
    todo!("This needs more code to work...")
}
The full simulation binary uses the image processing callback to save output images to an HDF5 file…
/// Global simulation options
struct Options {
/// Options controlling the simulation runner
runner: RunnerOptions,
/// Output file name
file_name: String,
}
fn main() -> Result<()> {
// TODO: Somehow configure simulation options
// Set up HDF5 I/O
let mut hdf5 = HDF5Writer::create(
&options.file_name,
[options.runner.num_rows, options.runner.num_cols],
options.runner.num_output_images,
)?;
// Run the simulation
run_simulation(&options.runner, |uv| {
// Write down the current simulation output
hdf5.write(uv)?;
Ok(())
})?;
// Close the HDF5 file
hdf5.close()?;
Ok(())
}
…while the microbenchmark merely uses it to give the compiler the illusion that output images are used, so that it cannot optimize out the entire computation.
grayscott_exercises::run_simulation(
&runner_options,
|uv| {
std::hint::black_box(uv);
Ok(())
},
)
Command-line options
Our simulation has a fair number of tuning parameters, listed in the RunnerOptions
and UpdateOptions structs introduced above.
We could just hardcode all these parameters, but doing so would anger the gods of software engineering and break feature parity with the reference C++ version. So instead we will make these parameters configurable via command-line parameters whose syntax and semantics strictly match those of the C++ version.
To this end, we can use the excellent clap library,
which provides the best API for parsing command line options that I have ever
seen in any programming language.
The first step, as usual, is to add clap as a dependency to our project. We
will also enable the derive optional feature, which is the key to the
aforementioned nice API:
# You do not need to run this command, it has already been run for you
cargo add --features=derive clap
We will then add some annotations to the definition of our options structs, explaining how they map to the command-line options that our program expects (which follow the syntax and defaults of the C++ reference version for interoperability):
use clap::Args;
/// Simulation runner parameters
#[derive(Args, Debug)]
pub struct RunnerOptions {
/// Number of rows in the concentration table
#[arg(short = 'r', long = "nbrow", default_value_t = 1080)]
pub num_rows: usize,
/// Number of columns in the concentration table
#[arg(short = 'c', long = "nbcol", default_value_t = 1920)]
pub num_cols: usize,
/// Number of output images to be produced
#[arg(short = 'n', long = "nbimage", default_value_t = 1000)]
pub num_output_images: usize,
/// Number of computation steps before producing an image
#[arg(short = 'e', long = "nbextrastep", default_value_t = 32)]
pub steps_per_image: usize,
/// Options controlling simulation updates
#[command(flatten)]
pub update: UpdateOptions,
}
/// Simulation update parameters
#[derive(Args, Debug)]
pub struct UpdateOptions {
/// Speed at which U is added to the simulation and U, V and P are removed
#[arg(short, long, default_value_t = 0.014)]
pub feedrate: Float,
/// Speed at which V turns into P
#[arg(short, long, default_value_t = 0.054)]
pub killrate: Float,
/// Simulated time interval on each simulation step
#[arg(short = 't', long, default_value_t = 1.0)]
pub deltat: Float,
}
We then create a top-level struct which represents our full command-line interface…
use clap::Parser;
/// Gray-Scott reaction simulation
///
/// This program simulates the Gray-Scott reaction through a finite difference
/// schema that gets integrated via the Euler method.
#[derive(Debug, Parser)]
#[command(version)]
struct Options {
/// Options controlling the simulation runner
#[command(flatten)]
runner: RunnerOptions,
/// Output file name
#[arg(short = 'o', long = "output", default_value = "output.h5")]
file_name: String,
}
…and in the main function of our final application, we call the automatically
generated parse() method of that struct and retrieve the parsed command-line
options.
fn main() -> Result<()> {
// Parse command line options
let options = Options::parse();
// ... now do something with "options" ...
}
That’s it. With no extra work, clap will automatically provide our
simulation with a command-line interface that follows all standard Unix
conventions (e.g. supports both --option value and --option=value), handles
user errors, parses argument strings to their respective concrete Rust types,
and prints auto-generated help strings when -h or --help is passed.
Also, if you spend 10 more minutes on it, you can make as many of these options
as you want configurable via environment variables too, which can be convenient
in scenarios where you cannot receive configuration through CLI parameters, like
inside of criterion microbenchmarks.
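As a sketch of what this could look like for one option (assuming clap’s optional `env` Cargo feature is enabled; the environment variable name below is hypothetical), a single extra attribute argument is enough:

```rust
// Hypothetical example: clap falls back to the GRAYSCOTT_NBROW environment
// variable when the -r/--nbrow option is not passed on the command line
#[arg(short = 'r', long = "nbrow", env = "GRAYSCOTT_NBROW", default_value_t = 1080)]
pub num_rows: usize,
```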
Hardcoded parameters
Not all parameters of the C++ reference version are configurable. Some of them are hardcoded, and can only be changed by altering the source code. Since we are aiming for perfect user interface parity with the C++ version, we want to replicate this design in the Rust version.
For now, we will do this by adding a few constants with the hardcoded values to the source code:
type Float = f32;

/// Weights of the discrete convolution stencil
pub const STENCIL_WEIGHTS: [[Float; 3]; 3] = [
    [0.25, 0.5, 0.25],
    [0.5, 0.0, 0.5],
    [0.25, 0.5, 0.25],
];

/// Offset from the top-left corner of STENCIL_WEIGHTS to its center
pub const STENCIL_OFFSET: [usize; 2] = [1, 1];

/// Diffusion rate of the U species
pub const DIFFUSION_RATE_U: Float = 0.1;

/// Diffusion rate of the V species
pub const DIFFUSION_RATE_V: Float = 0.05;
In Rust, const items let you declare compile-time constants, much like
constexpr variables in C++, parameters in Fortran, and #define STUFF 123
in C. We do not have the time to dive into the associated language
infrastructure, but for now, all you need to know is that the value of a const
is copy-pasted at each point of use, which ensures that the compiler’s
optimizer can specialize the code for the value of the parameter of interest.
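As a minimal illustration of this (with a hypothetical constant, not one from the simulation):

```rust
/// Hypothetical tuning parameter, standing in for constants
/// like DIFFUSION_RATE_U above
const DIFFUSION_RATE: f32 = 0.25;

fn diffusion_term(gradient: f32) -> f32 {
    // The value of DIFFUSION_RATE is inlined at this point of use, so the
    // optimizer can specialize the multiplication for this specific value
    DIFFUSION_RATE * gradient
}

fn main() {
    println!("{}", diffusion_term(8.0)); // prints "2"
}
```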
Progress reporting
Simulations can take a long time to run. It is not nice to make users wait for
them to run to completion without any CLI output indicating how far along they
are and how much time remains until they are done, especially when it is very
easy to add such reporting in Rust, thanks to the wonderful
indicatif library.
To use it, we start by adding the library to our project’s dependencies…
# You do not need to run this command, it has already been run for you
cargo add indicatif
Then in our main function, we create a progress bar with a number of steps matching the number of computation steps…
use indicatif::ProgressBar;
let progress = ProgressBar::new(options.runner.num_output_images as u64);
…we increment it after saving an image to the output HDF5 file…
progress.inc(1);
…and at the end of the simulation, we tell indicatif that we are done:³
progress.finish();
That’s all we need to add basic progress reporting to our simulation.
Final code layout
This is not a huge amount of code overall, but it does get uncomfortably large
and unfocused for a single code module. So in the exercises Rust project, the
simulation code has been split over multiple code modules.
We do not have the time to cover the Rust module system in this course, but if you are interested, feel free to skim through the code to get a rough idea of how modularization is done, and ask any question that comes to your mind while doing so.
Microbenchmarks can only access code from the main library (below the src
directory of the project, excluding the bin/ subdirectory), therefore most of
the code lies there. In addition, we have added a simulation binary under
src/bin/simulate.rs, and a microbenchmark under benches/simulate.rs.
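As a rough, hypothetical sketch of such a layout (the actual module names in the exercises project may differ), this could look like:

```
src/
├── lib.rs          # Declares the library modules: pub mod data; pub mod io; ...
├── data.rs         # Float, UV and Concentrations
├── io.rs           # HDF5Writer
├── options.rs      # RunnerOptions, UpdateOptions and hardcoded constants
├── simulation.rs   # run_simulation() and the update() stub
└── bin/
    └── simulate.rs # Full simulation binary
benches/
└── simulate.rs     # Microbenchmark
```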
Exercise
Here is a naïve implementation of a Gray-Scott simulation step implemented using
ndarray:
use crate::{
    data::UV,
    options::{
        UpdateOptions, DIFFUSION_RATE_U, DIFFUSION_RATE_V, STENCIL_OFFSET, STENCIL_WEIGHTS,
    },
};
/// Perform a simulation step
fn update(options: &UpdateOptions, start: &UV, end: &mut UV) {
// Species concentration matrix shape
let shape = start.shape();
// Iterate over pixels of the species concentration matrices
ndarray::azip!(
(
index (out_row, out_col),
out_u in &mut end.u,
out_v in &mut end.v,
&u in &start.u,
&v in &start.v
) {
// Determine the stencil's input region
let out_pos = [out_row, out_col];
let stencil_start = array2(|i| out_pos[i].saturating_sub(STENCIL_OFFSET[i]));
let stencil_end = array2(|i| (out_pos[i] + STENCIL_OFFSET[i] + 1).min(shape[i]));
let stencil_range = array2(|i| stencil_start[i]..stencil_end[i]);
let stencil_slice = ndarray::s![stencil_range[0].clone(), stencil_range[1].clone()];
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (start.u.slice(stencil_slice).indexed_iter())
.zip(start.v.slice(stencil_slice))
.fold(
[0.; 2],
|[acc_u, acc_v], (((in_row, in_col), &stencil_u), &stencil_v)| {
let weight = STENCIL_WEIGHTS[in_row][in_col];
[acc_u + weight * (stencil_u - u), acc_v + weight * (stencil_v - v)]
},
);
// Deduce the change in U and V concentration
let uv_square = u * v * v;
let du = DIFFUSION_RATE_U * full_u - uv_square + options.feedrate * (1.0 - u);
let dv = DIFFUSION_RATE_V * full_v + uv_square
- (options.feedrate + options.killrate) * v;
*out_u = u + du * options.deltat;
*out_v = v + dv * options.deltat;
}
);
}
/// Shorthand for creating a 2D Rust array from an index -> value mapping
fn array2<T>(f: impl FnMut(usize) -> T) -> [T; 2] {
std::array::from_fn(f)
}
Please integrate it into the codebase such that it is used by both the
simulation binary at src/bin/simulate.rs and the microbenchmark at
benches/simulate.rs, without needing to constantly copy and paste code between
the two locations.
Then make sure everything works by running both of them using the following commands:
# Must use -- to separate cargo options from program options
cargo run --release --bin simulate -- -n 5 -e 2
cargo bench --bench simulate
It is expected that the last command will take a few minutes to complete. We are just at the start of our journey, and there is a lot of optimization work to do. But the set of benchmark configurations is designed to remain relevant by the time the simulation is running much, much faster.
Also, starting with this chapter, the exercises are going to get significantly more complex. It is therefore a good idea to keep track of old versions of your work so that you can get back to them later. To do this, you can turn the exercises codebase into a git repository…
cd ~/exercises
git init
git add --all
git commit -m "Initial commit"
…then save a commit at the end of each chapter, or more generally whenever you feel like you have a codebase state that’s worth keeping around for later.
git add --all
git commit -m "<Describe your code changes here>"
1. The author of the original hdf5 crate sadly ceased maintenance and stopped responding to any contact. The Norwegian Meteorological Institute (a.k.a. met.no), which used the hdf5 crate, then decided to take over its maintenance. But because the official crates.io Rust package registry does not allow for “hostile takeovers” of software without explicit owner permission, they had to push their fork to crates.io under a different name, which is why it is called hdf5-metno. ↩
2. …besides giving the compiler the illusion that the results are used, which is needed to make sure the computation will not be declared useless by the compiler’s optimizer and optimized out. ↩
3. This step is needed because indicatif allows you to add more work to the progress bar while the computation is running. This is not so nice from a user’s perspective, but it may be necessary when the total amount of work cannot be discovered ahead of time, or when work can be dynamically added by the user while the computation is already running. ↩