Optimising path tracing with SIMD
Following on from path tracing in parallel with Rayon, I had a lot of other optimisations I wanted to try. In particular I wanted to see if I could match the CPU performance of @aras_p’s C++ path tracer in Rust. He’d done a fair amount of optimising so it seemed like a good target to aim for. To get a better comparison I copied his scene and also added his light sampling approach, which he talks about here. I also implemented a live render loop mimicking his.
My initial unoptimized code was processing 10Mrays/s on my laptop. Aras’s code (with GPGPU disabled) was doing 45.5Mrays/s. I had a long way to go from here!
tl;dr did I match the C++ in Rust? Almost. My SSE4.1 version is doing 41.2Mrays/s, about 10% slower than the target 45.5Mrays/s, running on Windows on my laptop. The long answer is more complicated but I will go into that later.
Overview
This is potentially going to turn into a long post, so here’s what I’m going to try and cover:
- A very quick introduction to SIMD
- SIMD support in Rust
- Rewriting Vec3 to use SSE2
- A segue into floating point in Rust
- Preparing data for SIMD with Structure of Arrays
- Getting the result out again
- Loading aligned data
- Writing a SIMD wrapper and supporting AVX2
- Final performance results
What is SIMD?
A big motivation for me doing this was to write some SIMD code. I knew what SIMD was but I’d never actually written any, or none of significance at least.
If you are unfamiliar with SIMD, it stands for Single Instruction Multiple Data. What this means is there are registers on most common CPUs which contain multiple values, e.g. 4 floats, that can be processed with a single instruction.
The image above shows adding two SIMD vectors, each containing 4 elements, together in a single instruction. This increases the throughput of math heavy code. On Intel, SSE2 has been available since 2001, providing 128 bit registers which can hold 4 floats or 2 doubles. More recent chips have AVX, which added 256 bit registers, and AVX-512, which added 512 bit registers.
In this path tracer we’re performing collision detection of a ray against an array of spheres, one sphere at a time. With SIMD we could check 4 or even 8 spheres at a time.
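As a minimal example of what this looks like in Rust, here’s adding four pairs of floats with one addps using the SSE intrinsics in std::arch. This snippet is just for illustration and isn’t from the path tracer:

use std::arch::x86_64::*;

// Add four pairs of f32 values with a single addps instruction.
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr()); // load 4 floats into a 128 bit register
        let vb = _mm_loadu_ps(b.as_ptr());
        let vc = _mm_add_ps(va, vb); // one instruction adds all 4 lanes
        let mut out = [0.0; 4];
        _mm_storeu_ps(out.as_mut_ptr(), vc);
        out
    }
}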
There’s a few good references I’m using for this:
- Aras’s path tracing blog series particularly parts 7, 8 and 9
- Intel Intrinsics Guide
- Rust std::arch Module documentation
- Agner Fog’s Instruction tables with latencies, throughput and micro-ops
SIMD support in Rust
Rust SIMD support is available in the nightly compiler and should make it into a stable Rust release soon. There are a couple of related RFCs around SIMD support.
- Target feature RFC#2045 adds the ability to:
- determine which features (CPU instructions) are available at compile time
- determine which features are available at run-time
- embed code for different sets of features in the same binary
- Stable SIMD RFC#2325 adds support for:
- 128 and 256 bit vector types and intrinsics for Intel chipsets, e.g. SSE and AVX - other instruction sets will be added later
- the ability to branch on target feature type at runtime so you can use a feature if it’s available on the CPU the program is running on
One thing that is not currently being stabilised is cross platform abstraction around different architecture’s SIMD instructions. I’m sticking with SSE for now as I want to get experience with a native instruction set rather than an abstraction.
Another thing worth mentioning is that the SIMD intrinsics are all marked unsafe; using intrinsics is a bit like calling FFI functions, they aren’t playing by Rust’s rules and are thus unsafe. Optimised code often sacrifices a bit of safety and readability for speed, and this will be no exception.
Converting Vec3 to SSE2
Most of the advice I’ve heard around SIMD is that just making your math types use it is not the way to get performance, and that you are better off writing SIMD math code directly without wrappers. One reason is that a Vec3 only uses 3 of the available SIMD lanes, so even on SSE you’re only at 75% occupancy and you won’t get any benefit from larger registers. Another reason is that the components of a Vec3 are related, but values in SIMD vector lanes have no semantic relationship. In practice what this means is that doing operations across lanes, like a dot product, is cumbersome and not that efficient. See “The Awkward Vec4 Dot Product” slide on page 43 of this GDC presentation.
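To make that awkwardness concrete, here’s a rough sketch of what an SSE2-backed Vec3 and its dot product might look like. This is illustrative only, not my actual implementation (which is linked below):

use std::arch::x86_64::*;

// Rough sketch of an SSE2-backed Vec3: x, y and z live in the first three
// lanes of a __m128 and the fourth lane is unused padding.
#[derive(Clone, Copy)]
pub struct Vec3(__m128);

impl Vec3 {
    pub fn new(x: f32, y: f32, z: f32) -> Vec3 {
        // _mm_set_ps takes lanes from high to low, so the unused lane gets z
        unsafe { Vec3(_mm_set_ps(z, z, y, x)) }
    }

    pub fn get_x(self) -> f32 {
        // lane 0 holds x
        unsafe { _mm_cvtss_f32(self.0) }
    }
}

// The dot product is where the awkwardness shows up: after the multiply you
// still have to add the x, y and z lanes together across the register.
pub fn dot(a: Vec3, b: Vec3) -> f32 {
    unsafe {
        let m = _mm_mul_ps(a.0, b.0);
        let y = _mm_shuffle_ps(m, m, 0b01_01_01_01); // broadcast lane 1
        let z = _mm_shuffle_ps(m, m, 0b10_10_10_10); // broadcast lane 2
        _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(m, y), z))
    }
}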
Given all of the above it’s not surprising that Aras blogged that he didn’t see much of a gain from converting his float3 struct to SIMD. It’s actually one of the last things I implemented. I followed the same post he did on “How to write a maths library in 2016”, except of course in Rust rather than C++. My SSE2 Vec3 can be found here. This actually gave me a pretty big boost, from 10Mrays/s to 20.7Mrays/s, without any other optimisations. That’s a large gain, so why did I see this when Aras’s C++ version only saw a slight change?
I think the answer has to do with my next topic.
Floating point in Rust
I think every game I’ve ever worked on has enabled the fast math compiler setting, -ffast-math on GCC/Clang or /fp:fast on MSVC. This setting sacrifices IEEE 754 conformance to allow the compiler to optimise floating point operations more aggressively. Agner Fog recently released a paper on NaN propagation which talks about the specifics of what -ffast-math actually enables. There is currently no equivalent of -ffast-math for the Rust compiler, and that doesn’t look like something that will change any time soon. There’s a bit of discussion about how Rust might support it; there are currently intrinsics for fast float operations which you can use in nightly, but no plans for stable Rust. I found a bit of discussion on the Rust internals forums, like this post, but it seems like early days.
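For reference, this is roughly what using those nightly intrinsics looks like. It’s just a sketch, not something I’m actually using in the path tracer:

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};

// The nightly-only fast float intrinsics let the optimiser relax strict
// IEEE 754 semantics for individual operations.
fn mul_add_fast(a: f32, b: f32, c: f32) -> f32 {
    // unsafe because the behaviour is undefined if the inputs or result are not finite
    unsafe { fadd_fast(fmul_fast(a, b), c) }
}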
My theory is that the large jump I saw going from precise IEEE 754 floats to SSE was in part due to the lack of -ffast-math style optimisations. This is speculation on my part though; I haven’t done any digging to back it up.
A more well known floating point wart in Rust is the distinction between the PartialOrd and Ord traits. This distinction exists because floating point values can be NaN, which means they don’t have a total order, which is required by Ord but not PartialOrd. As a consequence you can’t use a lot of standard library functions, like sort, with floats. This is an ergonomics issue rather than a performance issue, but it is one I ran into working on this so I thought I’d mention it. I’ve seen the noisy_float crate recommended as a workaround for float ergonomics issues but I haven’t tried it.
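As a small example of the ergonomics problem, because f32 is only PartialOrd you have to spell the comparison out yourself (this sketch just unwraps, which would panic if a NaN showed up):

fn main() {
    // f32 is PartialOrd but not Ord, so slice::sort() and Iterator::min() can't be used
    let mut ts = vec![0.3_f32, 0.1, 0.2];
    // ts.sort(); // does not compile: the trait bound `f32: Ord` is not satisfied
    ts.sort_by(|a, b| a.partial_cmp(b).unwrap()); // panics if a NaN sneaks in
    assert_eq!(ts, vec![0.1, 0.2, 0.3]);
}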
Preparing data for SIMD
As I mentioned earlier, SSE2 should allow testing a ray against 4 spheres at a time, which sounds simple enough. The original scalar hit test code looks like so:
impl Sphere {
    pub fn hit(&self, ray: &Ray, t_min: f32, t_max: f32) -> Option<RayHit> {
        let oc = ray.origin - self.centre;
        let a = dot(ray.direction, ray.direction);
        let b = dot(oc, ray.direction);
        let c = dot(oc, oc) - self.radius * self.radius;
        let discriminant = b * b - a * c;
        if discriminant > 0.0 {
            let discriminant_sqrt = discriminant.sqrt();
            let t = (-b - discriminant_sqrt) / a;
            if t < t_max && t > t_min {
                let point = ray.point_at_parameter(t);
                let normal = (point - self.centre) / self.radius;
                return Some(RayHit { t, point, normal });
            }
            let t = (-b + discriminant_sqrt) / a;
            if t < t_max && t > t_min {
                let point = ray.point_at_parameter(t);
                let normal = (point - self.centre) / self.radius;
                return Some(RayHit { t, point, normal });
            }
        }
        None
    }
}
Reasonably straightforward: the first five lines are simple math operations, followed by some conditionals to determine whether the ray hit or not. To convert this to SSE2 we first need to load the data into the SIMD registers, do the math bit, and finally extract which one of the SIMD lanes contains the result we’re after (the nearest hit point).
First we need to splat the x, y, and z components of the ray origin and direction into SSE registers. Then we need to iterate over 4 spheres at a time, loading the x, y and z components of their centre points into registers containing 4 x components, 4 y components, and so on.
For the ray data we use the _mm_set_ps1 intrinsic to splat each ray origin and direction component across the 4 lanes of an SSE register:
// load ray origin
let ro_x = _mm_set_ps1(ray.origin.get_x());
let ro_y = _mm_set_ps1(ray.origin.get_y());
let ro_z = _mm_set_ps1(ray.origin.get_z());
// load ray direction
let rd_x = _mm_set_ps1(ray.direction.get_x());
let rd_y = _mm_set_ps1(ray.direction.get_y());
let rd_z = _mm_set_ps1(ray.direction.get_z());
There are 3 ways that I’m aware of to load the sphere data for SSE:
- _mm_set_ps which takes 4 f32 parameters
- _mm_loadu_ps which takes an unaligned *const f32 pointer
- _mm_load_ps which takes a 16 byte aligned *const f32 pointer
I did a little experiment to see what the assembly generated by these might look like (godbolt):
use std::arch::x86_64::*;

pub fn set_and_add(a: f32, b: f32, c: f32, d: f32, e: f32, f: f32, g: f32, h: f32) -> __m128 {
    unsafe {
        _mm_add_ps(
            _mm_set_ps(a, b, c, d),
            _mm_set_ps(e, f, g, h))
    }
}

pub fn load_unaligned_and_add(a: &[f32; 4], b: &[f32; 4]) -> __m128 {
    unsafe {
        _mm_add_ps(
            _mm_loadu_ps(a.as_ptr()),
            _mm_loadu_ps(b.as_ptr()))
    }
}

pub fn load_aligned_and_add(a: &[f32; 4], b: &[f32; 4]) -> __m128 {
    unsafe {
        _mm_add_ps(
            _mm_load_ps(a.as_ptr()),
            _mm_load_ps(b.as_ptr()))
    }
}
And the generated assembly:
playground::set_and_add:
unpcklps %xmm0, %xmm1
unpcklps %xmm2, %xmm3
movlhps %xmm1, %xmm3
unpcklps %xmm4, %xmm5
unpcklps %xmm6, %xmm7
movlhps %xmm5, %xmm7
addps %xmm3, %xmm7
movaps %xmm7, (%rdi)
movq %rdi, %rax
retq
playground::load_unaligned_and_add:
movups (%rsi), %xmm0
movups (%rdx), %xmm1
addps %xmm0, %xmm1
movaps %xmm1, (%rdi)
movq %rdi, %rax
retq
playground::load_aligned_and_add:
movaps (%rsi), %xmm0
addps (%rdx), %xmm0
movaps %xmm0, (%rdi)
movq %rdi, %rax
retq
The first option has to do a lot of work to get data into registers before it can perform the addps. The last option appears to be doing the least work based on instruction count; the addps can take one of its operands as a memory address because it’s aligned correctly.
My initial implementation was an array of Sphere objects. The x values for 4 spheres are not in contiguous memory, so I would be forced to use _mm_set_ps to load the data. To use _mm_loadu_ps I need to store my Sphere data in Structure of Arrays (SoA) format. SoA means that instead of an array of individual Sphere objects I have a struct containing an array for each Sphere data member. My SpheresSoA struct looks like so:
pub struct SpheresSoA {
    centre_x: Vec<f32>,
    centre_y: Vec<f32>,
    centre_z: Vec<f32>,
    radius_sq: Vec<f32>,
    radius_inv: Vec<f32>,
    len: usize,
}
There are other advantages to SoA. In the above structure I’m storing precalculated values for radius * radius and 1.0 / radius, since if you look at the original hit function, radius on its own is never actually used. Before switching to SoA, adding precalculated values to my Sphere struct actually made things slower, presumably because I’d increased the size of the Sphere struct, reducing my CPU cache utilisation. With SpheresSoA, if there is no ray hit result then the radius_inv data will never be accessed, so it’s not wastefully pulled into CPU cache. Switching to SoA did speed up my scalar code too, although I didn’t note down the performance numbers at the time.
Now, to iterate over four spheres at a time, my loop looks like:
// loop over 4 spheres at a time
for (((centre_x, centre_y), centre_z), radius_sq) in self
    .centre_x
    .chunks(4)
    .zip(self.centre_y.chunks(4))
    .zip(self.centre_z.chunks(4))
    .zip(self.radius_sq.chunks(4))
{
    // load sphere centres
    let c_x = _mm_loadu_ps(centre_x.as_ptr());
    let c_y = _mm_loadu_ps(centre_y.as_ptr());
    let c_z = _mm_loadu_ps(centre_z.as_ptr());
    // load radius_sq
    let r_sq = _mm_loadu_ps(radius_sq.as_ptr());
    // ... everything else
}
Because I’m iterating in chunks of 4, I make sure the array sizes are a multiple of 4, adding some bogus spheres to the end to pad the array out if necessary. I did try using get_unchecked on each array instead of chunks(4) and zip, but at the time it appeared to be slower. I might try it again though, as the iterator version is a bit obtuse and generates a lot of code. Note that I’m using _mm_loadu_ps because my Vec<f32> arrays are not 16 byte aligned. I’ll talk about that later.
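The padding itself is straightforward. As a rough sketch, a SpheresSoA constructor that pads might look something like this (simplified and with approximate names, not my exact code):

// Rough sketch of building SpheresSoA with the arrays padded to a multiple of 4.
// The bogus padding spheres are placed far away with zero radius so they can
// never produce a hit.
impl SpheresSoA {
    pub fn new(spheres: &[Sphere]) -> SpheresSoA {
        let len = (spheres.len() + 3) & !3; // round up to a multiple of 4
        let mut soa = SpheresSoA {
            centre_x: Vec::with_capacity(len),
            centre_y: Vec::with_capacity(len),
            centre_z: Vec::with_capacity(len),
            radius_sq: Vec::with_capacity(len),
            radius_inv: Vec::with_capacity(len),
            len,
        };
        for i in 0..len {
            if let Some(s) = spheres.get(i) {
                soa.centre_x.push(s.centre.get_x());
                soa.centre_y.push(s.centre.get_y());
                soa.centre_z.push(s.centre.get_z());
                soa.radius_sq.push(s.radius * s.radius);
                soa.radius_inv.push(1.0 / s.radius);
            } else {
                // bogus padding sphere: far away with zero radius
                soa.centre_x.push(1.0e8);
                soa.centre_y.push(1.0e8);
                soa.centre_z.push(1.0e8);
                soa.radius_sq.push(0.0);
                soa.radius_inv.push(0.0);
            }
        }
        soa
    }
}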
Getting the result out again
I won’t paste the full SSE2 (actually I’m using a little SSE4.1) ray spheres intersection test source here because it’s quite verbose, but it can be found here if you’re interested. I mostly wrote this from scratch using the Intel Intrinsics Guide to match my scalar code to the SSE2 equivalent, using Aras’s post on SSE sphere collision and his subsequent post on ryg’s optimisations as a guide. I think I had got pretty close to completion at the end of one evening but wasn’t sure how to extract the minimum hit result out of the SSE vector, so I just went for the scalar option:
// copy results out into scalar code to get return value (if any)
let mut hit_t_array = [t_max, t_max, t_max, t_max];
_mm_storeu_ps(hit_t_array.as_mut_ptr(), hit_t);
// this is more complicated than it needs to be because I couldn't
// just use `min` as f32 doesn't implement the `Ord` trait
let (hit_t_lane, hit_t_scalar) = hit_t_array.iter().enumerate().fold(
    (self.len, t_max),
    |result, t| if *t.1 < result.1 { (t.0, *t.1) } else { result },
);
if hit_t_scalar < t_max {
    // get the index of the result in the SpheresSoA vectors and return it
}
All the above code is doing is finding the minimum value in the hit_t SIMD lanes to see if we have a valid hit. Aras’s code was able to do this in SSE2 with his h_min function (the h standing for horizontal, meaning across the SIMD register lanes), but I wanted to understand how it was working before I used it. The main reason I didn’t understand h_min is that I’d never used a shuffle before. I guess they’re familiar to people who write a lot of shaders, since swizzling is pretty common, but I haven’t done a lot of that either.
My Rust hmin
looks like:
v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _mm_shuffle!(0, 0, 3, 2)));
v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _mm_shuffle!(0, 0, 0, 1)));
_mm_cvtss_f32(v)
What this is doing is combining _mm_shuffle_ps and _mm_min_ps to compare all of the lanes with each other and shuffle the minimum value into lane 0 (the lowest 32 bits of the vector). The shuffle uses a macro that encodes where lanes are being shuffled from; it’s a Rust implementation of the _MM_SHUFFLE macro found in xmmintrin.h:
macro_rules! _mm_shuffle {
    ($z:expr, $y:expr, $x:expr, $w:expr) => {
        ($z << 6) | ($y << 4) | ($x << 2) | $w
    };
}
So for example to reverse the order of the SIMD lanes you would do:
_mm_shuffle_ps(a, a, _mm_shuffle!(0, 1, 2, 3))
Which is saying move lane 0 to lane 3, lane 1 to lane 2, lane 2 to lane 1 and lane 3 to lane 0. Lane 3 is the leftmost lane and lane 0 is the rightmost, which seems backwards but I assume it’s due to x86 being little endian. As you can see, _mm_shuffle_ps takes two operands, so you can interleave lanes from two values into the result, but I haven’t had a use for that so far.
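For completeness, here’s a small sketch of what the two-operand form can do, even though I didn’t need it here:

use std::arch::x86_64::*;

// _mm_shuffle_ps can mix lanes from two vectors: the low two result lanes
// always come from the first operand and the high two from the second.
unsafe fn interleave_example() -> __m128 {
    let a = _mm_set_ps(3.0, 2.0, 1.0, 0.0); // lanes 3..0 = [3, 2, 1, 0]
    let b = _mm_set_ps(7.0, 6.0, 5.0, 4.0); // lanes 3..0 = [7, 6, 5, 4]
    // result lanes 3..0 = [7, 6, 1, 0]: lanes 0 and 1 from a, lanes 2 and 3 from b
    _mm_shuffle_ps(a, b, _mm_shuffle!(3, 2, 1, 0))
}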
So given that, if we put some values into hmin
:
let a = _mm_set_ps(A, B, C, D);
// a = [A, B, C, D]
let b = _mm_shuffle_ps(a, a, _mm_shuffle!(0, 0, 3, 2));
// b = [D, D, A, B]
let c = _mm_min_ps(a, b);
// c = [min(A, D), min(B, D), min(C, A), min(D, B)]
let d = _mm_shuffle_ps(c, c, _mm_shuffle!(0, 0, 0, 1));
// d = [min(B, D), min(B, D), min(B, D), min(A, C)]
let e = _mm_min_ps(c, d);
// e = [min(A, B, D), min(B, D), min(A, B, C, D), min(A, B, C, D)]
let f = _mm_cvtss_f32(e);
// f = min(A, B, C, D)
The result in lane 0 has been compared with all of the values in the 4 lanes. I thought this was a nice example of the kind of problems that need to be solved when working with SIMD.
Aligning data
As I mentioned earlier, the best way that I’m aware of to load data into SSE registers is with an f32 array aligned to 16 bytes, so you can use the _mm_load_ps intrinsic. Unfortunately Rust’s standard Vec doesn’t support custom allocators, so there’s no way to specify a specific alignment for my Vec<f32>.
Another way would be to store aligned arrays of floats that are the same size as the SIMD vector. I did this with a #[repr(align(16))] wrapper around a float array, which looks like:
pub mod m128 {
    #[repr(C, align(16))]
    pub struct ArrayF32xN(pub [f32; 4]);
}

pub mod m256 {
    #[repr(C, align(32))]
    pub struct ArrayF32xN(pub [f32; 8]);
}
I actually just wanted to use #[repr(align(n))] because I implemented it in the compiler and I’d never had an excuse to use it. It would probably be easier to make a union of an f32 array and the SIMD vector type.
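For reference, the union alternative might look something like this sketch (I haven’t used this in my code):

use std::arch::x86_64::*;

// Sketch of the union alternative: the f32 array and the SIMD vector share
// storage and the union picks up __m128's 16 byte alignment automatically.
#[repr(C)]
pub union F32x4 {
    pub array: [f32; 4],
    pub simd: __m128,
}

fn union_example() -> f32 {
    let v = F32x4 { array: [1.0, 2.0, 3.0, 4.0] };
    // reading a union field is unsafe, in keeping with the rest of this code
    unsafe { _mm_cvtss_f32(v.simd) }
}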
In any case my SpheresSoA
now looks like:
pub struct SpheresSoA {
    centre_x: Vec<ArrayF32xN>,
    centre_y: Vec<ArrayF32xN>,
    centre_z: Vec<ArrayF32xN>,
    radius_sq: Vec<ArrayF32xN>,
    radius_inv: Vec<ArrayF32xN>,
    num_chunks: u32,
    num_spheres: u32,
}
So I can remove the chunks(4) from my ray spheres intersection loop, as the spheres are already in ArrayF32xN sized chunks.
SIMD wrapper with AVX2 support
I had intended to write a wrapper later, after getting some hands-on experience with the SSE intrinsics. One reason being that
let discr = nb * nb - c;
Is a lot easier to read and write than
let discr = _mm_sub_ps(_mm_mul_ps(nb, nb), c);
The other reason was that the SSE code for the ray spheres collision was mostly not specific to the number of lanes. So I thought if I wrote a wrapper then it wouldn’t be that difficult to add AVX support, which fits 8 floats in a vector, while still having exactly the same collision code.
The way I implemented the wrapper was as separate modules for the 128 bit, 256 bit and 32 bit scalar fallback versions. Then I just use the one that matches the target features enabled by the compiler. This means I’m using compile time target feature selection, not runtime target feature selection. I haven’t worked out exactly how I want to implement the runtime version yet.
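If I do tackle the runtime version, I expect it would look something like this sketch using the is_x86_feature_detected! macro and #[target_feature]; this isn’t wired into my path tracer yet:

// Sketch of runtime target feature selection: compile an AVX2 version of a hot
// function alongside the SSE2 baseline and choose between them at runtime.
#[target_feature(enable = "avx2")]
unsafe fn hit_spheres_avx2(/* ray, spheres, ... */) { /* AVX2 implementation */ }

fn hit_spheres_sse2(/* ray, spheres, ... */) { /* SSE2 implementation */ }

fn hit_spheres(/* ray, spheres, ... */) {
    if is_x86_feature_detected!("avx2") {
        // safe to call because we just checked the CPU supports AVX2
        unsafe { hit_spheres_avx2() }
    } else {
        hit_spheres_sse2()
    }
}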
In a very abridged version for example I have the following modules:
pub mod simd {
    #[cfg(target_feature = "sse2")]
    pub mod m128 {
        pub struct f32xN(pub __m128);
    }

    #[cfg(target_feature = "avx2")]
    pub mod m256 {
        pub struct f32xN(pub __m256);
    }

    pub mod m32 {
        pub struct f32xN(pub f32);
    }

    // re-export fallback scalar code
    #[cfg(not(any(target_feature = "sse2", target_feature = "avx2")))]
    pub use self::m32::*;

    // re-export sse2 if no avx2
    #[cfg(all(target_feature = "sse2", not(target_feature = "avx2")))]
    pub use self::m128::*;

    // re-export avx2
    #[cfg(target_feature = "avx2")]
    pub use self::m256::*;
}
So user code can use simd::f32xN and it will use the widest SIMD registers available to the compiler. I have a bunch of other types like ArrayF32xN, which I use to align data for loading into f32xN; there’s also i32xN for integer SIMD and b32xN for boolean values. I haven’t wrapped every possible instruction, just what I was using for ray sphere intersections. I’m not using traits to define a common interface for the different bit sizes, although I possibly should.
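The operator overloading that lets the wrapped code read like nb * nb - c is just the standard std::ops traits on the newtype. As a rough sketch of the shape of the SSE2 module (not the actual wrapper code):

use std::arch::x86_64::*;
use std::ops::{Mul, Sub};

#[allow(non_camel_case_types)]
#[derive(Clone, Copy)]
pub struct f32xN(pub __m128);

// Sketch of the operator overloads that let SIMD code read like scalar code.
// The real wrapper covers many more operations and the AVX2/scalar variants.
impl Mul for f32xN {
    type Output = f32xN;
    fn mul(self, rhs: f32xN) -> f32xN {
        unsafe { f32xN(_mm_mul_ps(self.0, rhs.0)) }
    }
}

impl Sub for f32xN {
    type Output = f32xN;
    fn sub(self, rhs: f32xN) -> f32xN {
        unsafe { f32xN(_mm_sub_ps(self.0, rhs.0)) }
    }
}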
My wrapper code can be found here and the ray sphere intersection test using it is here.
For x86_64 targets rustc enables SSE2 by default. There are a couple of different ways to enable other target features: one is RUSTFLAGS and the other is .cargo/config. The latter was most useful for me as it meant I didn’t need to either set flags globally or remember to set them on every invocation of the compiler. My .cargo/config to enable AVX2 looks like this:
[target.'cfg(any(target_arch = "x86", target_arch = "x86_64"))']
rustflags = ["-C", "target-feature=+avx2"]
Final performance results
I primarily tested performance on my laptop running Windows 10 Home, compiling with cargo build --release of course, with rustc 1.28.0-nightly (71e87be38 2018-05-22). My performance numbers for each iteration of my path tracer are:
| Feature                    | Mrays/s |
| -------------------------- | ------- |
| Aras’s C++ SSE4.1 version  | 45.5    |
| My starting point          | 10.0    |
| SSE2 Vec3                  | 20.7    |
| Plain SSE4.1               | 38.1    |
| Wrapped SSE4.1             | 38.7    |
| Wrapped AVX2               | 41.9    |
| Wrapped SSE4.1 w/ LTO      | 41.2    |
| Wrapped AVX2 w/ LTO        | 45.0    |
The difference between the ‘Plain SSE4.1’ and ‘Wrapped SSE4.1’ results is because I added the hmin changes and aligned the SpheresSoA data at the same time as I added the SIMD wrappers. LTO stands for link time optimisation. It did give a nice increase in performance, but at a large increase in build time.
I do most of my development on Linux and was confused for a long time because my Linux performance was much higher than what I was seeing in Windows on the same laptop. At the same time the Linux peaks were higher but it fluctuated a lot as the path tracer ran. The Windows numbers were lower but stable. This difference is very likely due to CPU frequency scaling differences between Linux and Windows. It does make benchmarking on my Linux laptop a bit unreliable though.
I did occasionally test on a desktop machine too, which gave the following results:
| Feature                    | Mrays/s |
| -------------------------- | ------- |
| Aras’s C++ SSE4.1 version  | 47.7    |
| Wrapped SSE4.1             | 43.0    |
| Wrapped AVX2               | 47.0    |
| Wrapped SSE4.1 w/ LTO      | 45.8    |
| Wrapped AVX2 w/ LTO        | 51.0    |
So in this case my AVX2 version was faster, which is nice but not really an apples versus apples comparison.
So what differences are remaining? A couple of things stood out in my profiling:
- The C++ compiler appeared to be linking in an SSE2 implementation of sin and cos; Rust is using the f64 sin and cos in the MSVC toolchain version of the compiler.
- Rayon had higher overhead than Enki (the C thread pool Aras is using). There were some pretty deep callstacks in Rayon and the hot function that was showing up in VTune was pretty large. Enki by comparison is pretty simple.
Addressing these things might get me a bit closer.
It’s also a bit surprising that AVX2 didn’t make a huge difference; it seemed to be bottlenecked on performing square roots.
If you made it this far, well done and thanks for reading! I covered a lot in this post but at the same time I left out a lot of details. If you want to know more, take a look at my spheres_simd_wrapped branch, which has my most optimised code, and feel free to ask questions or give feedback via the /r/rust post on Reddit.