Optimising path tracing: the last 10%

In my last post on optimising my Rust path tracer with SIMD I had got within 10% of my performance target, that is, Aras’s C++ SSE4.1 path tracer. From profiling I had determined that the main differences were MSVC using SSE versions of sinf and cosf and differences between the Rayon and enkiTS thread pools. The first thing I tried was implementing an SSE2 version of sin_cos based off of Julien Pommier’s code that I found via a bit of googling. This was enough to get my SSE4.1 implementation to match the performance of Aras’s SSE4.1 code. I had a slight advantage in that I just call sin_cos as a single function versus separate sin and cos functions, but meh, I’m calling my performance target reached. Final performance results are at the end of this post if you just want to skip to that.

The other part of this post is about Rust’s runtime and compile time CPU feature detection and some wrong turns I took along the way.

Target feature selection

As I mentioned in my previous post, Rust is about to stabilise several SIMD related features in 1.27. Apart from the SIMD intrinsics themselves, this adds the ability to check for CPU target features (i.e. supported instruction sets) at both compile time and runtime.

The static check involves using #[cfg(target_feature = "<feature>")] around blocks which turns code on or off at compile time. Here’s an example of compile time feature selection:

#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

pub fn add_scalar(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((a, b), c) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
        *c = a + b;
    }
}

#[cfg(target_feature = "sse2")]
pub fn add_sse2(a: &[f32], b: &[f32], c: &mut [f32]) {
    // for simplicity assume length is a multiple of chunk size
    for ((a, b), c) in a.chunks(4).zip(b.chunks(4)).zip(c.chunks_mut(4)) {
        unsafe {
            _mm_storeu_ps(
                c.as_mut_ptr(),
                _mm_add_ps(
                    _mm_loadu_ps(a.as_ptr()),
                    _mm_loadu_ps(b.as_ptr())));
        }
    }
}

#[cfg(target_feature = "avx2")]
pub fn add_avx2(a: &[f32], b: &[f32], c: &mut [f32]) {
    // for simplicity assume length is a multiple of chunk size
    for ((a, b), c) in a.chunks(8).zip(b.chunks(8)).zip(c.chunks_mut(8)) {
        unsafe {
            _mm256_storeu_ps(
                c.as_mut_ptr(),
                _mm256_add_ps(
                    _mm256_loadu_ps(a.as_ptr()),
                    _mm256_loadu_ps(b.as_ptr())));
        }
    }
}

pub fn add(a: &[f32], b: &[f32], c: &mut [f32]) {
    #[cfg(target_feature = "avx2")]
    return add_avx2(a, b, c);
    #[cfg(target_feature = "sse2")]
    return add_sse2(a, b, c);
    #[cfg(not(any(target_feature = "avx2", target_feature = "sse2")))]
    add_scalar(a, b, c);
}

If you look at the assembly listing on Compiler Explorer you can see that both add_scalar and add_sse2 are in the assembly output (SSE2 is always available on x86-64 targets) but add_avx2 is not, as AVX2 is not enabled by default. The add function has inlined the SSE2 version. The target features available at compile time can be controlled via rustc flags.
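For example, to enable AVX2 for the whole build you could pass the standard codegen flag to rustc (shown here via an environment variable; a .cargo/config entry, as used later in these posts, works too):

RUSTFLAGS="-C target-feature=+avx2" cargo build --release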

The runtime check uses the #[target_feature(enable = "<feature>")] attribute, which can be applied to functions to emit code using that feature. The functions must be marked as unsafe because if they were executed on a CPU that didn’t support that feature the program would crash. The is_x86_feature_detected! macro can be used to determine if the feature is available at runtime and then call the appropriate function. Here’s an example of runtime feature selection:

#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

pub fn add_scalar(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((a, b), c) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
        *c = a + b;
    }
}

#[cfg_attr(any(target_arch = "x86", target_arch = "x86_64"), target_feature(enable = "sse2"))]
pub unsafe fn add_sse2(a: &[f32], b: &[f32], c: &mut [f32]) {
    // for simplicity assume length is a multiple of chunk size
    for ((a, b), c) in a.chunks(4).zip(b.chunks(4)).zip(c.chunks_mut(4)) {
        _mm_storeu_ps(
            c.as_mut_ptr(),
            _mm_add_ps(
                _mm_loadu_ps(a.as_ptr()),
                _mm_loadu_ps(b.as_ptr())));
    }
}

#[cfg_attr(any(target_arch = "x86", target_arch = "x86_64"), target_feature(enable = "avx2"))]
pub unsafe fn add_avx2(a: &[f32], b: &[f32], c: &mut [f32]) {
    // for simplicity assume length is a multiple of chunk size
    for ((a, b), c) in a.chunks(8).zip(b.chunks(8)).zip(c.chunks_mut(8)) {
        _mm256_storeu_ps(
            c.as_mut_ptr(),
            _mm256_add_ps(
                _mm256_loadu_ps(a.as_ptr()),
                _mm256_loadu_ps(b.as_ptr())));
    }
}

pub fn add(a: &[f32], b: &[f32], c: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { add_avx2(a, b, c) };
        }
        if is_x86_feature_detected!("sse2") {
            return unsafe { add_sse2(a, b, c) };
        }
    }
    add_scalar(a, b, c);
}

Looking again at the assembly listing on Compiler Explorer you can see a call to std::stdsimd::arch::detect::os::check_for which depending on the result jumps to the AVX2 implementation or passes through to the inlined SSE2 implementation (again, because SSE2 is always available on x86-64).

I do not know what the Rust calling convention is but there seems to be a lot of saving and restoring registers to and from the stack around the call to check_for.

example::add:
  pushq %rbp
  pushq %r15
  pushq %r14
  pushq %r13
  pushq %r12
  pushq %rbx
  pushq %rax
  movq %r9, %r12
  movq %r8, %r14
  movq %rcx, %r13
  movq %rdx, %r15
  movq %rsi, %rbp
  movq %rdi, %rbx
  movl $15, %edi
  callq std::stdsimd::arch::detect::os::check_for@PLT
  testb %al, %al
  je .LBB3_1
  movq %rbx, %rdi
  movq %rbp, %rsi
  movq %r15, %rdx
  movq %r13, %rcx
  movq %r14, %r8
  movq %r12, %r9
  addq $8, %rsp
  popq %rbx
  popq %r12
  popq %r13
  popq %r14
  popq %r15
  popq %rbp
  jmp example::add_avx2@PLT

So there’s definitely some overhead in performing this check at runtime and that’s certainly noticeable comparing the results of my compile time AVX2 performance (51.1Mrays/s) to runtime checked AVX2 performance (47.6Mrays/s). If you know exactly what hardware your program is going to be running on it’s better to use the compile time method. In my case, the hit method is called millions of times so this overhead adds up. If I could perform this check in an outer function then it should reduce the overhead of the runtime check.

Update - 2018-06-23

There was some great discussion on /r/rust about this point. It seems like this check should be getting inlined and that it wasn’t is probably a bug. A PR has already landed in the stdsimd library to fix this, thanks to gnzlbg. It might take a while for that change to make it into stable Rust. In the interim there were a couple of other good suggestions on how I could minimise the overhead. One was to store the target feature detection result in an enum and just use a match to branch to the appropriate function. This was on the premise that the branch should get predicted and avoiding an indirect jump (function pointer) would allow the compiler to optimise better. This has certainly worked out and now my runtime target feature code performance is the same as the compile time version.
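As a rough sketch of that suggestion (hypothetical names, not the actual code from my branch), the detection result is stored once in an enum and each call site just matches on it, reusing the add functions from the runtime example above:

enum TargetFeature {
    Scalar,
    Sse2,
    Avx2,
}

impl TargetFeature {
    // do the relatively expensive runtime detection once, e.g. at startup
    fn detect() -> TargetFeature {
        if is_x86_feature_detected!("avx2") {
            TargetFeature::Avx2
        } else if is_x86_feature_detected!("sse2") {
            TargetFeature::Sse2
        } else {
            TargetFeature::Scalar
        }
    }
}

pub fn add(feature: TargetFeature, a: &[f32], b: &[f32], c: &mut [f32]) {
    // a predictable branch instead of calling check_for on every invocation
    match feature {
        TargetFeature::Avx2 => unsafe { add_avx2(a, b, c) },
        TargetFeature::Sse2 => unsafe { add_sse2(a, b, c) },
        TargetFeature::Scalar => add_scalar(a, b, c),
    }
}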

Wrappers and runtime feature selection

In my previous post I also talked about my SIMD wrapper. The purpose of this wrapper was to provide an interface that uses the widest registers available to it, the idea being that I write my ray-sphere collision test function once using my wrapped SIMD types. The main benefits are that I have a single implementation of my hit function using the wrapper type and that I get some nice ergonomics, like being able to use operators such as + or - on my wrapper types. Not to mention enforcing some type safety! For example, introducing a SIMD boolean type that is returned by comparison functions and is then passed into blend functions - making the implicit SSE conventions explicit through the type system.

That’s all very nice, but it only worked at compile time. The catch is that the runtime #[target_feature(enable)] attribute only works on functions. My compile time wrapper just imported a different module depending on what features were available, and my wrapper types had a lot of functions. On top of that, the hit function needed to know which version of the wrapper it needed to call.

I was interested in implementing a runtime option. My first attempt at SIMD was using the runtime target_feature method but I had dropped it to try and write a wrapper. I was skeptical that I would be able to get my wrapper working with runtime feature detection, at least not without things getting very complicated - but I thought I’d give it a try. Keep in mind, I haven’t completed this code, I just went a long way down this path before I moved on to writing separate scalar, SSE4.1 and AVX2 implementations of my hit function.

I figured one approach would be to make the hit function generic and then specify the SSE4.1 or AVX2 implementations as type parameters. I could then choose the appropriate branch at runtime. For this to work I needed to make traits for my wrapper types and all the trait methods needed to be unsafe, due to the target_feature requirement. Possibly these traits could have been simplified, but this is what I ended up with:

pub trait Bool32xN<T>: Sized + Copy + Clone + BitAnd + BitOr {
    unsafe fn num_lanes() -> usize;
    unsafe fn unwrap(self) -> T;
    unsafe fn to_mask(self) -> i32;
}

pub trait Int32xN<T, BI, B: Bool32xN<BI>>: Sized + Copy + Clone + Add + Sub + Mul {
    unsafe fn num_lanes() -> usize;
    unsafe fn unwrap(self) -> T;
    unsafe fn splat(i: i32) -> Self;
    unsafe fn load_aligned(a: &[i32]) -> Self;
    unsafe fn load_unaligned(a: &[i32]) -> Self;
    unsafe fn store_aligned(self, a: &mut [i32]);
    unsafe fn store_unaligned(self, a: &mut [i32]);
    unsafe fn indices() -> Self;
    unsafe fn blend(lhs: Self, rhs: Self, cond: B) -> Self;
}

pub trait Float32xN<T, BI, B: Bool32xN<BI>>: Sized + Copy + Clone + Add + Sub + Mul + Div {
    unsafe fn num_lanes() -> usize;
    unsafe fn unwrap(self) -> T;
    unsafe fn splat(s: f32) -> Self;
    unsafe fn from_x(v: Vec3) -> Self;
    unsafe fn from_y(v: Vec3) -> Self;
    unsafe fn from_z(v: Vec3) -> Self;
    unsafe fn load_aligned(a: &[f32]) -> Self;
    unsafe fn load_unaligned(a: &[f32]) -> Self;
    unsafe fn store_aligned(self, a: &mut [f32]);
    unsafe fn store_unaligned(self, a: &mut [f32]);
    unsafe fn sqrt(self) -> Self;
    unsafe fn hmin(self) -> f32;
    unsafe fn eq(self, rhs: Self) -> B;
    unsafe fn gt(self, rhs: Self) -> B;
    unsafe fn lt(self, rhs: Self) -> B;
    unsafe fn dot3(x0: Self, x1: Self, y0: Self, y1: Self, z0: Self, z1: Self) -> Self;
    unsafe fn blend(lhs: Self, rhs: Self, cond: B) -> Self;
}

The single letter type parameters are a bit cryptic; I was thinking about better names but was deferring that problem until later. The hit function signature also got quite complicated:

pub unsafe fn hit_simd<BI, FI, II, B, F, I>(
    &self,
    ray: &Ray,
    t_min: f32,
    t_max: f32,
) -> Option<(RayHit, u32)>
where
    B: Bool32xN<BI> + BitAnd<Output = B>,
    F: Float32xN<FI, BI, B> + Add<Output = F> + Sub<Output = F> + Mul<Output = F>,
    I: Int32xN<II, BI, B> + Add<Output = I> {
    // impl
}

Again with the cryptic type parameter names. These basically represent the wrapper types for f32, bool and i32 and the arch specific SIMD type, e.g. __m128 and __m256. This got pretty close to compiling, but at that point the operators didn’t compile because the operator traits defined in std::ops aren’t labelled as unsafe. I guess I could have wrapped an unsafe function call in the safe op trait implementation, but that seemed pretty unsafe. At that point I gave up and added an unwrapped AVX2 version of the hit function. It took about an hour, compared to the several hours I’d sunk into my generic approach.
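For what it’s worth, the kind of thing I was contemplating for the operators looked roughly like this (a sketch assuming x86_64, not code from my branch) - a safe Add impl that calls straight into a #[target_feature] function, which is only sound if the wrapper value can never be constructed on a CPU that lacks the feature:

use std::arch::x86_64::*;
use std::ops::Add;

#[derive(Clone, Copy)]
pub struct F32x4(__m128);

#[target_feature(enable = "sse2")]
unsafe fn add_m128(a: __m128, b: __m128) -> __m128 {
    _mm_add_ps(a, b)
}

impl Add for F32x4 {
    type Output = F32x4;
    fn add(self, rhs: F32x4) -> F32x4 {
        // a safe wrapper over an unsafe target_feature function; this relies
        // entirely on F32x4 only existing after a successful feature check
        unsafe { F32x4(add_m128(self.0, rhs.0)) }
    }
}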

There are obvious downsides to maintaining multiple implementations of the same function. Adding some tests would help mitigate that but if I end up writing more types of ray collisions or add support for another arch I’m obviously adding a lot more code and potential sources of (maybe duplicated) bugs. I did feel at this point things were getting too complicated for not much gain in the scheme of my little path tracing project.

I thought this story might be interesting to people wanting to use the runtime target feature selection in Rust. It is nice having one executable that can take advantage of the CPU features that it’s running on but it also means some constraints on how you author your code. The unfinished runtime wrapper code is here if you are curious.

Final performance results

Test results are from my laptop running Windows 10 Home. I’m compiling with cargo build --release of course, with rustc 1.28.0-nightly (71e87be38 2018-05-22). My performance numbers for each iteration of my path tracer are:

Feature Mrays/s
Aras’s C++ SSE4.1 version 45.5
Static SSE4.1 w/ LTO 45.8
Static AVX2 w/ LTO 52.1
Dynamic (AVX2) w/ LTO 52.1

Update 2018-06-23 Updated performance numbers for the static and dynamic versions of the code - thanks to caching of the target feature value the performance is about the same.

The static branch lives here and the dynamic branch here. My machine supports AVX2 so that’s what the dynamic version ends up using.

In summary, if you don’t know what CPU your code is going to run on you could get a nice little boost by checking target features at runtime at the loss of a bit of flexibility around how you structure your code. If you have a fixed hardware target then it’s better to compile for that hardware and avoid the overhead and code restrictions of target_feature.

As always if you have any comments you can reach me on Twitter at @bitshifternz or on /r/rust.

Optimising path tracing with SIMD

Following on from path tracing in parallel with Rayon I had a lot of other optimisations I wanted to try. In particular I wanted to see if I could match the CPU performance of @aras_p’s C++ path tracer in Rust. He’d done a fair amount of optimising so it seemed like a good target to aim for. To get a better comparison I copied his scene and also added his light sampling approach which he talks about here. I also implemented a live render loop mimicking his.

the final result

My initial unoptimized code was processing 10Mrays/s on my laptop. Aras’s code (with GPGPU disabled) was doing 45.5Mrays/s. I had a long way to go from here!

tl;dr did I match the C++ in Rust? Almost. My SSE4.1 version is doing 41.2Mrays/s, about 10% slower than the target of 45.5Mrays/s, running on Windows on my laptop. The long answer is more complicated but I will go into that later.

Overview

This is potentially going to turn into a long post, so here’s what I’m going to try and cover:

  • What is SIMD?
  • SIMD support in Rust
  • Converting Vec3 to SSE2
  • Floating point in Rust
  • Preparing data for SIMD
  • Getting the result out again
  • Aligning data
  • SIMD wrapper with AVX2 support
  • Final performance results

What is SIMD?

A big motivation for me doing this was to write some SIMD code. I knew what SIMD was but I’d never actually written any, or none of significance at least.

If you are unfamiliar with SIMD, it stands for Single Instruction Multiple Data. What this means is there are registers on most common CPUs which contain multiple values, e.g. 4 floats, that can be processed with a single instruction.

SIMD add

The image above shows adding two SIMD vectors which each contain 4 elements together in the single instruction. This increases the throughput of math heavy code. On Intel SSE2 has been available since 2001, providing 128 bit registers which can hold 4 floats or 2 doubles. More recent chips have AVX which added 256 bit registers and AVX512 which added 512 bit registers.

In this path tracer we’re performing collision detection of a ray against an array of spheres, one sphere at a time. With SIMD we could check 4 or even 8 spheres at a time.

There’s a few good references I’m using for this:

SIMD support in Rust

Rust SIMD support is available in the nightly compiler and should make it into a stable Rust release soon. There are a couple of related RFCs around SIMD support.

  • Target feature RFC#2045 adds the ability to:
    • determine which features (CPU instructions) are available at compile time
    • determine which features are available at run-time
    • embed code for different sets of features in the same binary
  • Stable SIMD RFC#2325 adds support for:
    • 128 and 256 bit vector types and intrinsics for Intel chipsets, e.g. SSE and AVX - other instruction sets will be added later
    • the ability to branch on target feature type at runtime so you can use a feature if it’s available on the CPU the program is running on

One thing that is not currently being stabilised is cross platform abstraction around different architecture’s SIMD instructions. I’m sticking with SSE for now as I want to get experience with a native instruction set rather than an abstraction.

Another thing worth mentioning is that the SIMD intrinsics are all labelled unsafe; using intrinsics is a bit like calling FFI functions, they aren’t playing by Rust’s rules and are thus unsafe. Optimised code often sacrifices a bit of safety and readability for speed, and this will be no exception.

Converting Vec3 to SSE2

Most of the advice I’ve heard around using SIMD is that just making your math types use it is not the way to get performance, and that you are better off just writing SIMD math code without wrappers. One reason is that a Vec3 only uses 3 of the available SIMD lanes, so even on SSE you’re only at 75% occupancy and you won’t get any benefit from larger registers. Another reason is that the components in a Vec3 are related, but values in SIMD vector lanes have no semantic relationship. In practice what this means is that doing operations across lanes, like a dot product, is cumbersome and not that efficient. See “The Awkward Vec4 Dot Product” slide on page 43 of this GDC presentation.
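To make that concrete, here’s roughly what a single dot product looks like when one Vec3 occupies one SSE register (a sketch using SSE3’s horizontal add; the names are mine, not from any of the libraries mentioned here):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// assumes x, y, z in the lower three lanes and 0.0 in the fourth
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse3")]
unsafe fn dot_vec3(a: __m128, b: __m128) -> f32 {
    let m = _mm_mul_ps(a, b);  // (ax*bx, ay*by, az*bz, 0)
    let t = _mm_hadd_ps(m, m); // (ax*bx + ay*by, az*bz + 0, ...)
    let s = _mm_hadd_ps(t, t); // every lane now holds the full sum
    _mm_cvtss_f32(s)           // extract lane 0
}

Two horizontal adds and an extract just to get one scalar back out - that’s the awkwardness the slide above is talking about.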

Given all of the above it’s not surprising that Aras blogged that he didn’t see much of a gain from converting his float3 struct to SIMD. It’s actually one of the last things I implemented. I followed the same post he did on “How to write a maths library in 2016”, except of course in Rust rather than C++. My SSE2 Vec3 can be found here. This actually gave me a pretty big boost, from 10Mrays/s to 20.7Mrays/s, without any other optimisations. That’s a large gain, so why did I see this when Aras’s C++ version only saw a slight change?

I think the answer has to do with my next topic.

Floating point in Rust

I think every game I’ve ever worked on has enabled the fast math compiler setting, -ffast-math on GCC/Clang or /fp:fast on MSVC. This setting sacrifices IEEE 754 conformance to allow the compiler to optimise floating point operations more aggressively. Agner Fog recently released a paper on NaN propagation which talks about the specifics of what -ffast-math actually enables. There is currently no equivalent of -ffast-math for the Rust compiler and that doesn’t look like something that will change any time soon. There’s a bit of discussion about how Rust might support it; there are currently intrinsics for fast float operations which you can use in nightly, but no plans for stable Rust. I found a bit of discussion on the Rust internals forums, like this post, but it seems like early days.
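For completeness, this is the kind of thing that is available on nightly today - a sketch only, since core_intrinsics is a perma-unstable feature and this isn’t something I’m using in the path tracer:

// nightly only: requires the unstable core_intrinsics feature
#![feature(core_intrinsics)]

use std::intrinsics::{fadd_fast, fmul_fast};

fn dot_fast(a: [f32; 3], b: [f32; 3]) -> f32 {
    // the fast variants let LLVM reassociate and contract these operations
    unsafe {
        fadd_fast(
            fadd_fast(fmul_fast(a[0], b[0]), fmul_fast(a[1], b[1])),
            fmul_fast(a[2], b[2]),
        )
    }
}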

My theory on why I saw a large jump going from precise IEEE 754 floats to SSE was in part due to the lack of -ffast-math optimisations. This is speculation on my part though, I haven’t done any digging to back it up.

A more well known floating point wart in Rust is the distinction between the PartialOrd and Ord traits. This distinction exists because floating point can be NaN, which means floats don’t have the total order required by Ord, only the partial order of PartialOrd. As a consequence you can’t use a lot of standard library functions like sort with floats. This is an ergonomics issue rather than a performance issue, but it is one I ran into working on this so I thought I’d mention it. I’ve seen the noisy_float crate recommended as a workaround for float ergonomics issues but I haven’t tried it.
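A tiny illustration of that limitation (not code from the path tracer):

fn main() {
    let mut ts = vec![0.5f32, 0.1, 0.9];
    // ts.sort(); // does not compile: f32 is PartialOrd but not Ord
    // the usual workaround is a comparator that decides what to do about NaN
    ts.sort_by(|a, b| a.partial_cmp(b).unwrap());
    assert_eq!(ts, vec![0.1, 0.5, 0.9]);
}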

Preparing data for SIMD

As I mentioned earlier, SSE2 should allow testing a ray against 4 spheres at a time, sounds simple enough. The original scalar hit test code looks like so:

impl Sphere {
    pub fn hit(&self, ray: &Ray, t_min: f32, t_max: f32) -> Option<RayHit> {
        let oc = ray.origin - self.centre;
        let a = dot(ray.direction, ray.direction);
        let b = dot(oc, ray.direction);
        let c = dot(oc, oc) - self.radius * self.radius;
        let discriminant = b * b - a * c;
        if discriminant > 0.0 {
            let discriminant_sqrt = discriminant.sqrt();
            let t = (-b - discriminant_sqrt) / a;
            if t < t_max && t > t_min {
                let point = ray.point_at_parameter(t);
                let normal = (point - self.centre) / self.radius;
                return Some(RayHit { t, point, normal });
            }
            let t = (-b + discriminant_sqrt) / a;
            if t < t_max && t > t_min {
                let point = ray.point_at_parameter(t);
                let normal = (point - self.centre) / self.radius;
                return Some(RayHit { t, point, normal });
            }
        }
        None
    }
}

Reasonably straightforward: the first five lines are simple math operations before some conditionals determine whether the ray hit or not. To convert this to SSE2, first we need to load the data into the SIMD registers, then do the math bit, and finally extract which one of the SIMD lanes contains the result we’re after (the nearest hit point).

First we need to splat the x, y, and z components of the ray origin and direction into SSE registers. Then we need to iterate over 4 spheres at a time, loading the x, y and z components of the spheres’ centre points into registers containing 4 x components, 4 y components and so on.

For the ray data we use the _mm_set_ps1 intrinsic to splat each ray origin and direction component across 4 lanes of an SSE register.

// load ray origin
let ro_x = _mm_set_ps1(ray.origin.get_x());
let ro_y = _mm_set_ps1(ray.origin.get_y());
let ro_z = _mm_set_ps1(ray.origin.get_z());
// load ray direction
let rd_x = _mm_set_ps1(ray.direction.get_x());
let rd_y = _mm_set_ps1(ray.direction.get_y());
let rd_z = _mm_set_ps1(ray.direction.get_z());

There are 3 ways that I’m aware of to load the sphere data for SSE:

  • _mm_set_ps which takes 4 f32 parameters
  • _mm_loadu_ps which takes an unaligned *const f32 pointer
  • _mm_load_ps which takes a 16 byte aligned *const f32 pointer

I did a little experiment to see what the assembly generated by these might look like (godbolt):

pub fn set_and_add(a: f32, b: f32, c: f32, d: f32, e: f32, f: f32, g: f32, h: f32) -> __m128 {
  unsafe {
    _mm_add_ps(
        _mm_set_ps(a, b, c, d),
        _mm_set_ps(e, f, g, h))
  }
}

pub fn load_unaligned_and_add(a: &[f32;4], b: &[f32;4]) -> __m128 {
  unsafe {
      _mm_add_ps(
          _mm_loadu_ps(a as *const f32),
          _mm_loadu_ps(b as *const f32))
  }
}

pub fn load_aligned_and_add(a: &[f32;4], b: &[f32;4]) -> __m128 {
  unsafe {
      _mm_add_ps(
          _mm_load_ps(a as *const f32),
          _mm_load_ps(b as *const f32))
  }
}

And the generated assembly:

playground::set_and_add:
  unpcklps	%xmm0, %xmm1
  unpcklps	%xmm2, %xmm3
  movlhps	%xmm1, %xmm3
  unpcklps	%xmm4, %xmm5
  unpcklps	%xmm6, %xmm7
  movlhps	%xmm5, %xmm7
  addps	%xmm3, %xmm7
  movaps	%xmm7, (%rdi)
  movq	%rdi, %rax
  retq

playground::load_unaligned_and_add:
  movups	(%rsi), %xmm0
  movups	(%rdx), %xmm1
  addps	%xmm0, %xmm1
  movaps	%xmm1, (%rdi)
  movq	%rdi, %rax
  retq

playground::load_aligned_and_add:
  movaps	(%rsi), %xmm0
  addps	(%rdx), %xmm0
  movaps	%xmm0, (%rdi)
  movq	%rdi, %rax
  retq

The first option has to do a lot of work to get data into registers before it can perform the addps. The last option appears to be doing the least work based on instruction count; the addps can take one of its operands as a memory address because it’s aligned correctly.

My initial implementation was an array of Sphere objects. The x values for 4 spheres are not in contiguous memory, so I would be forced to use _mm_set_ps to load the data. To use _mm_loadu_ps I need to store my Sphere data in Structure of Arrays (SoA) format. SoA means that instead of an array of individual Sphere objects I have a struct containing an array for each Sphere data member. My SpheresSoA struct looks like so:

pub struct SpheresSoA {
    centre_x: Vec<f32>,
    centre_y: Vec<f32>,
    centre_z: Vec<f32>,
    radius_sq: Vec<f32>,
    radius_inv: Vec<f32>,
    len: usize,
}

There are other advantages to SoA. In the above structure I’m storing precalculated values for radius * radius and 1.0 / radius, since if you look at the original hit function, radius on its own is never actually used. Before switching to SoA, adding precalculated values to my Sphere struct actually made things slower, presumably because I’d increased the size of the Sphere struct, reducing my CPU cache utilisation. With SpheresSoA, if there is no ray hit result then the radius_inv data will never be accessed, so it’s not wastefully pulled into CPU cache. Switching to SoA did speed up my scalar code too, although I didn’t note down the performance numbers at the time.

Now to iterate over four spheres at a time my loop looks like:

// loop over 4 spheres at a time
for (((centre_x, centre_y), centre_z), radius_sq) in self
  .centre_x
  .chunks(4)
  .zip(self.centre_y.chunks(4))
  .zip(self.centre_z.chunks(4))
  .zip(self.radius_sq.chunks(4))
{
  // load sphere centres
  let c_x = _mm_loadu_ps(centre_x.as_ptr());
  let c_y = _mm_loadu_ps(centre_y.as_ptr());
  let c_z = _mm_loadu_ps(centre_z.as_ptr());
  // load radius_sq
  let r_sq = _mm_loadu_ps(radius_sq.as_ptr());
  // ... everything else
}

Because I’m iterating in chunks of 4 I make sure the array sizes are a multiple of 4, adding some bogus spheres to the end to pad the array out if necessary. I did try using get_unchecked on each array instead of chunks(4) and zip but at the time it appeared to be slower. I might try it again though, as the iterator version is a bit obtuse and generates a lot of code. Note that I’m using _mm_loadu_ps because my Vec<f32> arrays are not 16 byte aligned. I’ll talk about that later.
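The padding itself is nothing fancy - something along these lines (a sketch of the idea, not the actual code in the repository): zero radius spheres can never produce a positive discriminant, so they never register a hit.

const CHUNK_SIZE: usize = 4;

fn pad_spheres(
    centre_x: &mut Vec<f32>,
    centre_y: &mut Vec<f32>,
    centre_z: &mut Vec<f32>,
    radius_sq: &mut Vec<f32>,
    radius_inv: &mut Vec<f32>,
) {
    while centre_x.len() % CHUNK_SIZE != 0 {
        // bogus sphere: zero radius means discriminant <= 0, so never hit
        centre_x.push(0.0);
        centre_y.push(0.0);
        centre_z.push(0.0);
        radius_sq.push(0.0);
        radius_inv.push(0.0);
    }
}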

Getting the result out again

I won’t paste the full SSE2 (actually I’m using a little SSE4.1) ray spheres intersection test source here because it’s quite verbose, but it can be found here if you’re interested. I mostly wrote this from scratch using the Intel Intrinsics Guide to match my scalar code to the SSE2 equivalent and using Aras’s post on SSE sphere collision and subsequent post on ryg’s optimisations as a guide. I think I had got pretty close to completion at the end of one evening but wasn’t sure how to extract the minimum hit result out of the SSE vector so just went for the scalar option:

// copy results out into scalar code to get return value (if any)
let mut hit_t_array = [t_max, t_max, t_max, t_max];
_mm_storeu_ps(hit_t_array.as_mut_ptr(), hit_t);
// this is more complicated than it needs to be because I couldn't
// just use `min` as f32 doesn't support the `Ord` trait
let (hit_t_lane, hit_t_scalar) = hit_t_array.iter().enumerate().fold(
    (self.len, t_max),
    |result, t| if *t.1 < result.1 { (t.0, *t.1) } else { result },
);
if hit_t_scalar < t_max {
  // get the index of the result in the SpheresSoA vectors and return it
}

All the above code is doing is finding the minimum value in the hit_t SIMD lanes to see if we have a valid hit. Aras’s code was able to do this in SSE2 with his h_min function (h standing for horizontal, meaning across the SIMD register lanes), but I wanted to understand how it was working before I used it. The main reason I didn’t understand h_min is I’d never used a shuffle before. I guess they’re familiar to people who write a lot of shaders since swizzling is pretty common, but I haven’t done a lot of that either.

My Rust hmin looks like:

v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _mm_shuffle!(0, 0, 3, 2)));
v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _mm_shuffle!(0, 0, 0, 1)));
_mm_cvtss_f32(v)

What this is doing is combining _mm_shuffle_ps and _mm_min_ps to compare all of the lanes with each other and shuffle the minimum value into lane 0 (the lowest 32 bits of the vector).

The shuffle is using a macro that encodes where lanes are being shuffled from, it’s a Rust implementation of the _MM_SHUFFLE macro found in xmmintrin.h:

macro_rules! _mm_shuffle {
    ($z:expr, $y:expr, $x:expr, $w:expr) => {
        ($z << 6) | ($y << 4) | ($x << 2) | $w
    };
}

So for example to reverse the order of the SIMD lanes you would do:

_mm_shuffle_ps(a, a, _mm_shuffle!(0, 1, 2, 3))

Which is saying move lane 0 to lane 3, lane 1 to lane 2, lane 2 to lane 1 and lane 3 to lane 0. Lane 3 is the leftmost lane and lane 0 is the rightmost, which seems backwards but I assume it’s due to x86 being little endian. As you can see _mm_shuffle_ps takes two operands, so you can interleave lanes from two values into the result, but I haven’t had a use for that so far.

So given that, if we put some values into hmin:

let a = _mm_loadu_ps([A, B, C, D]);
// a = [A, B, C, D]
let b = _mm_shuffle_ps(a, a, _mm_shuffle!(0, 0, 3, 2));
// b = [D, D, A, B];
let c = mm_min_ps(a, b);
// c = [min(A, D), min(B, D), min(C, A), min(D, B)];
let d = _mm_shuffle_ps(c, c, _mm_shuffle!(0, 0, 0, 1));
// d = [min(A, D), min(A, D), min(A, D), min(C, A)];
let e = _mm_min_ps(c, d);
// e = [min(A, D), min(A, B, D), min(A, C, D), min(A, B, C, D)]
let f = _mm_cvtss_f32(e);
// f = min(A, B, C, D)

The result in lane 0 has been compared with all of the values in the 4 lanes. I thought this was a nice example of the kind of problems that need to be solved when working with SIMD.

Aligning data

As I mentioned earlier, the best way that I’m aware of to load data into SSE registers is from an f32 array aligned to 16 bytes, so that you can use the _mm_load_ps intrinsic. Unfortunately Rust’s Vec doesn’t support custom allocators, so there’s no way to specify a specific alignment for my Vec<f32>.

Another way would be to store aligned arrays of floats that are the same size as our SIMD vector. I did this with a #[repr(align(16))] wrapper around a float array which looks like:

pub mod m128 {
    #[repr(C, align(16))]
    pub struct ArrayF32xN(pub [f32; 4]);
}

pub mod m256 {
    #[repr(C, align(32))]
    pub struct ArrayF32xN(pub [f32; 8]);
}

I actually just wanted to use #[repr(align(n))] because I implemented it in the compiler and I’d never had an excuse to use it. It would probably be easier to make a union of an f32 array and the SIMD vector type.
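The union alternative would look something like this (a sketch, not something I’m actually using in the path tracer):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// the union inherits the 16 byte alignment of __m128, so no repr(align) needed
#[cfg(target_arch = "x86_64")]
pub union ArrayOrM128 {
    pub simd: __m128,
    pub array: [f32; 4],
}

#[cfg(target_arch = "x86_64")]
fn example() -> f32 {
    let v = ArrayOrM128 { array: [1.0, 2.0, 3.0, 4.0] };
    // reading a union field is unsafe - you have to know which view is valid
    unsafe { _mm_cvtss_f32(v.simd) }
}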

In any case my SpheresSoA now looks like:

pub struct SpheresSoA {
    centre_x: Vec<ArrayF32xN>,
    centre_y: Vec<ArrayF32xN>,
    centre_z: Vec<ArrayF32xN>,
    radius_sq: Vec<ArrayF32xN>,
    radius_inv: Vec<ArrayF32xN>,
    num_chunks: u32,
    num_spheres: u32,
}

So I can remove the chunks(4) from my ray-spheres intersection loop as the spheres are already in ArrayF32xN sized chunks.
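Roughly, the earlier loop becomes something like this (a sketch using raw aligned loads; the actual code goes through the wrapper types described in the next section):

for (((centre_x, centre_y), centre_z), radius_sq) in self
    .centre_x
    .iter()
    .zip(self.centre_y.iter())
    .zip(self.centre_z.iter())
    .zip(self.radius_sq.iter())
{
    // each ArrayF32xN is 16 byte aligned so the aligned load can be used
    let c_x = _mm_load_ps(centre_x.0.as_ptr());
    let c_y = _mm_load_ps(centre_y.0.as_ptr());
    let c_z = _mm_load_ps(centre_z.0.as_ptr());
    let r_sq = _mm_load_ps(radius_sq.0.as_ptr());
    // ... everything else
}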

SIMD wrapper with AVX2 support

I had intended to write a wrapper later after getting some hands on experience with SSE intrinsics. One reason being that

let discr = nb * nb - c;

Is a lot easier to read and write than

let discr = _mm_sub_ps(_mm_mul_ps(nb, nb), c);

The other reason was I had noticed that most of the SSE code for the ray-spheres collision wasn’t specific to the number of lanes. So I thought if I wrote a wrapper then it wouldn’t be that difficult to add AVX support, which has 8 floats in a vector, but still have exactly the same collision code.

The way I implemented the wrapper was inside different modules for the 128 bit, 256 bit and 32 bit scalar fallback versions. Then I just use the one that is available for the target features enabled by the compiler. This means I’m using compile time target feature selection, not runtime target feature selection. I haven’t worked out exactly how I want to implement the runtime version yet.

In a very abridged version for example I have the following modules:

pub mod simd {
  #[cfg(target_feature = "sse2")]
  pub mod m128 {
    pub struct f32xN(pub __m128);
  }

  #[cfg(target_feature = "avx2")]
  pub mod m256 {
    pub struct f32xN(pub __m256);
  }

  pub mod m32 {
    pub struct f32xN(pub f32);
  }

  // re-export fallback scalar code
  #[cfg(not(any(target_feature = "sse2", target_feature = "avx2")))]
  pub use self::m32::*;

  // re-export sse2 if no avx2
  #[cfg(all(target_feature = "sse2", not(target_feature = "avx2")))]
  pub use self::m128::*;

  // re-export avx2
  #[cfg(target_feature = "avx2")]
  pub use self::m256::*;
}

So user code can use simd::f32xN and it will use the widest SIMD registers available to the compiler. I have a bunch of other types like ArrayF32xN, which I use to align data for loading into f32xN; there’s i32xN for integer SIMD and b32xN for boolean values. I haven’t wrapped every possible instruction, just what I was using for ray sphere intersections. I’m not using traits to define a common interface for the different bit sizes, although I possibly should.

My wrapper code can be found here and the ray sphere intersection test using it is here.

For x86_64 targets rustc enables SSE2 by default. There are a couple of different ways to enable other target features, one is RUSTFLAGS and the other is .cargo/config. The latter was most useful for me as it meant I didn’t need to either set flags globally or remember to set them on every invocation of the compiler. My .cargo/config to enable AVX2 looks like this:

[target.'cfg(any(target_arch = "x86", target_arch = "x86_64"))']
rustflags = ["-C", "target-feature=+avx2"]

Final performance results

I primarily tested performance on my laptop running Windows 10 Home. I’m compiling with cargo build --release of course, with rustc 1.28.0-nightly (71e87be38 2018-05-22). My performance numbers for each iteration of my path tracer are:

Feature Mrays/s
Aras’s C++ SSE4.1 version 45.5
My starting point 10.0
SSE2 Vec3 20.7
Plain SSE4.1 38.1
Wrapped SSE4.1 38.7
Wrapped AVX2 41.9
Wrapped SSE4.1 w/ LTO 41.2
Wrapped AVX2 w/ LTO 45.0

The difference between the ‘Plain SSE4.1’ and ‘Wrapped SSE4.1’ results is because I added the hmin changes and aligned the SpheresSoA data at the same time as I added the SIMD wrappers. LTO stands for link time optimisation. It did give a nice increase in performance, but at a large increase in build time.

I do most of my development on Linux and was confused for a long time because my Linux performance was much higher than what I was seeing in Windows on the same laptop. At the same time the Linux peaks were higher but it fluctuated a lot as the path tracer ran. The Windows numbers were lower but stable. This difference is very likely due to CPU frequency scaling differences between Linux and Windows. It does make benchmarking on my Linux laptop a bit unreliable though.

I did occasionally test on a desktop machine too, which gave the following results:

Feature Mrays/s
Aras’s C++ SSE4.1 version 47.7
Wrapped SSE4.1 43.0
Wrapped AVX2 47.0
Wrapped SSE4.1 w/ LTO 45.8
Wrapped AVX2 w/ LTO 51.0

So in this case my AVX2 version was faster, which is nice but not really an apples versus apples comparison.

So what differences are remaining? A couple of things stood out in my profiling:

The C++ compiler appeared to be linking in an SSE2 implementation of sin and cos, while Rust is using the f64 sin and cos in the MSVC toolchain version of the compiler.

Rayon had higher overhead than Enki (the C thread pool Aras is using). There were some pretty deep callstacks in Rayon and the hot function that was showing up in VTune was pretty large. Enki by comparison is pretty simple.

Addressing these things might get me a bit closer.

It’s also a bit surprising that AVX2 didn’t make a huge difference, it seemed to be bottlenecked on performing square roots.

If you made it this far, well done and thanks for reading! I covered a lot in this post but at the same time I left out a lot of details. If you want to know more take a look at my spheres_simd_wrapped branch, which has my most optimised code, and feel free to ask questions or give feedback via @bitshifternz on Twitter or on my /r/rust post on Reddit.

Path tracing in parallel with Rayon

The path tracer I talked about in my previous post runs on one core, but my laptop’s CPU has 4 physical cores. That seems like an easy way to make this thing faster right? There’s a Rust library called Rayon which provides parallel iterators to divide your data into tasks and run it across multiple threads.

One of the properties of Rust’s type system is it detects shared memory data races at compile time. This property is a product of Rust’s ownership model which does not allow shared mutable state. You can read more about this in the Fearless Concurrency chapter of the Rust Book or for a more formal analysis Securing the Foundations of the Rust Programming Language. As a consequence of this, Rayon’s API also guarantees data-race freedom.

Iterators

My initial Rust implementation of the main loop was pretty similar to the original C++. I’m reserving space in a vector and pushing RGB values into it for each pixel.

let mut buffer: Vec<u8> = Vec::with_capacity((nx * ny * channels) as usize);
for j in 0..ny { // loop through rows
  for i in 0..nx { // loop through columns
    let mut col = vec3(0.0, 0.0, 0.0);
    for _ in 0..ns { // generate ns samples for each pixel
      let u = (i as f32 + rng.next_f32()) / nx as f32;
      let v = ((ny - j - 1) as f32 + rng.next_f32()) / ny as f32;
      let ray = camera.get_ray(u, v, &mut rng);
      col += scene.ray_trace(&ray, 0, &mut rng);
    }
    col /= ns as f32;
    // push each colour channel to the output buffer
    buffer.push((255.99 * col.x.sqrt()) as u8);
    buffer.push((255.99 * col.y.sqrt()) as u8);
    buffer.push((255.99 * col.z.sqrt()) as u8);
  }
}

The above code is looping through all j rows and all i columns and ray casting ns samples for each pixel to produce a final colour, which is then pushed into the output buffer. This takes about 39 seconds to process on my laptop.

I don’t strictly need to keep this structure to use Rayon, but processing each row in parallel sounds like an OK starting point.

If I was to parallelise this in C++ I would make sure the output buffer is pre-created (not just reserved) since I know exactly how big it is beforehand. In C++ I would determine the output address from the i and j values and just write there; since I know each thread will be writing to a different location there are no data races, assuming I don’t make any mistakes. To do the same in Rust would require each thread to have a mutable reference to the output buffer, but multiple mutable references are not allowed in safe Rust. You can achieve this in unsafe Rust, and that’s what the iterators are doing behind the scenes, presented as a safe API for the programmer. Coming from C++ I’m not used to Rust’s heavier use of iterators and it usually takes me a while to translate my loops into idiomatic Rust.

There is a base Iterator trait which all Iterators are built from:

trait Iterator {
    type Item;
    fn next(&mut self) -> Option<Self::Item>;
}

There’s not much to it - just one method which returns an Option type. There are more sophisticated traits built on top of that to create quite a rich variety of iterators. You can read more about iterators in the Processing a Series of Items with Iterators chapter of the Rust book.

Back to my loop. I want to iterate over my output buffer a row at a time, generating a number of samples per pixel and finally writing the RGB values to the output. The slice method chunks_mut returns an iterator over the slice in non-overlapping mutable chunks; using this should give me my rows of pixels. I also want to process rows in reverse so the image is the right way up; the rev method returns a reverse iterator. I need to know which row I’m processing, so I use enumerate, which yields a tuple of the index of the row being iterated and the row itself. I’m doing this in a for_each which takes a closure. Inside the closure I’m using chunks_mut chained with enumerate again to iterate over each RGB pixel in the row, and for each pixel I’m ray tracing ns samples, the same as the original loop. Finally I get a mutable iterator over the pixel’s channels and write out each one. Phew!

buffer
  .chunks_mut((nx * channels) as usize) // iterate each row
  .rev() // iterate in reverse (otherwise image is upside down)
  .enumerate() // generate an index for each row we're iterating
  .for_each(|(j, row)| {
    for (i, rgb) in row.chunks_mut(channels as usize).enumerate() {
      let mut col = vec3(0.0, 0.0, 0.0);
      for _ in 0..ns {
        let u = (i as f32 + rng.next_f32()) / nx as f32;
        let v = (j as f32 + rng.next_f32()) / ny as f32;
        let ray = camera.get_ray(u, v, &mut rng);
        col += scene.ray_trace(&ray, 0, &mut rng);
      }
      col /= ns as f32;
      let mut iter = rgb.iter_mut();
      *iter.next().unwrap() = (255.99 * col.x.sqrt()) as u8;
      *iter.next().unwrap() = (255.99 * col.y.sqrt()) as u8;
      *iter.next().unwrap() = (255.99 * col.z.sqrt()) as u8;
    }
  });

It took me quite a while to work out the right combination of iterators to produce a nested loop that behaved similarly to the original. The iterator version of the code runs in ~37 seconds, very slightly faster than the original.

Now that I have this, calculating the rows in parallel is simply a case of changing the first chunks_mut to par_chunks_mut and…
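For reference, the one line change looks like this (Rayon’s prelude provides par_chunks_mut on slices; this is also the change that triggers the compile errors in the next section until the data races are fixed):

use rayon::prelude::*;

buffer
  .par_chunks_mut((nx * channels) as usize) // hand each row to the thread pool
  .rev()
  .enumerate()
  .for_each(|(j, row)| {
    // ... per-pixel loop exactly as in the sequential version above
  });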

Data races

As I mentioned earlier, Rust detects data races at compile time and there were a couple in my code, this is the abridged version:

error[E0387]: cannot borrow data mutably in a captured outer variable in an `Fn` closure
  --> src/main.rs:95:41
   |
95 |                     let u = (i as f32 + rng.next_f32()) / nx as f32;
   |                                         ^^^
   |

error[E0387]: cannot borrow data mutably in a captured outer variable in an `Fn` closure
  --> src/main.rs:99:28
   |
99 |                     col += scene.ray_trace(&ray, 0, &mut rng);
   |                            ^^^^^
   |

I had added a counter to the Scene struct which is updated in calls to ray_trace, so this method took &mut self. I changed the counter to a std::sync::atomic::AtomicUsize, which can be updated through a shared reference. The other race was the rng, which was captured in the for_each closure of the main loop; I just construct a new rng for each row instead.
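Roughly what the counter fix looks like (a minimal sketch - the real Scene has a lot more in it, and the rng fix is just constructing a new rng inside the for_each closure):

use std::sync::atomic::{AtomicUsize, Ordering};

pub struct Scene {
    // previously a plain integer, which forced ray_trace to take &mut self
    ray_count: AtomicUsize,
    // ... spheres, materials, etc.
}

impl Scene {
    pub fn ray_trace(&self /* , ray, depth, rng, ... */) {
        // an atomic increment works through a shared reference, so &self is enough
        self.ray_count.fetch_add(1, Ordering::Relaxed);
        // ... actual tracing elided
    }
}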

Once these issues were fixed it compiled and ran in ~10 seconds, nearly a 4x speed up, which is what I would hope for on a 4 core machine. Although I do have 8 logical cores, my understanding is I probably wouldn’t get any extra speed out of these; unfortunately my BIOS doesn’t have an option to disable hyper threading.

After changing my loop to use iterators it was a one line change to make the main loop run in parallel using Rayon, and thanks to Rust all the data races in my code failed to compile. That is much easier to deal with than data races happening at runtime.

I still have a lot of optimisations I want to try, primarily making my spheres SOA and trying out Rust’s SIMD support.

The code for this post can be found on github.

If you have any feedback you can comment on the /r/rust or twitter threads.

Ray Tracing in a Weekend in Rust

I was inspired to work through Peter Shirley’s Ray Tracing in a Weekend mini book (for brevity RTIAW) but I wanted to write it in Rust instead of the C++ that’s used in the book. I found out about the book via @aras_p’s blog series about a toy path tracer he’s been building.

This post will describe how I went about translating a C++ project to Rust, so it’s really intended to be an introduction to Rust for C++ programmers. I will introduce some of the Rust features I used and how they compare to both the C++ used in RTIAW’s code and more “Modern” C++ features that are similar to Rust. I probably won’t talk about ray tracing much at all so if you are interested in learning about that I recommend reading Peter’s book!

Additionally, neither the book’s C++ nor my Rust is optimized code; Aras’s blog series covers a lot of different optimizations he’s performed, and I have not done that yet. My Rust implementation does appear to perform faster than the C++ (~40 seconds compared to ~90 seconds for a similar sized scene). I have not investigated why this is the case, but I have some ideas which will be covered later. I mostly wanted to check that my code was in the same ball park, and it certainly seems to be.

Materials

RTIAW introduces three materials, Lambertian, Metal and Dielectric. These materials implement a common interface in the C++ code:

class material  {
  public:
    virtual bool scatter(
      const ray& r_in,
      const hit_record& rec,
      vec3& attenuation,
      ray& scattered) const = 0;
};

class metal : public material {
  public:
    metal(const vec3& a, float f);
    virtual bool scatter(
      const ray& r_in,
      const hit_record& rec,
      vec3& attenuation,
      ray& scattered) const;
    vec3 albedo;
    float fuzz;
};

Rust doesn’t have classes; it’s not strictly speaking an OOP language (see Is Rust an OOP Language?). That doesn’t mean you can’t achieve some of the useful things OOP provides, like encapsulation and polymorphism. There are a couple of approaches to translating this interface to Rust. Rust traits are a bit like an abstract interface, although they aren’t attached to a specific type; types implement the traits. So for example we could define a Material trait:

pub trait Material {
  fn scatter(
    &self,
    r_in: &Ray,
    rec: &HitRecord,
    attenuation: &mut Vec3,
    scattered: &mut Ray)
  -> bool;
}

struct Metal {
  albedo: Vec3,
  fuzz: f32,
}

impl Material for Metal {
  fn scatter(
    &self,
    r_in: &Ray,
    rec: &HitRecord,
    attenuation: &mut Vec3,
    scattered: &mut Ray
  ) -> bool {
    // do stuff
  }
}

Note that in Rust struct data is declared separately from its methods (the impl) and the trait implementation is separate again. Personally I like this separation of data from implementation; I think it makes it easier to focus on what the data is. The first method parameter is &self; Rust uses an explicit self instead of the implicit this used in C++ method calls. Variables are immutable by default, so our output variables here are declared as mutable references with &mut.

That ends up feeling pretty similar to the C++ code. In the RTIAW code the sphere object owns the material. This means the material is heap allocated: each concrete material type could be a different size, so the easy approach is to heap allocate the object and store a pointer to it. This is also true in Rust; if I wanted my Sphere to own a Material object I would need to store a Box<Material> on the Sphere. You can think of a box as similar to std::unique_ptr in C++.
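For comparison, the owning version that mirrors the C++ would look something like this (a sketch of what I chose not to do, assuming the Vec3 type and Material trait from above):

pub struct Sphere {
    centre: Vec3,
    radius: f32,
    // heap allocated, dynamically dispatched - analogous to the C++ material pointer
    material: Box<Material>,
}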

Since there are a small number of materials and the data size of the different types of materials is not large, I decided to implement these with Rust enums instead. Using enums has the advantage that their size is known at compile time, so we can store them by value instead of by pointer and avoid some allocations and indirection. My enum looks like this:

#[derive(Clone, Copy)]
pub enum Material {
  Lambertian { albedo: Vec3 },
  Metal { albedo: Vec3, fuzz: f32 },
  Dielectric { ref_idx: f32 },
}

Each enum variant contains data fields. I’ve named my fields for clarity but you don’t have to. Rust enums are effectively tagged unions. C++ unions are untagged and have restrictions on the data you can store in them. “Tagged” just means storing some kind of type identifier. The #[derive(Clone, Copy)] just tells Rust that this enum is trivially copyable, e.g. OK to memcpy under the hood. To implement the scatter method we pattern match on the material enum:

impl Material {
  fn scatter(&self, ray: &Ray, ray_hit: &RayHit, rng: &mut Rng)
  -> Option<(Vec3, Ray)> {
    match *self {
      Material::Lambertian { albedo } => {
        // lambertian implementation
      }
      Material::Metal { albedo, fuzz } => {
        // metal implementation
      }
      Material::Dielectric { ref_idx } => {
        // dielectric implementation
      }
    }
  }
}

The match statement in Rust is like C/C++ switch on steroids. I’m not doing anything particularly fancy in this match, one thing I am doing though is destructuring the different enum variants to access their fields, which I then use in the specific implementation for each material.

It’s also worth talking about the return type here. The RTIAW C++ scatter interface returns a bool indicating whether the material scattered the ray and returns attenuation and scattered via reference parameters. This API does leave the question: what are these return parameters set to when scatter returns false? The RTIAW implementation only uses these values if scatter returns true, but in the case of the metal material the scattered ray is calculated regardless. To avoid any ambiguity, I’m returning these values as Option<(Vec3, Ray)>. There are a couple of things going on here. First, (Vec3, Ray) is a tuple; I was too lazy to make a dedicated struct for this return type and tuples are pretty easy to work with. The Option type is an optional value; it can either contain Some value or None if it does not.

This scatter call and its return value are handled like so:

if let Some((attenuation, scattered)) =
  ray_hit.material.scatter(ray_in, &ray_hit, rng)
{
  // do stuff
}

The if let syntax is a convenient way to perform pattern matching when you only care about one value, in this case the Some. Destructuring is being used again here to access the contents of the tuple returned in the Option.

C++ does have support for both tuple in C++11 and optional in C++17 so I’ve written something somewhat equivalent to the Rust version using C++17 below. I find the Rust a lot more ergonomic and readable, not to mention there are no exceptions to worry about.

#include <optional>
#include <tuple>
#include <variant>

using std::optional;
using std::tuple;
using std::variant;

struct Vec3 { float x; float y; float z; };
struct Ray { Vec3 origin; Vec3 direction; };
struct RayHit;

struct Lambertian {
  Vec3 albedo;
  optional<tuple<Vec3, Ray>> scatter(
    const Ray&, const RayHit&) const;
};

struct Metal {
  Vec3 albedo;
  float fuzz;
  optional<tuple<Vec3, Ray>> scatter(
    const Ray&, const RayHit&) const;
};

struct Dielectric {
  float ref_idx;
  optional<tuple<Vec3, Ray>> scatter(
    const Ray&, const RayHit&) const;
};

typedef variant<Lambertian, Metal, Dielectric> Material;

optional<tuple<Vec3, Ray>> scatter(
    const Material & mat, const Ray& ray, const RayHit & hit) {
  if (auto p = std::get_if<Lambertian>(&mat)) {
    return p->scatter(ray, hit);
  }
  else if (auto p = std::get_if<Metal>(&mat)) {
    return p->scatter(ray, hit);
  }
  else if (auto p = std::get_if<Dielectric>(&mat)) {
    return p->scatter(ray, hit);
  }
  return {};
}

// dummy function declaration to prevent dead code removal
void dummy(const Vec3&, const Ray&);

// dummy function to call the scatter code
void test(const Ray& ray, const RayHit& hit, const Material& mat) {
  if (auto result = scatter(mat, ray, hit)) {
    const auto & attenuation = std::get<0>(*result);
    const auto & scattered = std::get<1>(*result);
    dummy(attenuation, scattered);
  }
}

Hitables

RTIAW introduces a ray collision result structure hit_record and a hitable abstract interface which is implemented for sphere in the book with the intention of adding other objects later. The C++ code looks like so:

class material;

struct hit_record {
  float t;  
  vec3 p;
  vec3 normal; 
  material *mat_ptr;
};

class hitable  {
  public:
    virtual bool hit(
      const ray& r,
      float t_min,
      float t_max,
      hit_record& rec) const = 0;
};

In this instance, since we only ever deal with spheres, I didn’t bother creating a Hitable trait and just added a hit method to my Sphere type. This means my spheres can be stored in contiguous memory, unlike the C++ code where each sphere is stored as a hitable pointer, which is heap allocated. This probably explains the performance difference I saw in my Rust version - there will be fewer cache misses. Not that my Sphere implementation is particularly efficient; it contains data like the material which wouldn’t be used most of the time, so a future optimization would be to split the sphere data into a structure of arrays for better cache utilisation and SIMD usage.

I name my Rust implementation of hit_record RayHit:

struct RayHit {
  t: f32,
  point: Vec3,
  normal: Vec3,
  material: Material,
}

One difference here is the way the material is stored. The C++ version stores a pointer to the material of the sphere that was hit. This is not so simple in Rust due to Rust’s ownership system. To achieve something similar to the pointer in Rust we would have to use a reference which immutably “borrows” the original data. Since the RayHit structure is short lived, it would be possible to make it borrow the material from the sphere that has been hit, however to do this we would need to annotate the lifetime relationship so that the Rust compiler knows that everything is OK. In this case I was lazy and just copied the material into the RayHit struct. It might not be the most efficient solution but the materials aren’t that large. For the purposes of this post it might have been more interesting to annotate the lifetime of the material borrow though. Perhaps I will go into this in a subsequent post.
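For reference, the borrowing version would look something like this (a sketch - note the lifetime parameter tying the RayHit to the sphere data it borrows from):

struct RayHit<'a> {
  t: f32,
  point: Vec3,
  normal: Vec3,
  // borrows the hit sphere's material rather than copying it
  material: &'a Material,
}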

Summing Up

These seemed like some of the more interesting differences between the C++ version and my Rust implementation. There are of course other interesting things but I think this post has got quite long enough. My Rust implementation can be found here and the book’s C++ version here.

Hopefully at some point I will find some time to add some more features to this path tracer and to start on some optimization work with Rayon and SIMD.

I haven’t got around to setting up a comment system on my blog so if you have any comments or questions hit me up on twitter at @bitshifternz.

Microbenching SIMD in Rust

Today I read Hugo Tunius’ blog post Exploring SIMD on Rust, in which after some experimentation he didn’t get the performance boost he expected to see from SIMD. I’ve also been meaning to have more of a play with SIMD so I thought I’d take a look at his git repo and see if I can work out what’s going on.

Hugo mentioned he was having trouble with Bencher, so let’s start there. Running cargo bench gave these results

running 4 tests
test bench_f32            ... bench:         151 ns/iter (+/- 2)
test bench_f32_sse        ... bench:     162,004 ns/iter (+/- 212)
test bench_f32_sse_inline ... bench:         150 ns/iter (+/- 11)
test bench_f64            ... bench:         150 ns/iter (+/- 3)

Something is very wrong here. One SSE benchmark is orders of magnitude slower than all of the other benchmarks. That doesn’t make much sense - time to look at some assembly.

To generate assembly via cargo you can specify RUSTFLAGS=--emit asm before running cargo bench. Since we’re only interested in the assembly output for now I’m running

RUSTFLAGS="--emit asm" cargo bench --no-run

This generates some AT&T style assembly output to .s files in the target/release/deps directory. Ideally I’d prefer Intel format with demangled symbols but I don’t think rustc --emit gives any control over this.

Comparing the assembly from bench_f32 and bench_f32_sse there are some clear differences. All of the work happens between the bencher functions that record the start and end times - the mangled calls to _ZN3std4time7Instant3now17he141c6f08d993cf9E@PLT and _ZN3std4time7Instant7elapsed17hc3711b876336edbcE@PLT. This is the assembly for bench_f32

	movq	%rdi, %r14
	callq	_ZN3std4time7Instant3now17he141c6f08d993cf9E@PLT
	movq	%rax, 8(%rsp)
	movq	%rdx, 16(%rsp)
	movq	(%r14), %rbx
	testq	%rbx, %rbx
	je	.LBB15_2
	.p2align	4, 0x90
.LBB15_1:
	callq	_ZN7vector314num_iterations17h62f9e330173945acE
	decq	%rbx
	jne	.LBB15_1
.LBB15_2:
	leaq	8(%rsp), %rdi
	callq	_ZN3std4time7Instant7elapsed17hc3711b876336edbcE@PLT
	movq	%rax, 8(%r14)
	movl	%edx, 16(%r14)

And the assembly of bench_f32_sse

	movq	%rdi, %r14
	callq	_ZN3std4time7Instant3now17he141c6f08d993cf9E@PLT
	movq	%rax, 8(%rsp)
	movq	%rdx, 16(%rsp)
	movq	(%r14), %r15
	testq	%r15, %r15
	je	.LBB16_4
	xorl	%r12d, %r12d
	.p2align	4, 0x90
.LBB16_2:
	incq	%r12
	callq	_ZN7vector314num_iterations17h62f9e330173945acE
	movq	%rax, %rbx
	testq	%rbx, %rbx
	je	.LBB16_3
	.p2align	4, 0x90
.LBB16_5:
	movaps	.LCPI16_0(%rip), %xmm0
	movaps	.LCPI16_1(%rip), %xmm1
	callq	_ZN17vector_benchmarks7dot_sse17hc439918e0ab4bdcfE@PLT
	decq	%rbx
	jne	.LBB16_5
.LBB16_3:
	cmpq	%r15, %r12
	jne	.LBB16_2
.LBB16_4:
	leaq	8(%rsp), %rdi
	callq	_ZN3std4time7Instant7elapsed17hc3711b876336edbcE@PLT
	movq	%rax, 8(%r14)
	movl	%edx, 16(%r14)

Without going through everything that’s going on in those listings, one is obviously a lot longer than the other, but why? In the first listing there should be a bunch of math happening between the two timing calls to calculate a dot product, but there is only a call to num_iterations. The second listing has a call to the mangled function name for dot_sse. The compiler seems to be optimizing away all of the bench_f32* code except for the function call to dot_sse.

Let’s look at the Rust code for bench_f32

fn bench_f32(b: &mut Bencher) {
    b.iter(|| {
               let a: Vector3<f32> = Vector3::new(23.2, 39.1, 21.0);
               let b: Vector3<f32> = Vector3::new(-5.2, 0.1, 13.4);

               (0..num_iterations()).fold(0.0, |acc, i| acc + a.dot(&b));
           });
}

The problem here is the perennial problem with micro-benchmarking suites - is my code actually being run? What is happening is that the result of the fold call is discarded. Rust returns the value of the last expression if it’s not followed by a semi-colon. Because the fold ends with a semi-colon, the closure returns nothing - well, () to be precise. To fix this and stop the optimizer from very rightly removing unnecessary work, we need to either add an explicit return statement or remove the semi-colon following the fold.
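For example, here is bench_f32 with the semi-colon removed so that the fold result becomes the closure’s return value (the other benchmarks get the same treatment):

fn bench_f32(b: &mut Bencher) {
    b.iter(|| {
               let a: Vector3<f32> = Vector3::new(23.2, 39.1, 21.0);
               let b: Vector3<f32> = Vector3::new(-5.2, 0.1, 13.4);

               // no trailing semi-colon: the folded sum is returned to Bencher
               (0..num_iterations()).fold(0.0, |acc, _i| acc + a.dot(&b))
           });
}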

Returning from all of the fold calls results in a compile error

LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse41.dpps

The compiler is now trying to generate code for bench_f32_sse_inline but the intrinsic is not available. To make sure SSE4.1 is available I created a .cargo/config file containing the following

[target.'cfg(any(target_arch = "x86", target_arch = "x86_64"))']
rustflags = ["-C", "target-cpu=native", "-C", "target-feature=+sse4.1"]

I have no idea if this is the “correct” way to enable SSE4.1, but it compiles.

The bench results now look like

running 4 tests
test bench_f32            ... bench:      90,383 ns/iter (+/- 2,185)
test bench_f32_sse        ... bench:     486,344 ns/iter (+/- 12,059)
test bench_f32_sse_inline ... bench:      89,788 ns/iter (+/- 3,894)
test bench_f64            ... bench:      89,237 ns/iter (+/- 1,299)

Again examining the assembly output, bench_f32_sse is still slower because dot_sse is not getting inlined. Let’s add #[inline(always)] on all the dot* functions just to be sure.

running 4 tests
test bench_f32            ... bench:      89,652 ns/iter (+/- 1,113)
test bench_f32_sse        ... bench:      89,843 ns/iter (+/- 1,160)
test bench_f32_sse_inline ... bench:      89,648 ns/iter (+/- 1,428)
test bench_f64            ... bench:      89,794 ns/iter (+/- 1,311)

Now we’re getting consistent looking results, but the SSE functions are still not noticeably faster than the scalar code. Time to reexamine the bench_f32_sse assembly yet again. So it’s definitely inlined now and we’re seeing our vector dot product

vmovaps	.LCPI16_0(%rip), %xmm0
vdpps	$113, .LCPI16_1(%rip), %xmm0, %xmm1

But below our dot product there’s a lot of vaddss calls

.LBB16_2:
	addq	$1, %rbx
	callq	_ZN7vector314num_iterations17h62f9e330173945acE
	testq	%rax, %rax
	je	.LBB16_3
	leaq	-1(%rax), %rcx
	movq	%rax, %rsi
	vxorps	%xmm0, %xmm0, %xmm0
	xorl	%edx, %edx
	andq	$7, %rsi
	je	.LBB16_5
	vmovaps	16(%rsp), %xmm1
	.p2align	4, 0x90
.LBB16_7:
	addq	$1, %rdx
	vaddss	%xmm0, %xmm1, %xmm0
	cmpq	%rdx, %rsi
	jne	.LBB16_7
	jmp	.LBB16_8
	.p2align	4, 0x90
.LBB16_3:
	vxorps	%xmm0, %xmm0, %xmm0
	jmp	.LBB16_11
	.p2align	4, 0x90
.LBB16_5:
	vmovaps	16(%rsp), %xmm1
.LBB16_8:
	cmpq	$7, %rcx
	jb	.LBB16_11
	subq	%rdx, %rax
	.p2align	4, 0x90
.LBB16_10:
	vaddss	%xmm0, %xmm1, %xmm0
	vaddss	%xmm0, %xmm1, %xmm0
	vaddss	%xmm0, %xmm1, %xmm0
	vaddss	%xmm0, %xmm1, %xmm0
	vaddss	%xmm0, %xmm1, %xmm0
	vaddss	%xmm0, %xmm1, %xmm0
	vaddss	%xmm0, %xmm1, %xmm0
	vaddss	%xmm0, %xmm1, %xmm0
	addq	$-8, %rax
	jne	.LBB16_10

So what’s happened here? Yet again the optimizer is being clever. It realizes that the dot product calculation is invariant, so it’s moved it out of the loop. The fold now consists of a loop summing num_iterations of the dot product. Even this has been loop unrolled by the compiler, which is why there are so many vaddss.

In Conclusion

Micro-benchmarks are hard. Even when they’re testing the right thing they can still be wrong due to other influences like other software on the system or different CPU pressures than when the code is run in a real program.

In this case we found that the code that was expected to be benchmarked wasn’t being run at all in most cases: firstly due to the small omission of a return value and a clever optimizing compiler. When the return was fixed, the dot code was only executed once rather than the expected num_iterations times, because again the compiler was smart enough to optimize away the loop invariant.

We still haven’t answered the question is SSE faster but we have at least determined why it doesn’t appear to be in this benchmark. The way I’d go about that would be to generate a Vec of data to dot product and fold on that, rather than an invariant.
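That might look something like the sketch below, reusing the Vector3 type and num_iterations function from the repository (the input values are arbitrary):

fn bench_f32_vec(b: &mut Bencher) {
    // build input data outside the timed closure so the dot products can't be
    // hoisted out of the loop as an invariant
    let data: Vec<(Vector3<f32>, Vector3<f32>)> = (0..num_iterations())
        .map(|i| {
            let f = i as f32;
            (Vector3::new(f, f + 1.0, f + 2.0), Vector3::new(-f, 0.1, f * 2.0))
        })
        .collect();

    b.iter(|| data.iter().fold(0.0, |acc, v| acc + v.0.dot(&v.1)));
}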