Optimising path tracing: the last 10%
In my last post on optimising my Rust path tracer with SIMD I had got to within 10% of my performance target, namely Aras’s C++ SSE4.1 path tracer. From profiling I had determined that the main remaining differences were MSVC using SSE versions of sinf and cosf and differences between the Rayon and enkiTS thread pools. The first thing I tried was implementing an SSE2 version of sin_cos based on Julien Pommier’s code, which I found via a bit of googling. This was enough to get my SSE4.1 implementation to match the performance of Aras’s SSE4.1 code. I have a slight advantage in that I call sin_cos as a single function rather than separate sin and cos functions, but meh, I’m calling my performance target reached. Final performance results are at the end of this post if you just want to skip to them.
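To give a sense of the interface difference, here’s a minimal sketch of the shape of such a function. The name sin_cos_ps and the scalar body are placeholders for illustration only; the real version ports Pommier’s polynomial evaluation to __m128 lanes.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Hypothetical interface shape only: one call returns both sine and cosine,
// so the argument range reduction is shared instead of being done twice.
// The body is a scalar stand-in to keep the sketch runnable; the real SSE2
// version evaluates Pommier's polynomials directly on the __m128 lanes.
#[cfg(target_arch = "x86_64")]
pub unsafe fn sin_cos_ps(x: __m128) -> (__m128, __m128) {
    let mut input = [0.0f32; 4];
    _mm_storeu_ps(input.as_mut_ptr(), x);
    let mut s = [0.0f32; 4];
    let mut c = [0.0f32; 4];
    for i in 0..4 {
        let (si, ci) = input[i].sin_cos();
        s[i] = si;
        c[i] = ci;
    }
    (_mm_loadu_ps(s.as_ptr()), _mm_loadu_ps(c.as_ptr()))
}
```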
The other part of this post is about Rust’s runtime and compile time CPU feature detection and some wrong turns I took along the way.
Target feature selection
As I mentioned in my previous post, Rust is about to stabilise several SIMD related features in 1.27. Apart from the SIMD intrinsics themselves, this adds the ability to check for CPU target features (i.e. supported instruction sets) at both compile time and runtime.
The static check involves using the #[cfg(target_feature = "<feature>")] attribute around blocks of code, which turns them on or off at compile time. Here’s an example of compile time feature selection:
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
pub fn add_scalar(a: &[f32], b: &[f32], c: &mut [f32]) {
for ((a, b), c) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
*c = a + b;
}
}
#[cfg(target_feature = "sse2")]
pub fn add_sse2(a: &[f32], b: &[f32], c: &mut [f32]) {
// for simplicity assume length is a multiple of chunk size
for ((a, b), c) in a.chunks(4).zip(b.chunks(4)).zip(c.chunks_mut(4)) {
unsafe {
_mm_storeu_ps(
c.as_mut_ptr(),
_mm_add_ps(
_mm_loadu_ps(a.as_ptr()),
_mm_loadu_ps(b.as_ptr())));
}
}
}
#[cfg(target_feature = "avx2")]
pub fn add_avx2(a: &[f32], b: &[f32], c: &mut [f32]) {
// for simplicity assume length is a multiple of chunk size
for ((a, b), c) in a.chunks(8).zip(b.chunks(8)).zip(c.chunks_mut(8)) {
unsafe {
_mm256_storeu_ps(
c.as_mut_ptr(),
_mm256_add_ps(
_mm256_loadu_ps(a.as_ptr()),
_mm256_loadu_ps(b.as_ptr())));
}
}
}
pub fn add(a: &[f32], b: &[f32], c: &mut [f32]) {
#[cfg(target_feature = "avx2")]
return add_avx2(a, b, c);
#[cfg(target_feature = "sse2")]
return add_sse2(a, b, c);
#[cfg(not(any(target_feature = "avx2", target_feature = "sse2")))]
add_scalar(a, b, c);
}
If you look at the assembly listing on Compiler Explorer you can see that both add_scalar and add_sse2 appear in the assembly output (SSE2 is always available on x86-64 targets) but add_avx2 does not, because AVX2 is not enabled by default. The add function has inlined the SSE2 version. The target features available at compile time can be controlled via rustc flags.
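For example, assuming an x86-64 host, building with AVX2 enabled for the whole crate looks something like the line below (-C target-cpu=native is the other common option):

```
RUSTFLAGS="-C target-feature=+avx2" cargo build --release
```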
The runtime check uses the #[target_feature(enable = "<feature>")] attribute, which can be applied to functions to emit code using that feature. These functions must be marked as unsafe, because if they were executed on a CPU that didn’t support that feature the program would crash. The is_x86_feature_detected! macro can be used to determine whether the feature is available at runtime and then call the appropriate function. Here’s an example of runtime feature selection:
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
pub fn add_scalar(a: &[f32], b: &[f32], c: &mut [f32]) {
for ((a, b), c) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
*c = a + b;
}
}
#[cfg_attr(any(target_arch = "x86", target_arch = "x86_64"), target_feature(enable = "sse2"))]
pub unsafe fn add_sse2(a: &[f32], b: &[f32], c: &mut [f32]) {
// for simplicity assume length is a multiple of chunk size
for ((a, b), c) in a.chunks(4).zip(b.chunks(4)).zip(c.chunks_mut(4)) {
_mm_storeu_ps(
c.as_mut_ptr(),
_mm_add_ps(
_mm_loadu_ps(a.as_ptr()),
_mm_loadu_ps(b.as_ptr())));
}
}
#[cfg_attr(any(target_arch = "x86", target_arch = "x86_64"), target_feature(enable = "avx2"))]
pub unsafe fn add_avx2(a: &[f32], b: &[f32], c: &mut [f32]) {
// for simplicity assume length is a multiple of chunk size
for ((a, b), c) in a.chunks(8).zip(b.chunks(8)).zip(c.chunks_mut(8)) {
_mm256_storeu_ps(
c.as_mut_ptr(),
_mm256_add_ps(
_mm256_loadu_ps(a.as_ptr()),
_mm256_loadu_ps(b.as_ptr())));
}
}
pub fn add(a: &[f32], b: &[f32], c: &mut [f32]) {
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
{
if is_x86_feature_detected!("avx2") {
return unsafe { add_avx2(a, b, c) };
}
if is_x86_feature_detected!("sse2") {
return unsafe { add_sse2(a, b, c) };
}
}
add_scalar(a, b, c);
}
Looking again at the assembly listing on Compiler Explorer you can see a call to std::stdsimd::arch::detect::os::check_for which, depending on the result, jumps to the AVX2 implementation or falls through to the inlined SSE2 implementation (again, because SSE2 is always available on x86-64). I don’t know the details of the Rust calling convention, but there seems to be a lot of saving and restoring of registers to and from the stack around the call to check_for.
example::add:
pushq %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
pushq %rax
movq %r9, %r12
movq %r8, %r14
movq %rcx, %r13
movq %rdx, %r15
movq %rsi, %rbp
movq %rdi, %rbx
movl $15, %edi
callq std::stdsimd::arch::detect::os::check_for@PLT
testb %al, %al
je .LBB3_1
movq %rbx, %rdi
movq %rbp, %rsi
movq %r15, %rdx
movq %r13, %rcx
movq %r14, %r8
movq %r12, %r9
addq $8, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
jmp example::add_avx2@PLT
So there’s definitely some overhead in performing this check at runtime, and it’s certainly noticeable when comparing my compile time AVX2 performance (51.1 Mrays/s) to my runtime checked AVX2 performance (47.6 Mrays/s). If you know exactly what hardware your program is going to be running on, it’s better to use the compile time method. In my case the hit method is called millions of times, so this overhead adds up. If I could perform the check in an outer function instead, it should reduce the overhead of the runtime check.
Update - 2018-06-23
There was some great discussion on /r/rust about this point. It seems like this check should be getting inlined, and the fact that it wasn’t is probably a bug. A PR has already landed in the stdsimd library to fix this, thanks to gnzlbg. It might take a while for that change to make it into stable Rust. In the interim there were a couple of other good suggestions on how to minimise the overhead. One was to store the target feature detection result in an enum and use a match to branch to the appropriate function, on the premise that the branch should be well predicted and that avoiding an indirect jump (function pointer) would let the compiler optimise better. This has certainly worked out, and my runtime target feature code now performs the same as the compile time version.
Wrappers and runtime feature selection
In my previous post I also talked about my SIMD wrapper. The purpose of this wrapper was to provide an interface over the widest SIMD registers available, the idea being that I write my ray-sphere collision test once using the wrapped SIMD types. The main benefits are that I have a single implementation of my hit function using the wrapper type, and that I get some nice ergonomics, like being able to use operators such as + and - on my wrapper types. Not to mention enforcing some type safety! For example, introducing a SIMD boolean type that is returned by comparison functions and is then passed into blend functions - making the implicit SSE conventions explicit through the type system.
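Here’s a minimal sketch of that idea with hypothetical names (F32x4 and B32x4 are stand-ins, not the actual wrapper types), assuming the crate is compiled with SSE4.1 enabled as in my static build: comparisons return a distinct boolean vector type, and that type is the only thing blend will accept.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Hypothetical wrapper types for illustration.
#[derive(Copy, Clone)]
pub struct F32x4(__m128);
#[derive(Copy, Clone)]
pub struct B32x4(__m128); // comparison mask, deliberately a separate type

impl F32x4 {
    pub fn splat(s: f32) -> F32x4 {
        unsafe { F32x4(_mm_set1_ps(s)) }
    }
    // A comparison produces a mask, not another float vector.
    pub fn lt(self, rhs: F32x4) -> B32x4 {
        unsafe { B32x4(_mm_cmplt_ps(self.0, rhs.0)) }
    }
    // Blend can only be driven by a mask, making the SSE convention explicit.
    // _mm_blendv_ps is SSE4.1, so this assumes SSE4.1 at compile time.
    pub fn blend(lhs: F32x4, rhs: F32x4, cond: B32x4) -> F32x4 {
        unsafe { F32x4(_mm_blendv_ps(lhs.0, rhs.0, cond.0)) }
    }
}
```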
That’s all very nice, but it only worked at compile time. The catch is that the runtime #[target_feature(enable = ...)] attribute only works on functions. My compile time wrapper just imported a different module depending on what features were available, and my wrapper types had a lot of functions. On top of that, the hit function needed to know which version of the wrapper to call.
I was still interested in implementing a runtime option. My first attempt at SIMD had used the runtime target_feature method, but I dropped it to try writing a wrapper. I was skeptical that I would be able to get my wrapper working with runtime feature detection, at least not without things getting very complicated - but I thought I’d give it a try. Keep in mind that I haven’t completed this code; I just went a long way down this path before I moved on to writing separate scalar, SSE4.1 and AVX2 implementations of my hit function.
I figured one approach would be to make the hit function generic and specify the SSE4.1 or AVX2 implementations as type parameters, choosing the appropriate branch at runtime. For this to work I needed to make traits for my wrapper types, and all the trait methods needed to be unsafe due to the target_feature requirement. Possibly these traits could have been simplified, but this is what I ended up with:
pub trait Bool32xN<T>: Sized + Copy + Clone + BitAnd + BitOr {
unsafe fn num_lanes() -> usize;
unsafe fn unwrap(self) -> T;
unsafe fn to_mask(self) -> i32;
}
pub trait Int32xN<T, BI, B: Bool32xN<BI>>: Sized + Copy + Clone + Add + Sub + Mul {
unsafe fn num_lanes() -> usize;
unsafe fn unwrap(self) -> T;
unsafe fn splat(i: i32) -> Self;
unsafe fn load_aligned(a: &[i32]) -> Self;
unsafe fn load_unaligned(a: &[i32]) -> Self;
unsafe fn store_aligned(self, a: &mut [i32]);
unsafe fn store_unaligned(self, a: &mut [i32]);
unsafe fn indices() -> Self;
unsafe fn blend(lhs: Self, rhs: Self, cond: B) -> Self;
}
pub trait Float32xN<T, BI, B: Bool32xN<BI>>: Sized + Copy + Clone + Add + Sub + Mul + Div {
unsafe fn num_lanes() -> usize;
unsafe fn unwrap(self) -> T;
unsafe fn splat(s: f32) -> Self;
unsafe fn from_x(v: Vec3) -> Self;
unsafe fn from_y(v: Vec3) -> Self;
unsafe fn from_z(v: Vec3) -> Self;
unsafe fn load_aligned(a: &[f32]) -> Self;
unsafe fn load_unaligned(a: &[f32]) -> Self;
unsafe fn store_aligned(self, a: &mut [f32]);
unsafe fn store_unaligned(self, a: &mut [f32]);
unsafe fn sqrt(self) -> Self;
unsafe fn hmin(self) -> f32;
unsafe fn eq(self, rhs: Self) -> B;
unsafe fn gt(self, rhs: Self) -> B;
unsafe fn lt(self, rhs: Self) -> B;
unsafe fn dot3(x0: Self, x1: Self, y0: Self, y1: Self, z0: Self, z1: Self) -> Self;
unsafe fn blend(lhs: Self, rhs: Self, cond: B) -> Self;
}
The single letter type parameters are a bit cryptic; I was thinking about better names but deferred that problem until later. The hit function signature also got quite complicated:
pub unsafe fn hit_simd<BI, FI, II, B, F, I>(
&self,
ray: &Ray,
t_min: f32,
t_max: f32,
) -> Option<(RayHit, u32)>
where
B: Bool32xN<BI> + BitAnd<Output = B>,
F: Float32xN<FI, BI, B> + Add<Output = F> + Sub<Output = F> + Mul<Output = F>,
I: Int32xN<II, BI, B> + Add<Output = I> {
// impl
}
Again with the cryptic type parameter names. These basically represent the wrapper types for f32, bool and i32 and the arch specific SIMD types, e.g. __m128 and __m256. This got pretty close to compiling, but at this point the operators didn’t compile, because the operator traits defined in std::ops aren’t labelled as unsafe. I guess I could have wrapped an unsafe function call in a safe op trait implementation, but that seemed pretty unsafe. At that point I gave up and added an unwrapped AVX2 version of the hit function. It took about an hour, compared to the several hours I’d sunk into my generic approach.
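For reference, wrapping the unsafe call inside a safe std::ops implementation would look something like this (reusing the hypothetical F32x4 type from the earlier sketch). It compiles, but the safe signature hides the fact that the body is only valid when the right target feature is present, which is exactly what felt wrong:

```rust
use std::ops::Add;

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Same hypothetical wrapper as the earlier sketch.
#[derive(Copy, Clone)]
pub struct F32x4(__m128);

impl Add for F32x4 {
    type Output = F32x4;
    // The trait method must be safe, so the unsafety is hidden in the body:
    // nothing in the type system stops this running on a CPU that lacks the
    // required feature when the wrapper is selected at runtime.
    fn add(self, rhs: F32x4) -> F32x4 {
        unsafe { F32x4(_mm_add_ps(self.0, rhs.0)) }
    }
}
```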
There are obvious downsides to maintaining multiple implementations of the same function. Adding some tests would help mitigate that, but if I end up writing more types of ray collisions or adding support for another arch I’m obviously adding a lot more code and potential sources of (possibly duplicated) bugs. I did feel at this point that things were getting too complicated for not much gain in the scheme of my little path tracing project.
I thought this story might be interesting to people wanting to use runtime target feature selection in Rust. It is nice having one executable that can take advantage of whatever CPU features it’s running on, but it also puts some constraints on how you author your code. The unfinished runtime wrapper code is here if you are curious.
Final performance results
Test results are from my laptop running Windows 10 Home. I’m compiling with cargo build --release, of course, with rustc 1.28.0-nightly (71e87be38 2018-05-22). My performance numbers for each iteration of my path tracer are:
| Feature | Mrays/s |
| --- | --- |
| Aras’s C++ SSE4.1 version | 45.5 |
| Static SSE4.1 w/ LTO | 45.8 |
| Static AVX2 w/ LTO | 52.1 |
| Dynamic (AVX2) w/ LTO | 52.1 |
Update 2018-06-23 Updated performance numbers for the static and dynamic versions of the code - thanks to caching of the target feature value the performance is now about the same.
The static branch lives here and the dynamic branch here. My machine supports AVX2 so that’s what the dynamic version ends up using.
In summary, if you don’t know what CPU your code is going to run on, you can get a nice little boost by checking target features at runtime, at the cost of a bit of flexibility around how you structure your code. If you have a fixed hardware target then it’s better to compile for that hardware and avoid the overhead and code restrictions of target_feature.