In C++ there is a common idiom used when writing a low-level interface that has
different implementations for multiple architectures: use the preprocessor to
select the appropriate implementation at compile time. This pattern is
frequently used in C++ math libraries.
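To make that concrete, here is a minimal Rust sketch of the equivalent compile-time selection using cfg attributes. The dot4 function and its bodies are invented for illustration; they are not taken from glam or any other library.

```rust
// Pick an implementation at compile time based on the target's features.
// Function name and bodies are illustrative only.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn dot4(a: [f32; 4], b: [f32; 4]) -> f32 {
    // SIMD path: multiply all four lanes at once, then sum them.
    unsafe {
        use std::arch::x86_64::*;
        let m = _mm_mul_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), m);
        out[0] + out[1] + out[2] + out[3]
    }
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn dot4(a: [f32; 4], b: [f32; 4]) -> f32 {
    // Portable fallback for targets without SSE2.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    assert_eq!(dot4([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]), 20.0);
}
```

The selection happens entirely at compile time, so only one of the two definitions ever ends up in the binary.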
Fast iteration times are something that many game developers consider to be of
utmost importance. Keeping build times short is a major component of quick
iteration for a programmer. Aside from the actual time spent compiling, any
wait long enough that you start to lose focus on what you are working on, get
distracted, or lose track of what you were doing costs you even more time.
Thus one of my goals when writing glam was to ensure it was fast to compile.
Rust compile times are known to be a bit slow compared to many other languages,
and I didn’t want to pour fuel onto that particular fire.
As part of writing glam I also wrote mathbench so I could compare
performance with similar libraries. I always wanted to include build time
comparisons as part of mathbench, and I’ve finally got around to doing that
with a new tool called buildbench.
In my last post on optimising my Rust path tracer with SIMD I had got within 10% of my performance target, that is, Aras’s C++ SSE4.1 path tracer. From profiling I had determined that the main differences were MSVC using SSE versions of sinf and cosf, and differences between the Rayon and enkiTS thread pools. The first thing I tried was implementing an SSE2 version of sin_cos based on Julien Pommier’s code that I found via a bit of googling. This was enough to get my SSE4.1 implementation to match the performance of Aras’s SSE4.1 code. I had a slight advantage in that I call sin_cos as a single function rather than separate sin and cos functions, but meh, I’m calling my performance target reached. Final performance results are at the end of this post if you just want to skip to that.
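For context, the interface looks roughly like the sketch below: a single call that returns both results for four packed floats. To be clear, this is not Pommier’s algorithm; the real port evaluates range-reduced polynomial approximations entirely in SSE2 registers, whereas this stand-in round-trips through the scalar sin and cos just to show the shape of the call, and the name sin_cos_ps is my own placeholder.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Illustrative only: compute sin and cos of four packed floats in one call.
// A real SSE2 version would stay in SIMD registers the whole time; this
// sketch extracts the lanes and uses the scalar functions for brevity.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn sin_cos_ps(x: __m128) -> (__m128, __m128) {
    let mut xs = [0.0f32; 4];
    _mm_storeu_ps(xs.as_mut_ptr(), x);
    let mut s = [0.0f32; 4];
    let mut c = [0.0f32; 4];
    for i in 0..4 {
        s[i] = xs[i].sin();
        c[i] = xs[i].cos();
    }
    (_mm_loadu_ps(s.as_ptr()), _mm_loadu_ps(c.as_ptr()))
}
```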
The other part of this post is about Rust’s runtime and compile-time CPU feature detection, and some wrong turns I took along the way.
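As a preview of the runtime side, here is a minimal sketch of dispatching on a detected CPU feature with std’s is_x86_feature_detected! macro. The trace_scene names are placeholders for the example, not functions from my path tracer.

```rust
// Check the CPU at runtime and dispatch to the widest supported implementation.
// Function names are illustrative only.
fn trace_scene() {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            // SAFETY: only called when the CPU reports SSE4.1 support.
            return unsafe { trace_scene_sse41() };
        }
    }
    trace_scene_scalar()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn trace_scene_sse41() {
    // SSE4.1-specific path would go here.
}

fn trace_scene_scalar() {
    // Portable fallback would go here.
}
```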
Following on from path tracing in parallel with Rayon I had a lot of other optimisations I wanted to try. In particular I wanted to see if I could match the CPU performance of @aras_p’s C++ path tracer in Rust. He’d done a fair amount of optimising so it seemed like a good target to aim for. To get a better comparison I copied his scene and also added his light sampling approach, which he talks about here. I also implemented a live render loop mimicking his.
My initial unoptimised code was processing 10Mrays/s on my laptop. Aras’s code (with GPGPU disabled) was doing 45.5Mrays/s. I had a long way to go from here!
tl;dr did I match the C++ in Rust? Almost. My SSE4.1 version is doing 41.2Mrays/s, about 10% slower than the target 45.5Mrays/s, running on Windows on my laptop. The long answer is more complicated but I will go into that later.