In my last post on optimising my Rust path tracer with SIMD I had got within 10% of my performance target, that is, Aras’s C++ SSE4.1 path tracer. From profiling I had determined that the main differences were MSVC using SSE versions of sinf and cosf, and differences between the Rayon and enkiTS thread pools. The first thing I tried was to implement an SSE2 version of sin_cos based on Julien Pommier’s code, which I found via a bit of googling. This was enough to get my SSE4.1 implementation to match the performance of Aras’s SSE4.1 code. I had a slight advantage in that I call sin_cos as a single function rather than separate sin and cos functions, but meh, I’m calling my performance target reached. Final performance results are at the end of this post if you just want to skip to that.
The other part of this post is about Rust’s runtime and compile time CPU feature detection and some wrong turns I took along the way.
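To make the distinction concrete, here is a minimal sketch of the two kinds of check using only the standard library. The function names and the `pick_backend` dispatcher are hypothetical and are not taken from the path tracer's code. The compile-time check reflects the flags the compiler was given, while the runtime check asks the CPU the program is actually running on.

```rust
/// Compile-time check: true only if the compiler was told SSE4.1 is available,
/// for example via `-C target-feature=+sse4.1` or `-C target-cpu=native`.
pub fn sse41_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "sse4.1")
}

/// Runtime check: queries the CPU the program is running on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn sse41_detected_at_runtime() -> bool {
    is_x86_feature_detected!("sse4.1")
}

/// On non-x86 targets there is no SSE4.1 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn sse41_detected_at_runtime() -> bool {
    false
}

/// Hypothetical dispatcher: names the code path a program could take.
pub fn pick_backend() -> &'static str {
    if sse41_enabled_at_compile_time() || sse41_detected_at_runtime() {
        "sse4.1"
    } else {
        "scalar"
    }
}
```

Calling `pick_backend()` once at startup and storing the result is one way to avoid repeating the check in an inner loop.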
Following on from path tracing in parallel with Rayon I had a lot of other optimisations I wanted to try. In particular I wanted to see if I could match the CPU performance of @aras_p’s C++ path tracer in Rust. He’d done a fair amount of optimising so it seemed like a good target to aim for. To get a better comparison I copied his scene and also added his light sampling approach, which he talks about here. I also implemented a live render loop mimicking his.
My initial unoptimized code was processing 10Mrays/s on my laptop. Aras’s code (with GPGPU disabled) was doing 45.5Mrays/s. I had a long way to go from here!
tl;dr did I match the C++ in Rust? Almost. My SSE4.1 version is doing 41.2Mrays/s, about 10% slower than the target of 45.5Mrays/s, running on Windows on my laptop. The long answer is more complicated but I will go into that later.
The path tracer I talked about in my previous post runs on one core, but my laptop’s CPU has 4 physical cores. That seems like an easy way to make this thing faster, right? There’s a Rust library called Rayon which provides parallel iterators to divide your data into tasks and run them across multiple threads.
One of the properties of Rust’s type system is that it detects shared memory data races at compile time. This property is a product of Rust’s ownership model, which does not allow shared mutable state. You can read more about this in the Fearless Concurrency chapter of the Rust Book or, for a more formal analysis, Securing the Foundations of the Rust Programming Language. As a consequence of this, Rayon’s API also guarantees data-race freedom.
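As a rough illustration of that guarantee, here is a sketch that uses the standard library's scoped threads rather than Rayon, so it runs without any extra crates. The `shade` and `render_parallel` names are hypothetical stand-ins and not the path tracer's code. Each thread gets its own disjoint mutable chunk of the buffer; trying to hand the same `&mut` chunk to two threads would be rejected by the compiler.

```rust
use std::thread;

/// Hypothetical stand-in for the per-pixel work a path tracer would do.
fn shade(index: usize) -> f32 {
    index as f32 * 0.5
}

/// Fills `pixels` in parallel by giving each thread a disjoint chunk.
pub fn render_parallel(pixels: &mut [f32], num_threads: usize) {
    let threads = num_threads.max(1);
    // Round up so every pixel lands in a chunk; chunk length must be non-zero.
    let chunk_len = ((pixels.len() + threads - 1) / threads).max(1);

    thread::scope(|s| {
        for (chunk_index, chunk) in pixels.chunks_mut(chunk_len).enumerate() {
            // `chunk` is an exclusive borrow of one region, so no other
            // thread can read or write these pixels while this one runs.
            s.spawn(move || {
                let base = chunk_index * chunk_len;
                for (i, pixel) in chunk.iter_mut().enumerate() {
                    *pixel = shade(base + i);
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

A caller would allocate a buffer such as `vec![0.0f32; width * height]` and pass it in with the desired thread count. Rayon's parallel iterators express the same split more compactly and schedule the work on a thread pool instead of spawning a thread per chunk.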
This post will describe how I went about translating a C++ project to Rust, so it’s really intended to be an introduction to Rust for C++ programmers. I will introduce some of the Rust features I used and how they compare to both the C++ used in RTIAW’s code and more “Modern” C++ features that are similar to Rust. I probably won’t talk about ray tracing much at all so if you are interested in learning about that I recommend reading Peter’s book!
Additionally, neither the book’s C++ nor my Rust is optimized code. Aras’s blog series covers a lot of different optimizations he’s performed; I have not done that yet. My Rust implementation does appear to perform faster than the C++ (~40 seconds compared to ~90 seconds for a similar sized scene). I have not investigated why this is the case, but I have some ideas which will be covered later. I mostly wanted to check that my code was in the same ball park and it certainly seems to be.
Today I read Hugo Tunius’ blog post Exploring SIMD on Rust, in which, after some experimentation, he didn’t get the performance boost he expected to see from SIMD. I’ve also been meaning to have more of a play with SIMD so I thought I’d take a look at his git repo and see if I could work out what’s going on.