Microbenching SIMD in Rust
Today I read Hugo Tunius' blog post Exploring SIMD on Rust, in which, after some experimentation, he didn't get the performance boost he expected from SIMD. I've also been meaning to have more of a play with SIMD, so I thought I'd take a look at his git repo and see if I could work out what's going on.
Hugo mentioned he was having trouble with Bencher, so let's start there. Running cargo bench gave these results:
running 4 tests
test bench_f32 ... bench: 151 ns/iter (+/- 2)
test bench_f32_sse ... bench: 162,004 ns/iter (+/- 212)
test bench_f32_sse_inline ... bench: 150 ns/iter (+/- 11)
test bench_f64 ... bench: 150 ns/iter (+/- 3)
Something is very wrong here. One SSE benchmark is orders of magnitude slower than all of the other benchmarks. That doesn’t make much sense - time to look at some assembly.
To generate assembly via cargo you can specify RUSTFLAGS=--emit asm before running cargo bench. Since we're only interested in the assembly output for now, I'm running
RUSTFLAGS="--emit asm" cargo bench --no-run
This generates some AT&T-style assembly output to .s files in the target/release/deps directory. Ideally I'd prefer Intel format with demangled symbols, but I don't think rustc --emit gives any control over this.
Comparing the assembly for bench_f32 and bench_f32_sse, there are some clear differences. All of the work happens between the bencher calls that record the start and end times - the mangled calls to _ZN3std4time7Instant3now17he141c6f08d993cf9E@PLT and _ZN3std4time7Instant7elapsed17hc3711b876336edbcE@PLT. This is the assembly for bench_f32:
movq %rdi, %r14
callq _ZN3std4time7Instant3now17he141c6f08d993cf9E@PLT
movq %rax, 8(%rsp)
movq %rdx, 16(%rsp)
movq (%r14), %rbx
testq %rbx, %rbx
je .LBB15_2
.p2align 4, 0x90
.LBB15_1:
callq _ZN7vector314num_iterations17h62f9e330173945acE
decq %rbx
jne .LBB15_1
.LBB15_2:
leaq 8(%rsp), %rdi
callq _ZN3std4time7Instant7elapsed17hc3711b876336edbcE@PLT
movq %rax, 8(%r14)
movl %edx, 16(%r14)
And this is the assembly for bench_f32_sse:
movq %rdi, %r14
callq _ZN3std4time7Instant3now17he141c6f08d993cf9E@PLT
movq %rax, 8(%rsp)
movq %rdx, 16(%rsp)
movq (%r14), %r15
testq %r15, %r15
je .LBB16_4
xorl %r12d, %r12d
.p2align 4, 0x90
.LBB16_2:
incq %r12
callq _ZN7vector314num_iterations17h62f9e330173945acE
movq %rax, %rbx
testq %rbx, %rbx
je .LBB16_3
.p2align 4, 0x90
.LBB16_5:
movaps .LCPI16_0(%rip), %xmm0
movaps .LCPI16_1(%rip), %xmm1
callq _ZN17vector_benchmarks7dot_sse17hc439918e0ab4bdcfE@PLT
decq %rbx
jne .LBB16_5
.LBB16_3:
cmpq %r15, %r12
jne .LBB16_2
.LBB16_4:
leaq 8(%rsp), %rdi
callq _ZN3std4time7Instant7elapsed17hc3711b876336edbcE@PLT
movq %rax, 8(%r14)
movl %edx, 16(%r14)
Without going through everything that's going on in those listings, one is obviously a lot longer than the other - but why? In the first listing there should be a bunch of math happening between the calls to Instant::now and Instant::elapsed to calculate the dot product, but there is only a call to num_iterations. The second listing does have a call to the mangled name for dot_sse. The compiler is optimizing away all of the bench_f32* code except the call to dot_sse, which it can't remove because the function isn't inlined. That explains the odd timings above: the "slow" SSE benchmark is the only one actually doing any work at all.
Let’s look at the Rust code for bench_f32
fn bench_f32(b: &mut Bencher) {
    b.iter(|| {
        let a: Vector3<f32> = Vector3::new(23.2, 39.1, 21.0);
        let b: Vector3<f32> = Vector3::new(-5.2, 0.1, 13.4);
        (0..num_iterations()).fold(0.0, |acc, i| acc + a.dot(&b));
    });
}
The problem here is the perennial problem with micro-benchmarking suites: is my code actually being run? What's happening is that the result of the fold call is discarded. Rust returns the value of the last expression in a block if it's not followed by a semi-colon; because the fold ends with a semi-colon, the closure returns nothing - well, () to be precise. To fix this and stop the optimizer from quite rightly removing the unnecessary work, we need to either add an explicit return or remove the semi-colon following the fold.
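A minimal version of the fix (a sketch reusing the repo's Vector3 and num_iterations exactly as in the listing above) is to drop the trailing semi-colon so the fold result becomes the closure's return value:
fn bench_f32(b: &mut Bencher) {
    b.iter(|| {
        let a: Vector3<f32> = Vector3::new(23.2, 39.1, 21.0);
        let b: Vector3<f32> = Vector3::new(-5.2, 0.1, 13.4);
        // no trailing semi-colon: the sum is now returned from the closure,
        // so the optimizer can no longer discard the work
        (0..num_iterations()).fold(0.0, |acc, _i| acc + a.dot(&b))
    });
}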
Returning the result from all of the fold calls produces a compile error:
LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse41.dpps
The compiler is now actually generating code for bench_f32_sse_inline - previously it was all optimized away before the intrinsic mattered - but the intrinsic is not available because SSE4.1 isn't enabled for the target. To make sure SSE4.1 is available I created a .cargo/config file containing the following:
[target.'cfg(any(target_arch = "x86", target_arch = "x86_64"))']
rustflags = ["-C", "target-cpu=native", "-C", "target-feature=+sse4.1"]
I have no idea if this is the “correct” way to enable SSE4.1, but it compiles.
The bench results now look like this:
running 4 tests
test bench_f32 ... bench: 90,383 ns/iter (+/- 2,185)
test bench_f32_sse ... bench: 486,344 ns/iter (+/- 12,059)
test bench_f32_sse_inline ... bench: 89,788 ns/iter (+/- 3,894)
test bench_f64 ... bench: 89,237 ns/iter (+/- 1,299)
Again examining the assembly output, bench_f32_sse is still slower because dot_sse is not getting inlined. Let's add #[inline(always)] to all the dot* functions just to be sure.
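Just to illustrate, here's a sketch of what an #[inline(always)] SSE dot product could look like if written against today's std::arch intrinsics; the repo predates std::arch and uses its own SSE bindings, so the exact name and signature here are assumptions rather than the actual code.
use std::arch::x86_64::{__m128, _mm_cvtss_f32, _mm_dp_ps};

// Hypothetical dot_sse sketch. Safety: relies on the crate being compiled
// with SSE4.1 enabled, as set up in .cargo/config above. The 0x71 immediate
// (the $113 seen in the vdpps instruction below) multiplies lanes 0-2, sums
// them, and writes the result to lane 0.
#[inline(always)]
fn dot_sse(a: __m128, b: __m128) -> f32 {
    unsafe { _mm_cvtss_f32(_mm_dp_ps(a, b, 0x71)) }
}
With the attribute added everywhere, the results become: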
running 4 tests
test bench_f32 ... bench: 89,652 ns/iter (+/- 1,113)
test bench_f32_sse ... bench: 89,843 ns/iter (+/- 1,160)
test bench_f32_sse_inline ... bench: 89,648 ns/iter (+/- 1,428)
test bench_f64 ... bench: 89,794 ns/iter (+/- 1,311)
Now we’re getting consistent looking results, but the SSE functions are still not noticeably faster than the scalar code. Time to reexamine the bench_f32_sse
assembly yet again. So it’s definitely inlined now and we’re seeing our vector dot product
vmovaps .LCPI16_0(%rip), %xmm0
vdpps $113, .LCPI16_1(%rip), %xmm0, %xmm1
But below our dot product there are a lot of vaddss instructions:
.LBB16_2:
addq $1, %rbx
callq _ZN7vector314num_iterations17h62f9e330173945acE
testq %rax, %rax
je .LBB16_3
leaq -1(%rax), %rcx
movq %rax, %rsi
vxorps %xmm0, %xmm0, %xmm0
xorl %edx, %edx
andq $7, %rsi
je .LBB16_5
vmovaps 16(%rsp), %xmm1
.p2align 4, 0x90
.LBB16_7:
addq $1, %rdx
vaddss %xmm0, %xmm1, %xmm0
cmpq %rdx, %rsi
jne .LBB16_7
jmp .LBB16_8
.p2align 4, 0x90
.LBB16_3:
vxorps %xmm0, %xmm0, %xmm0
jmp .LBB16_11
.p2align 4, 0x90
.LBB16_5:
vmovaps 16(%rsp), %xmm1
.LBB16_8:
cmpq $7, %rcx
jb .LBB16_11
subq %rdx, %rax
.p2align 4, 0x90
.LBB16_10:
vaddss %xmm0, %xmm1, %xmm0
vaddss %xmm0, %xmm1, %xmm0
vaddss %xmm0, %xmm1, %xmm0
vaddss %xmm0, %xmm1, %xmm0
vaddss %xmm0, %xmm1, %xmm0
vaddss %xmm0, %xmm1, %xmm0
vaddss %xmm0, %xmm1, %xmm0
vaddss %xmm0, %xmm1, %xmm0
addq $-8, %rax
jne .LBB16_10
So what’s happened here? Yet again the optimizer is being clever. It realizes that the dot product calculation is invariant, so it’s moved it out of the loop. The fold
now consists of a loop summing num_iterations
of the dot product. Even this has been loop unrolled by the compiler, which is why there are so many vaddss
.
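In other words, the fold has effectively been rewritten into something like this (an illustrative sketch reconstructed from the assembly, not real compiler output):
// `dot` stands in for a.dot(&b), computed once and hoisted out of the loop;
// the repeated additions are the vaddss instructions, which LLVM has
// unrolled eight at a time.
fn summed_dot(dot: f32, n: u64) -> f32 {
    let mut acc = 0.0;
    for _ in 0..n {
        acc += dot;
    }
    acc
}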
In Conclusion
Micro-benchmarks are hard. Even when they're testing the right thing they can still mislead due to outside influences - other software running on the system, or the CPU's caches and predictors behaving differently than they would when the code runs inside a real program.
In this case we found that the code we expected to benchmark wasn't being run at all in most cases, firstly due to the small omission of a return value and a clever optimizing compiler. When the return value was fixed and the code did run, the dot product was still only computed once rather than the expected num_iterations times, because again the compiler was smart enough to hoist the loop-invariant calculation out of the loop.
We still haven’t answered the question is SSE faster but we have at least determined why it doesn’t appear to be in this benchmark. The way I’d go about that would be to generate a Vec
of data to dot product and fold on that, rather than an invariant.
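As a sketch of what that could look like - reusing the repo's Vector3, Bencher and num_iterations as in the listings above, with the benchmark shape itself being my own guess - something along these lines should keep the optimizer honest:
fn bench_f32_vec(b: &mut Bencher) {
    // build varying inputs outside the timed closure so the dot products
    // can't be hoisted out of the loop as invariants
    let data: Vec<(Vector3<f32>, Vector3<f32>)> = (0..num_iterations())
        .map(|i| {
            let x = i as f32;
            (Vector3::new(x, x + 1.0, x + 2.0),
             Vector3::new(x * 0.5, x - 3.0, x * 0.25))
        })
        .collect();
    b.iter(|| {
        // no trailing semi-colon: return the sum so it can't be discarded
        data.iter().fold(0.0, |acc, pair| acc + pair.0.dot(&pair.1))
    });
}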