# Introducing glam and mathbench

`glam` is a simple and fast Rust linear algebra library for games and graphics.
`mathbench` is a set of unit tests and benchmarks comparing the performance of
`glam` with the popular Rust linear algebra libraries `cgmath` and `nalgebra`.

The following is a table of benchmarks produced by `mathbench` comparing `glam`
performance to `cgmath` and `nalgebra` on `f32` data.
benchmark | glam | cgmath | nalgebra |
---|---|---|---|
euler 2d | 9.063 us | 8.977 us | 27.94 us |
euler 3d | 15.78 us | 29.52 us | 195.4 us |
mat2 determinant | 1.3140 ns | 1.0528 ns | 1.0825 ns |
mat2 inverse | 2.0524 ns | 2.6565 ns | 2.7056 ns |
mat2 mul mat2 | 2.1091 ns | 3.0495 ns | 3.6305 ns |
mat2 transform vec2 | 2.1865 ns | 2.4035 ns | 6.8877 ns |
mat2 transpose | 0.7117 ns | 1.3231 ns | 1.8297 ns |
mat3 determinant | 2.2130 ns | 2.5348 ns | 2.5766 ns |
mat3 inverse | 7.7296 ns | 7.9519 ns | 9.3100 ns |
mat3 mul mat3 | 4.9371 ns | 9.4619 ns | 8.3076 ns |
mat3 transform vec3 | 2.2893 ns | 4.0822 ns | 8.0838 ns |
mat3 transpose | 2.0035 ns | 3.5175 ns | 8.9548 ns |
mat4 determinant | 7.9524 ns | 12.2581 ns | 54.2954 ns |
mat4 inverse | 21.0455 ns | 43.6983 ns | 55.8788 ns |
mat4 mul mat4 | 6.7517 ns | 9.2621 ns | 16.2498 ns |
mat4 transform vec4 | 2.5298 ns | 3.5734 ns | 4.1222 ns |
mat4 transpose | 2.7026 ns | 8.0926 ns | 11.5411 ns |
quat conjugate | 0.8866 ns | 1.8263 ns | 1.7650 ns |
quat mul quat | 2.8647 ns | 5.6678 ns | 5.4807 ns |
quat transform vec3 | 4.2107 ns | 6.5798 ns | 7.0398 ns |
vec3 cross | 2.0677 ns | 2.8890 ns | 2.8725 ns |
vec3 dot | 1.3789 ns | 1.6669 ns | 1.6666 ns |
vec3 length | 2.0412 ns | 2.0508 ns | 88.1277 ns |
vec3 normalize | 4.0433 ns | 4.1260 ns | 87.2455 ns |
These benchmarks were performed on my Intel i7-4710HQ CPU on Linux. They were
compiled with the stable 1.36 Rust compiler. Lower numbers are better. I hope
it’s clear that `glam` is outperforming `cgmath` and `nalgebra` in most of
these benchmarks.
Generally `glam` functionality matches that of the other libraries for the
functions tested here. One exception is matrix inverse: the `cgmath` and
`nalgebra` inverse functions return an `Option` type which is `None` if the
matrix wasn’t invertible, while `glam` assumes the input was invertible and
always returns a `Mat4` (see the sketch below). There may be other differences
I’m not aware of.
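To illustrate that difference, here’s a sketch of how the two inverse APIs look
from the caller’s side (note: the `cgmath` call needs the `SquareMatrix` trait
in scope, and exact signatures may vary between library versions):

```rust
use cgmath::SquareMatrix; // brings `invert` into scope

fn invert_both(g: glam::Mat4, c: cgmath::Matrix4<f32>) {
    // glam assumes invertibility and returns a Mat4 directly.
    let g_inv: glam::Mat4 = g.inverse();

    // cgmath returns an Option, None if the matrix is singular.
    let c_inv: Option<cgmath::Matrix4<f32>> = c.invert();

    let _ = (g_inv, c_inv);
}
```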
Like `glam`, `cgmath` also targets games and graphics. `nalgebra` is a general
purpose linear algebra library which has a broader audience than games and
graphics programmers.
There are some strange outliers in the `mathbench` results, primarily the
`nalgebra` `vec3 length` and `vec3 normalize` times. I haven’t looked into why
they’re slow, or whether it’s a problem with the bench or with `nalgebra`, but
given that `dot` is perfectly fast it seems odd: the only difference between
`dot` and `length` should be a square root.
The reason `glam` is faster is primarily that it uses SIMD internally for all
types, with the exception of `Vec2`.

You can see a similar table at the bottom of this post with `glam` SIMD support
disabled. TL;DR: it’s similar performance to `cgmath`, which is expected.

See the full `mathbench` report for more detailed results.
## Why write another math library?
My goal in writing `glam`, and the trade-offs I made, were primarily a focus on
good `f32` performance by utilising SIMD, and a simple API. This comes at the
expense of genericity: there is no `Vector3<T>` type, just `Vec3`, which is
`f32` based. It would be possible to support `f64` or generic types in the
future, but it’s not a high priority for me right now; in my experience most
games only use `f32`.
`glam` also avoids baking mathematical correctness into the type system. There
are no `Point3` or `UnitQuaternion` types, for example; it is up to the
programmer whether they want their `Vec3` to behave as a point or a vector, or
whether their quaternion is normalised. This decision sacrifices correctness
enforced by the type system for performance and a simple API.
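For example, with no separate point type, the programmer chooses the `w`
component when transforming with a `Mat4` (a small sketch using the
`Mat4 * Vec4` operator):

```rust
use glam::{Mat4, Vec4};

fn transform_point_and_vector(m: Mat4) -> (Vec4, Vec4) {
    // Treat (1, 2, 3) as a point: w = 1, so translation applies.
    let point = m * Vec4::new(1.0, 2.0, 3.0, 1.0);
    // Treat (1, 2, 3) as a direction: w = 0 ignores translation.
    let dir = m * Vec4::new(1.0, 2.0, 3.0, 0.0);
    (point, dir)
}
```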
## SIMD
Something I noticed when optimising a path tracer was that I got some good
performance wins from implementing a `Vector3` with SSE2 (see SIMD path
tracing). I wanted to take that further and build a full library that utilised
SIMD for vector, matrix and quaternion types. I’d seen Rust math libraries that
use SIMD for some functions, but I hadn’t seen one that uses SIMD vectors for
storage.
### SIMD primer
SIMD stands for Single Instruction Multiple Data. The “multiple data” part
varies depending on the CPU. Recent Intel CPUs have SIMD instruction sets that
can operate on 128-bit (SSE), 256-bit (AVX) and 512-bit (AVX-512) data sizes.
That is 4, 8 and 16 `f32` values processed with a single instruction.

Using a simple summation as an example: with conventional scalar operations,
four add instructions must be executed one after another to obtain the sums,
while SIMD uses only one add instruction to achieve the same result. By
requiring fewer instructions to process a given mass of data, SIMD operations
yield higher efficiency than scalar operations.
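As a minimal sketch of that summation example, here is the same four-lane add
written both as scalar Rust and with SSE2 intrinsics (these function names are
mine for illustration, not part of `glam`):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Scalar: four separate add operations, one per lane.
fn add_scalar(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

// SSE2: a single `addps` instruction adds all four lanes at once.
#[cfg(target_arch = "x86_64")]
fn add_simd(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    unsafe {
        let sum = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), sum);
        out
    }
}
```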
SIMD values are typically stored in a special type. In the case of SSE2, the
`__m128` type is used to store 4 floats. One significant difference between
this type and, say, `[f32; 4]` is that `__m128` is 16 byte aligned. While you
can load unaligned data into SSE2 registers, this is slower than loading
aligned data.
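As a quick sanity check of that alignment difference (a sketch, not `glam`
code):

```rust
#[cfg(target_arch = "x86_64")]
fn main() {
    use std::arch::x86_64::__m128;
    use std::mem::align_of;

    // __m128 must sit on a 16 byte boundary; a plain float array only needs 4.
    assert_eq!(align_of::<__m128>(), 16);
    assert_eq!(align_of::<[f32; 4]>(), 4);
}
```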
There are some downsides to using SIMD for a vector math library. One thing you
may notice with the above example is that there are 4 values, while a `Vec3`
only contains 3 values. The 4th value is effectively ignored and is just wasted
space.

Another downside is that while SIMD is great for adding two vectors together,
“horizontal operations” are awkward and not so fast. An example of a horizontal
operation would be summing the four values `A0`, `A1`, `A2` and `A3` of a
single SIMD vector; more commonly in a vector math library, performing a dot
product is a horizontal operation.
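To show why, here is a sketch of a dot product written with raw SSE intrinsics
(illustrative only, not necessarily how `glam` implements it): the multiply is
a single vertical instruction, but folding the four lane products into one sum
takes a chain of shuffles and adds.

```rust
#[cfg(target_arch = "x86_64")]
fn dot4(a: [f32; 4], b: [f32; 4]) -> f32 {
    use std::arch::x86_64::*;
    unsafe {
        // Vertical part: one mulps computes all four lane products.
        let p = _mm_mul_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        // Horizontal part: shuffle and add to fold the lanes together.
        let swapped = _mm_shuffle_ps(p, p, 0b10_11_00_01); // [p1, p0, p3, p2]
        let sums = _mm_add_ps(p, swapped);                 // [p0+p1, _, p2+p3, _]
        let high = _mm_movehl_ps(sums, sums);              // bring p2+p3 to lane 0
        _mm_cvtss_f32(_mm_add_ss(sums, high))              // p0+p1+p2+p3
    }
}
```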
And if your CPU supports wider instruction sets such as AVX2, a `Vec3` can’t
take advantage of them; after all, it only has 3 floats, even if your
instruction set can operate on 8.
The most effective way to take advantage of SIMD is to structure your data as
Structure of Arrays (SoA) and feed that through the widest SIMD vectors you
have on your CPU. What I mean by that is, instead of an Array of Structs (AoS):

```rust
struct Vec3AoS {
    xyz: Vec<(f32, f32, f32)>,
}
```

you store each component of the vector in its own array:

```rust
struct Vec3SoA {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
}
```
Then you load 4 `x`, `y` and `z` values at a time and add them, as in the
sketch below. `glam` does not help you with that.
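Here is a rough sketch of what that SoA processing might look like with SSE2,
assuming all the arrays have the same length, a multiple of 4 (my own
illustrative code, not something `glam` provides):

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn add4(a: &[f32], b: &[f32], out: &mut [f32], i: usize) {
    use std::arch::x86_64::*;
    // Add 4 consecutive components with a single addps.
    let sum = _mm_add_ps(
        _mm_loadu_ps(a.as_ptr().add(i)),
        _mm_loadu_ps(b.as_ptr().add(i)),
    );
    _mm_storeu_ps(out.as_mut_ptr().add(i), sum);
}

#[cfg(target_arch = "x86_64")]
fn add_soa(a: &Vec3SoA, b: &Vec3SoA, out: &mut Vec3SoA) {
    for i in (0..a.x.len()).step_by(4) {
        unsafe {
            add4(&a.x, &b.x, &mut out.x, i); // 4 x components at once
            add4(&a.y, &b.y, &mut out.y, i); // 4 y components at once
            add4(&a.z, &b.z, &mut out.z, i); // 4 z components at once
        }
    }
}
```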
For more information on doing SIMD the “right way” I recommend reading through this GDC2015 presentation on SIMD at Insomniac Games.
You should always be able to outperform `glam` by rewriting your data
structures to be SoA and using SIMD instructions directly, but I still think
it’s valuable to have a math library that performs well if you haven’t done
that work.
Although 25% of `glam::Vec3` is wasted space, it still performs better than the
equivalent scalar code, since we are still operating on 3 floats with one
instruction a lot of the time. And even though horizontal operations such as
dot products are awkward, they still perform slightly better than the scalar
equivalent, as you can see in the `mathbench` results above for `vec3 dot`.
## API design

### Built around SIMD
Many `glam` types use `__m128` for data storage internally to get the best SSE2
performance. This has a few implications for API design.

The `__m128` type is opaque: you do not have direct access to the underlying
`f32` components. All component data must be read and written via accessors.
For example, this is how the get and set of the `y` component of a `glam::Vec3`
are implemented:
```rust
impl Vec3 {
    #[inline]
    pub fn y(self) -> f32 {
        // Broadcast lane 1 (y) to every lane, then extract lane 0 as a scalar.
        unsafe { _mm_cvtss_f32(_mm_shuffle_ps(self.0, self.0, 0b01_01_01_01)) }
    }

    #[inline]
    pub fn set_y(&mut self, y: f32) {
        unsafe {
            // Move the new y into lane 0, shuffle it up into lane 1, then
            // restore the original x in lane 0.
            let mut t = _mm_move_ss(self.0, _mm_set_ss(y));
            t = _mm_shuffle_ps(t, t, 0b11_10_00_00);
            self.0 = _mm_move_ss(t, self.0);
        }
    }
}
```
These getter and setter methods use SSE2 intrinsics to load and store scalar
values to and from the `y` lane of the `__m128` vector.
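From the caller’s side this looks like any other accessor; a small sketch using
the API shown above:

```rust
fn main() {
    let mut v = glam::Vec3::new(1.0, 2.0, 3.0);
    assert_eq!(v.y(), 2.0); // read via the shuffle-based getter
    v.set_y(5.0);           // write via the move/shuffle-based setter
    assert_eq!(v.y(), 5.0);
}
```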
### Avoiding complexity
I wanted to come up with an API which is very low friction for developers to
use; something that covers the common needs of someone working in games and
graphics and doesn’t require much effort to learn. I also wanted something that
was easy for me to write.
From the outset I wanted to avoid using traits and generics. Traits and
generics are wonderful language features, but I didn’t see a great reason to
use them in `glam`.
Using SIMD types for storage would complicate generics, as I think two generic
parameters would be required: a scalar type (e.g. `f32`) and a storage type
(e.g. `__m128` if available). That already sounds complicated!
That’s not to say that `glam` will never contain vectors of integer or generic
types, but for the sake of simplicity I wanted to avoid them for a while, until
the API feels stable and there’s a compelling reason to use them.
### Feeling Rusty
I also wanted to make `glam` “Rusty”. One thing I’ve noticed about the Rust
standard library is that everything is a method; for example, `sin` is a method
on `f32` and `f64`. It feels a bit weird coming from other languages, but it
turns out there are reasons for this. From the Rust API guidelines:
> Methods have numerous advantages over functions:
>
> - They do not need to be imported or qualified to be used: all you need is a value of the appropriate type.
> - Their invocation performs auto-borrowing (including mutable borrows).
> - They make it easy to answer the question “what can I do with a value of type T” (especially when using rustdoc).
> - They provide self notation, which is more concise and often more clearly conveys ownership distinctions.
Things like `a.lerp(b, 0.5)` feel a bit odd, but I could always add free
functions for this kind of thing in addition to the methods.
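For example, a free function like this could sit alongside the method (a
hypothetical addition, not part of the current `glam` API):

```rust
use glam::Vec3;

// Hypothetical free function wrapping the existing `Vec3::lerp` method.
pub fn lerp(a: Vec3, b: Vec3, t: f32) -> Vec3 {
    a.lerp(b, t)
}
```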
I’ve tried to follow the Rust API Guidelines in general.
### Mathematical conventions
`glam` interprets vectors as column matrices (also known as column vectors),
meaning that when transforming a vector with a matrix, the matrix goes on the
left, e.g. `v' = Mv`. DirectX uses row vectors and OpenGL uses column vectors;
there are pros and cons to both.
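In code, the column vector convention means the matrix goes on the left of the
`*`, and composed transforms read right to left; a small sketch:

```rust
use glam::{Mat4, Vec4};

fn transform(model: Mat4, view: Mat4, projection: Mat4, v: Vec4) -> Vec4 {
    // v' = Mv: `model` is applied first, then `view`, then `projection`.
    projection * view * model * v
}
```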
Matrices are stored in column major format. Each column vector is stored in contiguous memory.
Rotations follow the left-hand rule.
Some libraries support both column and row vectors; my preference with `glam`
was to pick one convention and stick with it (of course I didn’t do that: I
started with row vectors and switched to column vectors).
## Test coverage
I wanted to aim for 100% test coverage, especially because there are multiple implementations of many types depending on whether SIMD is available or not.
To determine if I actually did have 100% test coverage I’ve been using a cargo
plugin called `tarpaulin`. It only supports x86_64 processors running Linux and
does give some false negatives, but it’s been pretty good for my needs. I have
it integrated into my `travis-ci` build, posting results to `coveralls.io`.
According to `coveralls.io`, `glam` has 87% test coverage. I think the real
figure is probably a bit higher, due to `tarpaulin` reporting some lines as
untested when they actually are tested.
## Performance
If you want to say your library is fast, you better be measuring performance!
### Micro-benchmarking with Criterion.rs
I’ve primarily been monitoring the performance of `glam` using the `criterion`
crate. `criterion` fills a similar niche to Rust’s built-in `bench`, but unlike
`bench` it works on stable. It also has a bunch of other nice features.
Micro-benchmarking is a useful indication of performance, but it’s not
necessarily going to tell you the full story of how code will perform under
less artificial conditions (benchmarks tend to be quite limited in scope). All
the same, benchmarks are definitely better than no metrics, and I’ve found them
to be a pretty reliable source of performance information when optimising code.
### Comparing with other libraries
`glam` has a bunch of criterion benchmarks that are useful for optimising
`glam`, but they don’t tell me how `glam` performs relative to other math
libraries. I created `mathbench` for this purpose.

`mathbench` is a collection of unit tests and benchmarks. The unit tests check
that `glam` produces the same output as other libraries (within some floating
point tolerance) and the benchmarks compare performance. The table at the top
of this post is produced from these benchmarks.
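As a sketch of the kind of tolerance-based check such a unit test performs,
here is a `glam` result compared against a scalar reference (my own
illustration, not actual `mathbench` code):

```rust
fn main() {
    let a = glam::Vec3::new(1.0, 2.0, 3.0);
    let b = glam::Vec3::new(4.0, 5.0, 6.0);

    // Scalar reference implementation of the dot product.
    let expected: f32 = 1.0 * 4.0 + 2.0 * 5.0 + 3.0 * 6.0;

    // Compare within a floating point tolerance rather than exactly.
    assert!((a.dot(b) - expected).abs() < 1e-6);
}
```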
`mathbench` could also be of use to other library authors, to see how their
library is performing and to guide optimisations.
One of the nice features of `criterion` is that you can set up one benchmark
that runs multiple functions on the same data and produces a report comparing
the results. Here’s an example from `mathbench`:
```rust
fn bench_mat4_mul_mat4(c: &mut Criterion) {
    use criterion::Benchmark;
    use std::ops::Mul;
    c.bench(
        "mat4 mul mat4",
        Benchmark::new("glam", |b| {
            use glam::Mat4;
            bench_binop!(b, op => mul_mat4, ty1 => Mat4, ty2 => Mat4)
        })
        .with_function("cgmath", |b| {
            use cgmath::Matrix4;
            bench_binop!(b, op => mul, ty1 => Matrix4<f32>, ty2 => Matrix4<f32>)
        })
        .with_function("nalgebra", |b| {
            use nalgebra::Matrix4;
            bench_binop!(b, op => mul, ty1 => Matrix4<f32>, ty2 => Matrix4<f32>)
        }),
    );
}
```
If you have `gnuplot` installed, `criterion` will produce reports with charts
such as violin plots comparing the benchmarked libraries.
### Viewing assembly
While `mathbench` is a library, most of its utility comes from the tests and
benchmarks. One use I have had for the library itself, though, is creating
public functions for inspecting assembly with `cargo-asm`, primarily because
that was the easiest way to get `cargo-asm` to show me something.

For example, in `mathbench` I’ve defined a simple wrapper function for
`glam::Mat4 * glam::Vec4`:
```rust
pub fn glam_mat4_mul_vec4(lhs: &glam::Mat4, rhs: &glam::Vec4) -> glam::Vec4 {
    *lhs * *rhs
}
```
And then run `cargo asm mathbench::glam_mat4_mul_vec4` to get the following
output:
```asm
mov rax, rdi
movaps xmm0, xmmword, ptr, [rdx]
movaps xmm1, xmm0
shufps xmm1, xmm0, 0
mulps xmm1, xmmword, ptr, [rsi]
movaps xmm2, xmm0
shufps xmm2, xmm0, 85
mulps xmm2, xmmword, ptr, [rsi, +, 16]
addps xmm2, xmm1
movaps xmm1, xmm0
shufps xmm1, xmm0, 170
mulps xmm1, xmmword, ptr, [rsi, +, 32]
addps xmm1, xmm2
shufps xmm0, xmm0, 255
mulps xmm0, xmmword, ptr, [rsi, +, 48]
addps xmm0, xmm1
movaps xmmword, ptr, [rdi], xmm0
ret
```
It’s a pretty convenient way of viewing assembly, with the catch that you can’t
see `#[inline]` functions.
### Profiling benchmarks
You can run individual benchmarks through a profiler to get a better
understanding of their performance. When you run `cargo bench` it prints the
path to each benchmark executable that it is running. You can pass the same
parameters to this executable that you pass to `cargo bench`.
For example, if I wanted to profile the `glam` `Mat4` `inverse` method I could
run this command via my preferred profiler:

```
target/release/deps/mat4bench-e528128e2a9ccbc9 "mat4 inverse/glam"
```
There will be a bit of noise from the benchmarking code, but it’s usually not
too hard to work out what is what. If your code is being inlined then it will
probably appear inside `criterion::Bencher::iter`.
## What next for glam
I consider `glam` to be a minimum viable math library right now, so I’ll slowly
add more functionality over time. That will depend a bit on whether other
people adopt it, and if so, what features they are missing.
The main outstanding thing for me right now is documenting what is there.
There are of course more optimisations that could be done.
I’m interested in using the `packed_simd` crate instead of using SSE2 directly,
as `packed_simd` supports multiple architectures out of the box. However,
`packed_simd` currently requires nightly and I’m not sure what the development
status of the crate is right now.
## Inspirations
There were many inspirations for the interface and internals of `glam` from the
Rust and C++ worlds. In no particular order:
- How to write a maths library in 2016 via Aras Pranckevičius’s pathtracer blog series
- Realtime Math - header only C++11 with SSE and NEON SIMD intrinsic support
- DirectXMath - header only SIMD C++ linear algebra library for use in games and graphics apps
- DirectX Tool Kit SimpleMath - simplified C++ wrapper around DirectXMath
- GLM - header only C++ library based on the GLSL specification
- `packed_simd` - mentioned above, I like the interface
- `cgmath` - a Rust linear algebra library for games and graphics
- `nalgebra` - a general purpose Rust linear algebra library
## The Rust ecosystem is awesome
Writing Rust is purely recreational for me. My day job for the last 14 years has been writing C++ game engine and gameplay code.
It boggles my mind how much great tooling I have at my fingertips with Rust that either comes with the Rust installer or is easily accessible. Things like:
- A common build system via `cargo build`
- A common packaging system via crates.io and `cargo update`
- Built in code formatting via `cargo fmt`
- Built in linting via `cargo clippy`
- Built in testing framework via `cargo test`
- High quality third party micro-benchmarking - `criterion` and `cargo bench`
- Code coverage - `tarpaulin` and `coveralls.io`
- Continuous integration - `travis-ci`
Setting up and maintaining this kind of infrastructure for a C++ project is a huge amount of work, especially for something I’m just doing in my evenings.
Rust is pretty new and there are certainly a lot of gaps in the tooling compared to established languages like C++, but the out of the box experience is far far superior.
## glam without SSE2
`glam`’s performance is roughly on par with `cgmath` if SSE2 is disabled using
the `scalar-math` feature. Some `glam` functions got faster without SIMD.
benchmark | glam | cgmath | nalgebra |
---|---|---|---|
euler 2d | 9.033 us | 8.985 us | 26.62 us |
euler 3d | 28.89 us | 29.63 us | 195.6 us |
mat2 determinant | 1.0632 ns | 1.0544 ns | 1.0545 ns |
mat2 inverse | 2.2956 ns | 2.7002 ns | 2.6893 ns |
mat2 mul mat2 | 3.0813 ns | 3.0256 ns | 3.6432 ns |
mat2 transform vec2 | 2.4044 ns | 2.4022 ns | 6.9133 ns |
mat2 transpose | 1.3312 ns | 1.3319 ns | 1.8201 ns |
mat3 determinant | 2.5810 ns | 2.5352 ns | 2.5799 ns |
mat3 inverse | 10.5985 ns | 8.0039 ns | 9.3055 ns |
mat3 mul mat3 | 9.5327 ns | 9.4730 ns | 7.9314 ns |
mat3 transform vec3 | 4.0703 ns | 4.0665 ns | 8.0555 ns |
mat3 transpose | 3.5448 ns | 3.5228 ns | 8.9625 ns |
mat4 determinant | 7.7088 ns | 12.5664 ns | 54.2749 ns |
mat4 inverse | 31.3339 ns | 43.7255 ns | 55.6187 ns |
mat4 mul mat4 | 9.2925 ns | 9.3097 ns | 15.6004 ns |
mat4 transform vec4 | 3.5887 ns | 3.6127 ns | 4.1613 ns |
mat4 transpose | 7.6722 ns | 8.6266 ns | 10.7375 ns |
quat conjugate | 1.7617 ns | 1.7622 ns | 1.7684 ns |
quat mul quat | 5.1730 ns | 5.4205 ns | 5.5644 ns |
quat transform vec3 | 7.1421 ns | 6.6010 ns | 7.0346 ns |
vec3 cross | 2.8412 ns | 2.8485 ns | 3.2746 ns |
vec3 dot | 1.6708 ns | 1.5822 ns | 1.6566 ns |
vec3 length | 2.0307 ns | 2.0315 ns | 84.6475 ns |
vec3 normalize | 4.1255 ns | 4.1301 ns | 85.3411 ns |