Anukari on the CPU (part 2: CPU optimization)
Captain's Log: Stardate 79317.7
In part 1 of this series of posts, I mentioned that I was surprised to find that naively running my GPU code on the CPU was only 5x slower, when I thought it would be 100x slower. In this post I will explain how I ended up making the CPU implementation much faster than on the GPU.
First approach: spot-vectorization
As mentioned in part 1, I got the original GPU code compiled for the CPU, and then wrote a simple driver to call into this code and run the simulation (in lieu of the code that set up and invoked the GPU kernel).
As you might imagine, Anukari, being a 3D physics simulation, does a lot of arithmetic on float3 vectors of the form {x, y, z}. In other words, vectors of three 32-bit floats. So the first optimization I did was the simplest and most naive thing I could think of, which was to implement all of the float3 operations using SIMD intrinsics. I knew this wouldn’t be optimal, but figured it would give me a sense for whether it was worth investing more work to design a CPU-specific solution.
Note that most of the time, float3 vectors are padded and aligned in memory as if they were float4, in other words to 16-byte boundaries. So really you’re working with vectors like {x, y, z, w}, even though the w component is not actually used.
For this experiment I used the 128-bit SIMD instructions offered by SSE on x86_64 processors and NEON on arm64 processors. Because Anukari’s float3 vectors are really float4 vectors with an ignored w component, it’s extremely simple to implement basic arithmetic operations using SSE/NEON. In both cases, there’s an instruction to load the float4 into a SIMD register, an instruction to do the arithmetic operation (such as add), and then an instruction to store the float4 register back into memory.
Thus, the Float3Add() function might look like this using SSE:
__m128 p1 = _mm_load_ps(&position1);
__m128 p2 = _mm_load_ps(&position2);
__m128 d = _mm_add_ps(p2, p1);
_mm_store_ps(&delta, d);
If you squint, the NEON code is nearly identical:
float32x4_t p1 = vld1q_f32(&position1);
float32x4_t p2 = vld1q_f32(&position2);
float32x4_t d = vaddq_f32(p2, p1);
vst1q_f32(&delta, d);
Along these lines I implemented all the basic float3 operations. Some operations were slightly more complex, such as the dot product, normalization, etc., but this code was all very simple and easy to write, because it dropped straight in to replace the GPU code that operated on float3 vectors in the same way.
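For example, a float3 dot product in this style can be written roughly like the following. This is a sketch rather than Anukari’s actual code, and it assumes SSE3 and vectors padded to {x, y, z, 0}:
#include <immintrin.h>
// Multiply lane-wise, then horizontally sum. Because both w lanes are zero,
// summing all four lanes yields the 3D dot product.
float Float3Dot(const float* a, const float* b) {
  __m128 prod = _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b));  // {xa*xb, ya*yb, za*zb, 0}
  __m128 shuf = _mm_movehdup_ps(prod);                       // {ya*yb, ya*yb, 0, 0}
  __m128 sums = _mm_add_ps(prod, shuf);                      // lane 0 = xa*xb + ya*yb
  shuf = _mm_movehl_ps(shuf, sums);                          // lane 0 = za*zb + 0
  sums = _mm_add_ss(sums, shuf);                             // lane 0 = full dot product
  return _mm_cvtss_f32(sums);
}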
This was only an afternoon of work, and already I was seeing big speed improvements in the CPU implementation. It was still not as good as the GPU version, but at this point I was convinced that it would be worth redesigning the physics simulation from scratch to take full advantage of CPU SIMD instructions.
Second approach: compiler auto-vectorization
Now that I’ve explained why I decided it was worth redesigning the physics simulation in a CPU-specific way, I should say a few words about why the way the code was written for the GPU is not optimal for the CPU. There are a multitude of reasons, which I’ll lump broadly into two categories.
1. Memory access patterns
The GPU code is heavily-optimized to take advantage of the unique memory situation on the GPU.
Anukari’s GPU design relies on the fact that each threadgroup has a huge number of registers. On a typical modern NVIDIA card, a threadgroup has 65,536 32-bit registers. Compared to a CPU, this is an unfathomably large number. There are some limitations, such as the fact that a single thread can access a maximum of 256 registers, but since Anukari runs up to 1,024 threads, this limit is not relevant (since 65,536 / 1,024 = 64 registers per thread).
Note the very surprising fact that each NVIDIA threadgroup has more register memory (65,536 × 4 bytes = 256 KB) than a typical modern CPU core has L1 cache! This calls for a very different design. Registers are the fastest memory possible, and so Anukari crams as much thread-private information into them as it can.
Another thing Anukari’s GPU design relies on is that all the threads in a thread group have extremely fast access to shared thread group memory, which is essentially manually-mapped L1 cache. Anukari uses this for fast inter-thread communication of physics parameters. For example, when calculating spring forces for a given mass, the position of each neighboring mass is looked up from this shared memory. These lookups are random-access so it’s very important that they are served straight from L1 without any possibility of going to main memory.
On the CPU, Anukari can’t rely on a gigantic register space or ultra-fast shared memory. The design will have to be centered around a new memory layout that is friendly for the CPU cache design.
Note that especially when using SIMD instructions, caching becomes incredibly important, because the SIMD calculation throughput is generally higher than main memory throughput. So the only way to saturate the FPU is to ensure the memory layout allows efficient caching (and prefetching).
2. SIMD-unfriendly control flow
When I was first optimizing Anukari’s GPU code, it was a bit of a surprise to me to find out that for the most part, each GPU thread is a scalar processor. Unlike a CPU, there are no special instructions for operating on vectors, because the parallelism comes from zillions of threads instead. There are some exceptions to this, like Apple Silicon hardware being able to do vector memory loads, but if you look for example at CUDA you’ll find that there’s no Float3Add() method in the APIs. You just have to write it yourself with scalar code.
In many ways this makes the bottom-level GPU code easier to write for highly-parallel problems, because you can just write simple scalar code. You don’t have to worry about the parallelism because that happens at a higher level (threads). Other aspects are more difficult on the GPU, but this part is nice.
But this kind of code layout does not map well onto the CPU, where parallelism comes from SIMD instructions that operate on multiple floats at one time (vectors). On the CPU, you have to think much more explicitly about how to lay out the data so that it can be operated on in SIMD fashion.
For example, take this simple operation:
for (int i = 0; i < m; ++i) {
float x = ComputeX(i);
float y = ComputeY(i);
result[i] = x * y;
}
On the GPU, this code might well be fine. Each thread is basically a scalar machine, so the scalar multiply of x and y is efficient.
But on the CPU, this code is awful. Each multiply will map to a single scalar instruction, which fails to take advantage of the fact that the CPU is capable of 4, 8, or even 16 multiplications via a single instruction (depending on the CPU). Storing the result of the multiplication similarly could be made more efficient with SIMD instructions, but is not.
Auto-vectorization
One way to restructure the above GPU code to take better advantage of the CPU’s resources is as follows:
float xs[m];
float ys[m];
for (int i = 0; i < m; ++i) {
xs[i] = ComputeX(i);
ys[i] = ComputeY(i);
}
for (int i = 0; i < m; ++i) {
result[i] = xs[i] * ys[i];
}
This looks a bit silly at first. We’re doing two passes over the loop instead of one, and we have to store data in two intermediate arrays.
However, there is magic that can happen in the second loop: the compiler can automatically output SIMD instructions instead of scalar instructions. The compiler sees that all three arrays involved are contiguous blocks of floats, and thus they can be loaded as vectors (of whatever width the targeted instruction set handles), operated on as vectors, and stored as vectors. For a machine with a SIMD width of 8, this loop will process 8 floats per iteration, which is potentially a gigantic speedup.
There are many caveats here. For a full 8x speedup, the data being operated on definitely needs to fit into L1 cache, or else the loop will be memory-bound and while it still may be faster than scalar instructions, it won’t be 8x faster. As a consolation prize, though, if the memory is not already cached, at least it is contiguous, so the processor’s automatic prefetching will be effective.
In many cases, especially with small datasets, writing the code this way can allow the compiler to automatically output efficient platform-specific instructions while keeping the code portable and readable.
At this stage I did a complete rewrite of Anukari’s physics engine, restructuring the data structures so that they were laid out in memory in a way that made auto-vectorized loops easy to write. I encapsulated all the basic math operation loops in functions so that the code was pretty readable:
float deltas[m];
Subtract(m, positions1, positions2, deltas);
float lengths[m];
Length(m, deltas, lengths);
And so on. This was a pretty huge overhaul, since the code had to be structured very differently than on the GPU.
But it was worth it. At this point, the speed of the physics simulation on the CPU was very similar to the GPU. For some Anukari presets it was even faster.
Third approach: manual vectorization via intrinsics
Compiler auto-vectorization is really nice. The fact that it lets you write portable code without worrying about the specifics of the platform you’re targeting feels pretty magical. The compiler automatically uses the largest SIMD width available for the platform, and also deals with things like emitting a bit of scalar code at the end of each vector loop to handle “remainder” elements when the array length is not an exact multiple of the SIMD width.
However, there are limitations to what compilers can do. They may feel magical, but they are not.
I found that the compilers I’m using (MSVC, Apple Clang) both had extremely strict limitations on what kinds of loops they can vectorize. The limitations differ between compilers, but in general I’d describe them as extraordinarily conservative.
The problem compiler authors have is that the visible behavior of the instructions they generate must adhere to the language spec at all times. So they have to detect cases where there’s any possibility that vectorization might produce observably different results than scalar code, in which case they cannot vectorize.
This means that to write code that the compiler can auto-vectorize, you have to be extremely deliberate. You have to learn about the compiler’s requirements in detail, and then have to structure the code in a very particular way to ensure that the compiler has nothing to be paranoid about.
The MSVC compiler is really nice here, because you can ask it to report a reason code for every loop that it does not vectorize. But there are more than 35 reason codes it might emit for why a loop was not vectorized. It’s extremely cautious! Here’s one example of a loop that MSVC cannot vectorize, taken from their documentation:
// Code 501 is emitted if the compiler cannot discern the
// induction variable of this loop. In this case, when it checks
// the upper bound of 'i', the compiler cannot prove that the
// function call "bound()" returns the same value each time.
// Also, the compiler cannot prove that the call to "bound()"
// does not modify the values of array A.
for (int i=0; i<bound(); ++i) {
A[i] = A[i] + 1;
}
This leads to situations where as the programmer, you can see that a loop is trivially safe to vectorize, but the compiler cannot prove to itself that it can do so safely and thus emits scalar instructions.
At first I jumped through all the hoops to structure my code in a way where the compiler could auto-vectorize it. All pointers used inside loops were stored as locals to prove that they were not modified concurrently by another thread. All pointers were marked as __restrict to guarantee that they were not aliased. Loops were marked with compiler-specific pragmas to inform the compiler that there were no data dependencies to worry about. And so on.
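Concretely, the hand-holding looked something like this. This is only a sketch; the exact keywords and pragmas differ between MSVC and Clang:
// __restrict promises the compiler that the arrays do not alias, and the
// pragmas promise that there are no loop-carried dependencies to worry about.
void Scale(int m, const float* __restrict xs, float* __restrict results, float k) {
#if defined(_MSC_VER)
#pragma loop(ivdep)
#elif defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#endif
  for (int i = 0; i < m; ++i) {
    results[i] = xs[i] * k;
  }
}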
But in the end, I found that despite jumping through all the hoops, I still had to look at the assembly output every time I compiled my code. Both compilers have options to produce warnings about non-vectorizable loops, but I did not find these completely reliable. Even when they worked, sometimes I found that the way the compiler vectorized a loop was weirdly inefficient.
Ultimately I came to the conclusion that if I was going to have to look at the compiler’s assembly output after every compile, on both of my target platforms (x86_64 and arm64), then I may as well just give up on compiler auto-vectorization and write platform-specific instructions directly using intrinsics.
Initially I was resistant to this idea, but very quickly I found that it was much simpler to explicitly tell the compiler what instructions to use than to hope that it implicitly figured out what I wanted. Knowing what I know now, I will never attempt to rely on auto-vectorization again for a problem that requires SIMD instructions. It is easier to just do it yourself.
So I went through my SIMD code, and for example rewrote functions that were originally like this (leaving out a lot of the details):
void Subtract(int m, float* xs, float* ys, float* results) {
// m is the number of padded float3 vectors, so each array holds m * 4 floats.
for (int i = 0; i < m * 4; ++i) {
results[i] = xs[i] - ys[i];
}
}
To look like this instead:
void Subtract(int m, float* xs, float* ys, float* results) {
#if USE_AVX2
for (int i = 0; i < m * 4; i += 8) {
__m256 x = _mm256_load_ps(&xs[i]);
__m256 y = _mm256_load_ps(&ys[i]);
__m256 r = _mm256_sub_ps(x, y);
_mm256_store_ps(&results[i], r);
}
#elif USE_NEON
for (int i = 0; i < m * 4; i += 4) {
float32x4_t x = vld1q_f32(&xs[i]);
float32x4_t y = vld1q_f32(&ys[i]);
float32x4_t r = vsubq_f32(x, y);
vst1q_f32(&results[i], r);
}
#endif
}
This may look complicated at first glance, but really it’s dirt simple: it’s verbose only because it is so explicit. I have found that it is much easier to get the results I want this way than with auto-vectorization.
By the way, you may have noticed that my hand-written code does not include a scalar remainder loop to handle the case where the length of the input arrays is not an exact multiple of the SIMD width. To keep the code simple, I obviate the need for remainder loops by guaranteeing that all array lengths are rounded up to a multiple of the SIMD width.
This does mean that some extra unused floats may be processed at the end of the loop. But when the arrays are large, this becomes a rounding error.
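The padding rule itself is trivial; here is a sketch, where kSimdWidth would be 8 for AVX2 or 4 for NEON:
// Round an array length up to a multiple of the SIMD width so that full-width
// vector loops never need a scalar remainder pass.
constexpr int kSimdWidth = 8;  // 8 floats per AVX2 register; 4 for NEON
inline int RoundUpToSimdWidth(int n) {
  return ((n + kSimdWidth - 1) / kSimdWidth) * kSimdWidth;
}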
(Note to readers who may try this: if you write a SIMD loop that does a reduction, for example accumulating a sum, be very careful about remainder elements so that you don’t accumulate garbage data! You can guess how I know this.)
At this stage of the physics code rewrite, the CPU simulation’s speed is now uniformly a little faster than on the GPU.
Current approach: bespoke single-pass intrinsics
Despite now having a CPU implementation that worked better than the old GPU version, I was still not satisfied. It was now becoming clear to me that actually I could make the CPU code not just a little faster, but substantially faster than the GPU code.
While the SIMD-vectorized loops above make efficient use of SIMD instructions, and access memory in a linear way that is cache-friendly, there are still some obvious drawbacks.
One problem is that the code ends up iterating over the same data many times, in many loops. If the dataset fits into L1 cache, this is quite fast, but reading L1 is still slower than reading registers, and also we have to consider overhead of the loop instructions themselves. If the dataset is bigger than L1 cache, it’s a performance disaster as things spill to L2 which is much slower.
A deeper problem, though, is that Anukari’s physics simulation has a few places where the data has to come from random access memory fetches rather than beautifully-efficient linear accesses.
For example, while iterating over each spring, we have to fetch the position of the spring’s endpoints from the masses it’s connected to. And later we have to accumulate the spring’s force into memory for each mass, which is also random access. Unfortunately we cannot assume this data is in the L1 cache.
There are tricks here, like sorting the springs by the memory address of their attached masses. But because each spring is attached to two masses, this can only completely linearize the lookups for one end of the springs. It helps, but there is still quite a bit of random access for the other end.
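The sorting trick looks something like this (a sketch with a hypothetical Spring type, not Anukari’s actual data structures):
#include <algorithm>
#include <vector>
struct Spring { int mass_a, mass_b; /* stiffness, rest length, ... */ };
// Sort springs by the index of their first endpoint so the gather for that
// endpoint walks memory (mostly) linearly; the second endpoint stays random.
void SortSpringsForLocality(std::vector<Spring>& springs) {
  std::sort(springs.begin(), springs.end(),
            [](const Spring& lhs, const Spring& rhs) { return lhs.mass_a < rhs.mass_a; });
}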
After a few rounds of profiling and optimization, it was clear that these random accesses were the biggest bottleneck, so I brainstormed ways to improve the situation.
The original code was structured in three parts: 1. a gather pass where random access fetches were stored into a linear array, 2. the processing passes, and 3. a scatter pass where spring forces were written out to random access locations. Both the scatter and gather passes were quite slow.
The trick that ended up helping was to rewrite this code as a single pass that performed the processing in-line with the gather and scatter, so that the processor could pipeline all these instructions and ultimately do all the computation while waiting on the memory fetches and stores.
The original multi-pass algorithm allowed the processor to sit idle with an empty pipeline while waiting for fetches/stores, whereas this new algorithm kept it busy. The pipelining not only parallelized computation with memory access, but also allowed the fetch and store instructions to overlap.
Furthermore, I paid attention to how many CPU registers the loop required, looked at the assembly output to confirm, and managed to write it in such a way that the loop does not spill registers onto the stack. So in some ways Anukari’s spring simulation is theoretically optimal: it reads each required piece of data exactly once from memory, does all the intermediate computations using only registers and only full-width SIMD instructions (zero scalar operations), and writes the result exactly once.
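Structurally, the single-pass version looks something like the following. This is a scalar sketch with made-up names; the real loop is written with full-width SIMD intrinsics as described above:
#include <cmath>
struct Vec3 { float x, y, z; };
struct Spring { int mass_a, mass_b; float stiffness, rest_length; };
// Gather, compute, and scatter all happen inside one loop, so the loads for
// later springs can overlap with the arithmetic for the current one.
void AccumulateSpringForces(int num_springs, const Spring* springs,
                            const Vec3* positions, Vec3* forces) {
  for (int s = 0; s < num_springs; ++s) {
    const Spring& sp = springs[s];
    const Vec3 p1 = positions[sp.mass_a];  // gather (random access)
    const Vec3 p2 = positions[sp.mass_b];  // gather (random access)
    const float dx = p2.x - p1.x, dy = p2.y - p1.y, dz = p2.z - p1.z;
    const float len = std::sqrt(dx * dx + dy * dy + dz * dz);
    const float f = sp.stiffness * (len - sp.rest_length) / len;  // Hooke's law
    forces[sp.mass_a].x += f * dx;  // scatter (random access)
    forces[sp.mass_a].y += f * dy;
    forces[sp.mass_a].z += f * dz;
    forces[sp.mass_b].x -= f * dx;
    forces[sp.mass_b].y -= f * dy;
    forces[sp.mass_b].z -= f * dz;
  }
}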
This technique provided a 3-4x speedup for the spring simulation, which tends to be Anukari’s single hottest loop. Seeing this advantage, I then went on to rewrite the next few hottest loops using the same technique. Even loops that do not perform random memory accesses benefit substantially from less loop overhead and fewer round-trips to L1 cache.
After this round of optimization, the CPU code is now substantially faster than the GPU code in virtually all cases.
Drawbacks
The drawback to this latest way of structuring the SIMD code is that it does make the code significantly more complicated and more difficult to read. Before, the simulation was a series of tidy API calls like Add(), Subtract(), Length(), and so on, and the platform-specific SIMD intrinsics were tucked away in the implementation of that API in compact single-purpose functions.
In the new world, though, the simulation for each type of physics object is completely written inline using platform-specific SIMD intrinsics with no tidy abstraction. In itself this is not so bad, but to squeeze out maximum performance, some of these loops have to be structured in pretty weird ways.
I mentioned above that the hottest loops do all operations at full SIMD width with zero scalar operations. This is the design factor that introduces the most complexity.
The problem is that Anukari often does operations that alternate between 3-dimensional float3 vectors, and scalar values. For example, it might take the length of a float3, which goes from 3D to 1D, and then multiply another float3 by that length, going back from 1D to 3D.
The length operation involves taking the square root of a scalar value, which is very slow, and we definitely want to vectorize. But to vectorize the square root, we need to have enough scalar values to fill a SIMD register.
Using NEON for this example, the SIMD width is 4, meaning a register contains 4 32-bit floats. So if we want to vectorize the square root, we have to unroll the outermost loop so that on each iteration it processes 4 float3s. It takes the dot product of each one with itself to get 4 squared lengths, which are packed into a single SIMD register and then a single square root instruction can be used to compute all 4 square roots at once.
This gets quite messy, and some cleverness is involved to structure things such that the shuffling of 32-bit floats around between SIMD registers can be done using only SIMD instructions (rather than pulling individual SIMD lanes out into float registers and re-packing, using many instructions). But this is all doable, and the results are incredibly worthwhile.
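To make that concrete, here is a rough sketch (not Anukari’s actual code) of how four float3 lengths might be computed at once with NEON on arm64, assuming the AoS {x, y, z, 0} layout. The de-interleaving load does the “shuffling” work entirely in SIMD:
#include <arm_neon.h>
// vld4q_f32 de-interleaves 16 consecutive floats into x, y, z and w registers,
// so four squared lengths can be built up and square-rooted in one instruction.
void Lengths4(const float* vecs, float* lengths) {
  float32x4x4_t v = vld4q_f32(vecs);       // val[0]={x0..x3}, val[1]={y0..y3}, ...
  float32x4_t sq = vmulq_f32(v.val[0], v.val[0]);
  sq = vfmaq_f32(sq, v.val[1], v.val[1]);  // sq += y*y
  sq = vfmaq_f32(sq, v.val[2], v.val[2]);  // sq += z*z
  vst1q_f32(lengths, vsqrtq_f32(sq));      // four square roots at once
}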
Sadly, because there is manual loop unrolling and weird clever shuffling going on, the structure of the code at this point diverges pretty dramatically between AVX2 (with a SIMD width of 8) and NEON (SIMD width of 4). So there are two very independent implementations for the hottest loops.
An aside about float3 handling with SIMD
Earlier in this post, I mentioned that often when handling float3 vectors, they are processed as float4 vectors with an ignored “w” component. This is a bit sad, because it means that 25% of the processor’s SIMD lanes are being wasted.
There is a way to eliminate this inefficiency and to utilize 100% of the SIMD lanes.
The reader may be familiar with the terms Structure of Arrays (SoA) and Array of Structs (AoS). The version of the SIMD code that operates on float3s with an ignored “w” component would be considered AoS, because each float3 is a structure of 4 floats which are contiguous in memory: {x, y, z, w}:
void Subtract_AoS(int m, float* xs, float* ys, float* results) {
for (int i = 0; i < m * 4; i += 8) {
_mm256_store_ps(&results[i],
_mm256_sub_ps(
_mm256_load_ps(&xs[i]),
_mm256_load_ps(&ys[i])));
}
}
But it’s possible to reorganize the data structures in a SoA fashion, in which case we can rewrite the code in a way that does not waste any SIMD lanes:
// Assumes a hypothetical SoA layout, e.g.: struct Float3SoA { float* x; float* y; float* z; };
void Subtract_SoA(int m, Float3SoA* a, Float3SoA* b, Float3SoA* result) {
for (int i = 0; i < m; i += 8) {
_mm256_store_ps(&result->x[i],
_mm256_sub_ps(
_mm256_load_ps(&a->x[i]),
_mm256_load_ps(&b->x[i])));
_mm256_store_ps(&result->y[i],
_mm256_sub_ps(
_mm256_load_ps(&a->y[i]),
_mm256_load_ps(&b->y[i])));
_mm256_store_ps(&result->z[i],
_mm256_sub_ps(
_mm256_load_ps(&a->z[i]),
_mm256_load_ps(&b->z[i])));
}
}
This is pretty verbose and repetitive, but notice that the main change is that instead of one array of {x, y, z, w} structs, there are now three arrays, one for each of the x, y, and z values.
There are in fact two big advantages to this approach. First is the one we’ve already mentioned: there is no “w” component being computed, and thus there are no wasted SIMD lanes. The second advantage is that the number of arithmetic operations per loop iteration is tripled, so the relative overhead of the loop instructions themselves (compare and increment) is decreased. (It’s a subtle detail, but notice that the second version’s for loop has i < m instead of i < m * 4.)
Due to these advantages, I originally used the SoA version of this code for Anukari. However, it eventually turned out that in all the cases that mattered, the AoS version was much faster, so I restructured to use AoS instead.
I found this a little surprising, because whatever advantage the AoS version has, it has to be enough to make up for the 25% SIMD lane wastage.
The reason AoS performs better for Anukari comes back to the fact that the slowest parts of the simulation code are where random memory accesses take place. In the SoA memory layout, randomly accessing a float3 actually requires three random scalar memory accesses. Whereas a random access of a float3 in the AoS layout is a single vector access.
Therefore while the computation passes over the AoS data are somewhat slower, the random accesses in other parts of the code are dramatically faster, more than making up for it.
A slightly more subtle consideration has to do with the size of a cache line, which for the processors I care about is always 64 bytes. Both the SoA and AoS approaches do random lookups of data that is smaller than a single cache line. The CPU has to always read a full cache line from memory, which means that it’s reading extra bytes that we do not use, wasting memory bandwidth.
For SoA, we look up three 4-byte values from separate memory locations. This means that the CPU will fetch three 64-byte memory regions, wasting 180 bytes of memory bandwidth.
For AoS, though, we only look up one 16-byte value (the padded float3). The CPU will fetch one 64-byte memory region, wasting only 48 bytes of memory bandwidth.
In each case it’s possible that there are cache hits, depending on the random access pattern, in which case not all of the fetches are wasted. But if the dataset is large enough, or the lookups sparse enough, SoA wastes almost four times as much memory bandwidth as AoS does.
One last little side-side note: for random memory accesses, CPUs offer a feature called prefetching: instructions that tell the CPU about a future memory access that will be made. The idea is that in a random access loop, as you access each item you also tell the CPU about an item you’ll need a few iterations later. In principle this allows the CPU to start fetching that memory before it is needed, concurrently with execution of the current instructions.
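A prefetching loop looks roughly like this. This is a sketch with made-up names; _mm_prefetch is the x86 intrinsic, and the lookahead distance of 8 is arbitrary:
#include <immintrin.h>
// Sketch only: 'mass_index' and 'masses' are illustrative, not Anukari's names.
void ProcessSprings(int count, const int* mass_index, const float* masses) {
  constexpr int kLookahead = 8;  // arbitrary lookahead distance
  for (int i = 0; i < count; ++i) {
    if (i + kLookahead < count) {
      // Hint that we'll soon need the data for a later iteration.
      _mm_prefetch(reinterpret_cast<const char*>(&masses[mass_index[i + kLookahead]]),
                   _MM_HINT_T0);
    }
    // ... do the actual work for iteration i using masses[mass_index[i]] ...
  }
}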
I experimented with prefetching quite a bit, and never found success with it. I’m not completely sure why, but my best guess is that it adds additional instructions, and my loops are already pipelined heavily enough that the cost of executing a few more instructions was not paid back by the prefetching.
Stay tuned
Despite the fact that I’ve probably already written way more than anybody ever wanted to read on this topic, there will be one last installment in this series of posts about Anukari on the CPU, where I will reflect on some of the lessons I learned as part of this process, as well as some other general thoughts on what went well and not so well.
Anukari on the CPU (part 1: GPU issues)
Captain's Log: Stardate 79314.6
It’s been a while since I posted a devlog entry, but it’s not for a lack of things to post about. In fact quite the opposite: I’ve been so heads-down in a complete rewrite of the physics simulation code that I haven’t had time to write a post. With the 0.9.21 release out the door and looking good, I’m excited to finally write about what I’ve been doing.
TL;DR: Anukari now runs on the CPU instead of the GPU, using hand-coded platform-specific SIMD instructions, performing much better and solving all of the compatibility headaches that come with doing audio on the GPU.
This may come as a surprise to readers who have followed the Anukari GPU saga. Trust me, it comes as a bigger surprise to me. In this multi-part devlog entry I will explain why Anukari used the GPU in the first place, why I did not think it was possible to run it on the CPU, why I was wrong about that, and how the new CPU implementation works.
The GPU origin story
When I began working on Anukari, I was not at all sure that the kind of physics simulation I wanted to build could run in real time. I committed to using a “fail fast” design strategy, where I would start with the problems I considered most difficult/uncertain, and try to prove I could solve them before wasting effort on anything else.
In the first month of the project, my top two goals were to prove that (1) I could make the simulation run in real time, and (2) the simulation could produce interesting sounds. If either of these points proved false, I would have moved on to my next project idea.
The first thing I did was hack together the simplest 1D spring/mass simulation possible, in C++ code running on the CPU. Within a day it was producing interesting sounds in real time with a small number of masses. I then expanded this into the hackiest possible 3D simulation, and again, found the sounds interesting enough that I felt it was worth continuing.
At this time, I was able to simulate something like 10 point masses in 3D in real time. The sounds were interesting, but I had done some offline rendering experiments at one point with thousands of point masses, and really wanted to achieve that. I set the goal of simulating 10,000 point masses in real time, which in 3D means that I’d want something like 50,000 springs.
At this point my simulation code was not at all optimized. It was just the simplest thing I could hack together, with no thought towards speed. But I did some napkin math, and even assuming that I could achieve a 100x speedup by optimizing this code, and maybe parallelizing it across threads, it still was not realistic to simulate 50,000 springs. There was just no way.
Now, at this point I already had a background in GPU programming. I’ve written many small 3D game engines, and also at one point I wrote a GPU-based SHA-1 cracking utility as part of a hackathon to prove that my company’s secret salt was not complex enough. So naturally it occurred to me that the GPU might have the FPU bandwidth I needed.
And indeed, doing some napkin math on the FLOPS provided by a reasonable GPU, the numbers worked out. It seemed possible.
Given that I had done GPU work before, it did not take long for me to hack together a simple implementation of the physics simulation using OpenCL. Immediately I was able to simulate several hundred physics objects in real time, and there seemed to be a path to simulating much larger systems.
At this point, I went ahead and invested all my effort in the GPU simulation and never looked back. Until a couple months ago.
A surprising discovery
Over time, I ported the OpenCL simulator to CUDA and then Metal. I found many optimizations along the way, and came to know the various GPU APIs quite intimately. I learned a huge amount about how the hardware actually works, and took the best advantage of it that I could to get the simulation running extremely fast.
The physics simulation code was 95% shared between the GPU APIs. Their common denominator is C, and so the GPU code is in straight C (not C++). I used some awful macro sorcery to get the same C code compiling as OpenCL C, CUDA C++, and Metal C++.
Because the simulation was just C, I always had the idea that eventually I could write a naive CPU backend that shared the same C code, so that users without a good GPU could at least run tiny instruments in a pinch. But the GPU C code was highly-optimized for GPU-style parallelism, and depended a lot on threadgroup memory for fast inter-thread communication, etc, so I figured the naive CPU backend would perform really poorly.
A while back a user asked me why Anukari was glitching when he ran it at the same time as his VJ visualization software. I looked into it, and the VJ software was (reasonably) assuming that it owned the GPU, and Anukari’s GPU physics code was getting starved. I explained this to the user, and he understood, but I was not very satisfied with this situation.
I thought, hey, maybe it’s time to slam out the naive CPU backend? For a user like this it might be better to be able to use Anukari for very simple instruments, even if bigger instruments are not possible. So I spent a week or so hacking together the stupidest CPU backend possible, calling directly into the GPU C code.
I was extraordinarily surprised with how fast the CPU backend ran. It wasn’t nearly as fast as the GPU, but it was only 5x slower, whereas I was expecting it to be 100x slower. At this point I began to realize that something in my assumptions was really badly broken, and pivoted all my effort to understanding what was going on.
I profiled the CPU backend and immediately realized that there was a ton of room for optimization. The hottest loops were not using CPU SIMD instructions at all, which seemed like it might be a huge opportunity. I began to wonder if a CPU implementation might actually be able to perform as well or possibly even better than the GPU implementation.
Uneven GPU sharding
At this point it’s worth asking how the CPU implementation could possibly be competitive with the GPU version. In terms of raw 32-bit FLOPS, the GPU is the clear winner, and we’ve established that Anukari is FPU-bound. So what gives?
Fundamentally the problem is that Anukari’s physics simulation cannot utilize the full power of the GPU, due to the synchronization required for the physics data dependencies.
Anukari’s simulation is a simple numeric integration in which the time step is the duration of one audio sample. At each step, the net force (and other physics parameters) is computed for each object, and is applied to determine the acceleration, velocity, and next position of the object.
Since physics objects can affect one another, via springs or other connections, it is not possible to compute step N + 1 for an object until step N has been fully computed for all connected objects. For example, to calculate spring forces for a given mass, we need to know the current positions of any masses that are connected to it.
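Stripped of every detail, the per-object update is something like the following scalar sketch (a hypothetical Float3 type and a semi-implicit Euler step, not Anukari’s actual code):
struct Float3 { float x, y, z; };  // minimal vector type for this sketch
// One integration step: a = F / m, v += a * dt, x += v * dt, where dt is the
// duration of one audio sample (e.g. 1.0f / 48000.0f).
void StepMass(const Float3& net_force, float inv_mass, float dt,
              Float3* velocity, Float3* position) {
  velocity->x += net_force.x * inv_mass * dt;
  velocity->y += net_force.y * inv_mass * dt;
  velocity->z += net_force.z * inv_mass * dt;
  position->x += velocity->x * dt;
  position->y += velocity->y * dt;
  position->z += velocity->z * dt;
}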
The way Anukari’s GPU code handles this is as follows:
1. Each physics object is mapped 1:1 to a thread on the GPU.
2. Shared state (such as position) is stored in threadgroup memory, which is equivalent to L1 cache.
3. Each thread computes step N for its object and then waits for a threadgroup barrier before proceeding to step N + 1, guaranteeing that all other threads have been stepped forward.
This design takes advantage of some incredibly useful GPU features. The threadgroup memory is basically a manually-mapped L1 cache, and provides incredibly fast access to shared state among all threads in the threadgroup. And the threadgroup barrier is an extremely efficient way to synchronize each integration step.
In case you’re wondering what happens if the threadgroup barrier is omitted, and threads are allowed to progress more than 1 physics step beyond their peers, here’s a video of the simulation with the barrier removed:
Having physics objects 1:1 with threads does impose a limitation on the size of the instrument that can be simulated, because GPU threadgroups have a hardware-limited number of threads (commonly 1,024 threads on recent hardware). Anukari runs up to 16 independent voices (i.e. multiple MIDI notes) in separate threadgroups because they don’t have physics interactions. This allows a total of 16,384 objects to be simulated in real time, but still only in physics-connected groups of 1,024 objects.
Given that these threadgroup features are so powerful, what’s the problem? The answer lies in uneven sharding.
In distributed systems design, it is often the case that you need to divide N work items among M machines to compute in parallel. For “embarrassingly parallel” problems, you can choose an owner machine for a work item by hashing the work item’s ID and taking the modulus M. Any good hash function will have a uniform distribution, so this uniformly assigns work items to machines. This diagram shows the runtime of several machines which have been assigned roughly equal amounts of work:
Or does it? If the work item IDs are unique, this works great. But what if the work item IDs contain duplicates? The duplicates will all have the same hash, and thus will be handled by the same machine. In web engineering, a classic example is processing web pages. Imagine that you use the domain as the ID field for hashing. Some domains may have millions of web pages, so if you hash on the domain, work will no longer be assigned evenly to machines. It’s easily possible that the machine that receives the largest domain ends up with 10x as much work as any other machine. Now the overall wallclock runtime of the job is huge, because you’re stuck waiting for one machine to do a bunch of work alone, even while all the other machines are finished.
This diagram shows the runtime of several machines, where one has been assigned much more work than the others:
Coming back to Anukari, since physics objects map 1:1 with threads, we do not have this particular kind of uneven sharding. But there’s a different kind, which is that the computation required to process different kinds of physics objects is not uniform.
As mentioned above, a threadgroup barrier is used to ensure that threads step forward in lockstep. This means that the wallclock duration of computing a thread is the time it takes to compute the integration for the slowest physics object.
In Anukari, the computation required for an “Anchor” is just collecting modulation signals from any attached modulators that affect its position, and interpolating a new position.
But a “Mic” object is far more involved. It has to collect modulation signals as well as information about connected bodies. Its rotation and gain can be modulated, which require slow sin/cos/log instructions. It has several other features to calculate, including a compressor, and even a sqrt for the isotropy feature.
So computationally a Mic can easily be 10x the work of an Anchor. Thus at each physics step, the Anchor threads spend 90% of their time idle, blocking on the threadgroup barrier while the Mic threads finish the step.
The diagram below shows the timeline for several threads, where one thread has much more work than the others. At each time step, most of the threads finish early and wait idle until the slowest thread is ready to begin the next cycle:
The problem is compounded by objects that need to collect signals from many other objects. For example, if a Mic is connected to 100 Body objects, it has to look up the position of all those 100 Bodies to accumulate the final audio sample. This Mic will therefore delay each step even further.
This is why Anukari can’t take advantage of the full FPU bandwidth of the GPU. It is limited to using a single thread per physics object, and due to the unequal sharding, most of those threads spend a lot of their time idle. Idle threads equate to unused FPU cycles.
Uneven sharding solutions?
There are ways to partially address Anukari’s uneven sharding on the GPU.
First, the computation for a physics object could be pipelined. So for a Mic, we could split the computation into three steps that run in parallel: 1. collect signals from connected bodies, 2. apply transformations, and 3. apply compression. The computations for a single Mic would run in 3 threads, each reading the output from the previous thread at each physics step. This would introduce a small amount of delay (2 audio samples) in the output signal, but this is not a big problem. It would reduce the unequal sharding problem to the duration of the longest of the 3 pipeline steps.
However there are practical problems here. By having each Mic map to 3 threads, the total number of physics objects that Anukari can simulate is reduced (since the total number of threads is limited). Obviously it adds significant complexity to the implementation. And still the collection step may be very long if many objects are connected to the Mic.
The issue with a heavily-connected object is difficult. The first idea I had was to rewrite the simulation as two alternating passes: 1. a pass to collect signals from connections, and 2. the per-object physics pass. The idea is that the first pass would spread connections uniformly over threads, so no thread would hotspot.
However this does not work, because the signals that are collected from connected objects need to be accumulated, and the accumulation requires synchronization. For example, for an object connected to 100 other objects with springs, the net force must be accumulated. So if springs were assigned to separate threads, those threads would have to synchronize read-update-writes to a shared force accumulator. This would require atomics, which is not really possible for float3 vectors (and even if you pull it off, this is serializing part of the work anyway).
There are other possibilities for Anukari to make better use of the GPU’s resources, but they get increasingly complex and difficult.
How the CPU competes with the GPU
Ultimately we’ve seen that the issue with Anukari’s GPU solution is that most of its threads sit idle most of the time waiting for hotspot threads, wasting precious FPU bandwidth.
This is why the CPU can compete with the GPU: despite the lower total FPU bandwidth of the CPU, Anukari’s simulation algorithm can much more easily utilize that bandwidth fully.
On the CPU, each independent MIDI voice is assigned to a separate physical thread, since separate voices do not have any data dependencies between them. Each thread then independently simulates N audio samples in a loop like, “for each audio sample, for each object, step forward.” No synchronization is required because the CPU runs one integration step for all objects before moving on to the next step, ensuring lock-step simply by the lack of concurrency.
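In pseudocode-ish C++, each voice’s thread runs something like this (a sketch; Voice and StepObject are hypothetical placeholders for Anukari’s real types and per-object update):
struct Voice { int num_objects; /* ... per-voice simulation state ... */ };
void StepObject(Voice& voice, int obj);  // placeholder for the per-object integration

void SimulateVoice(Voice& voice, int num_samples) {
  for (int sample = 0; sample < num_samples; ++sample) {
    for (int obj = 0; obj < voice.num_objects; ++obj) {
      StepObject(voice, obj);
    }
    // Every object has now completed this step, so the next step can safely
    // read neighbors' updated state without any synchronization.
  }
}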
Thus on the CPU there is never any time when a thread goes idle, and the FPU can be fully utilized. Of course this is still subject to whether we can actually read data from memory fast enough to saturate the FPU, which I’ll talk about in part 2 of this post.
Later in part 3 of this post, I will reflect on whether I could have avoided putting so much effort into the GPU implementation before finding the better CPU solution.
Finally Anukari has macros, and a preset API
Captain's Log: Stardate 79052
The problem
Anukari has long had a modulation system, with LFOs, host automation controllers, MIDI, etc. But adding modulation to a preset has always been a kind of labor-intensive process. And one big gaping hole in the UX was the lack of a way to bind a knob inside Anukari itself to allow modulation to be controlled via the mouse.
The lack of mouse control was a hassle, but the problems were a bit deeper than that. Because of this issue, interfacing with the DAW was always not quite the way users expected. For example, in FL Studio you can choose an automation via the learn feature by "wiggling" a knob inside a VST plugin. FL Studio watches and sees which knob was wiggled and binds it. But of course with no mouse-controlled knobs inside Anukari, this was not possible.
Furthermore, while it was possible to map host parameters to automations inside Anukari, they could only be controlled via the DAW, which is really inconvenient, and often is a really weird workflow. Users expect to be able to hit "record" in the DAW and then go play the VST, knobs and all, and have everything recorded.
Macros
The solution was to add some knobs inside Anukari that can be mapped to control the modulation system. Those are shown here in the lower right-hand corner:
(The icons and graphics are still provisional while I wait for my designer to improve them.)
There are eight knobs in total (in the screenshot only four are showing; the other four are collapsed). Each knob can be renamed, and corresponds to a mapping that will automatically appear in the DAW. And each knob is connected to any number of 3D Macro objects, which it will control.
This is already really handy, but the killer feature is the little grabby-hand button icon next to each macro knob. When the user drags this, parameters in the right-hand editor panel that can be modulated will automatically be highlighted, and when the user drops onto one of them, a 3D Macro object will be created which is automatically linked to the given parameter on all selected entities. Here's an example:
This is a pretty transformative change. It is dramatically easier to create automations inside Anukari, play with them, and edit them. And then they can be performed with the knob movements recorded in the DAW.
Side benefits
The new macro system addressed a bunch of feedback I repeatedly got from users, and solved a bunch of problems. But in addition to that, there were a number of extra advantages to the new system that came very cheaply.
Having the drag-and-drop system for modulation naturally made it easy to do the same thing for other modulator types. So now, if a user drags in an LFO from the entity palette on the bottom of the screen, they can drag it straight to a highlighted parameter to create an LFO connected to that parameter on the selected entities. This can be done with any modulator and is hugely convenient.
Another big benefit is that now all the built-in automations for all the factory presets are discoverable. Previously with no knobs in the main UI, there was no easy way to see what parameters had been configured for modulation as you cycled through presets. Now you can see them all, and trivially play with them via the mouse. Even better, in the standalone app the 8 macro knobs map to MIDI continuous control parameters 1-8, so on most MIDI controllers you can just turn knobs and things will happen, with visual feedback.
Finally this opens the door for even more interesting drag-and-drop use cases. The first one I have in mind is for creating exciter objects, like Mallets. The idea is that the user will be able to select a bunch of Bodies (masses), and then drag the Mallet object from the palette onto the right-panel (which will be highlighted) and it will automatically create the Mallet and connect it to all the selected Bodies. This will be much more convenient than the workflow today.
Anukari Builder (preset API)
In the Anukari Discord server, the user 312ears is notable for providing extremely helpful feedback about Anukari, obviously borne out of using it in depth. When I first released the Beta, they were one of the people who suggested that it would be cool if it were possible to create presets programmatically, for example via a Python API.
I really wanted to help with this, but for the foreseeable future my time has to be focused on making the plugin itself work better. So I offered to release the Google Protocol Buffer definitions for the preset file format and provide a bit of support on using them, but could not commit to any kind of nice APIs.
Anyway, 312ears took the Protocol Buffer definitions and built an entire Python API for building Anukari presets. Their API can be found here: github.com/312ears/anukaribuilder.
This is an absolutely incredible contribution to the project and community. It allows users who can write Python to create presets that would otherwise be far too tedious to make. On some hardware Anukari supports up to 1,000 physics objects, and arranging them in a complex geometric pattern is difficult with the UI. But with Python all kinds of things become possible. For example, 312ears has shown demos in the Discord server of presets that can shift between two shapes, say a sphere and a pyramid, by turning a MIDI knob. Here's a quick example: