Anukari on the CPU (part 1: GPU issues)
Captain's Log: Stardate 79314.6
It’s been a while since I posted a devlog entry, but it’s not for a lack of things to post about. In fact quite the opposite: I’ve been so heads-down in a complete rewrite of the physics simulation code that I haven’t had time to write a post. With the 0.9.21 release out the door and looking good, I’m excited to finally write about what I’ve been doing.
TL;DR: Anukari now runs on the CPU instead of the GPU, using hand-coded platform-specific SIMD instructions, performing much better and solving all of the compatibility headaches that come with doing audio on the GPU.
This may come as a surprise to readers who have followed the Anukari GPU saga. Trust me, it comes as a bigger surprise to me. In this multi-part devlog entry I will explain why Anukari used the GPU in the first place, why I did not think it was possible to run it on the CPU, why I was wrong about that, and how the new CPU implementation works.
The GPU origin story
When I began working on Anukari, I was not at all sure that the kind of physics simulation I wanted to build could run in real time. I committed to using a “fail fast” design strategy, where I would start with the problems I considered most difficult/uncertain, and try to prove I could solve them before wasting effort on anything else.
In the first month of the project, my top two goals were to prove that (1) I could make the simulation run in real time, and (2) the simulation could produce interesting sounds. If either of these points proved false, I would have moved on to my next project idea.
The first thing I did was hack together the simplest 1D spring/mass simulation possible, in C++ code running on the CPU. Within a day it was producing interesting sounds in real time with a small number of masses. I then expanded this into the hackiest possible 3D simulation, and again, found the sounds interesting enough that I felt it was worth continuing.
At this time, I was able to simulate something like 10 point masses in 3D in real time. The sounds were interesting, but I had done some offline rendering experiments at one point with thousands of point masses, and really wanted to achieve that. I set the goal of simulating 10,000 point masses in real time, which in 3D means that I’d want something like 50,000 springs.
At this point my simulation code was not at all optimized. It was just the simplest thing I could hack together, with no thought towards speed. But I did some napkin math, and even assuming that I could achieve a 100x speedup by optimizing this code, and maybe parallelizing it across threads, it still was not realistic to simulate 50,000 springs. There was just no way.
Now, at this point I already had a background in GPU programming. I’ve written many small 3D game engines, and also at one point I wrote a GPU-based SHA-1 cracking utility as part of a hackathon to prove that my company’s secret salt was not complex enough. So naturally it occurred to me that the GPU might have the FPU bandwidth I needed.
And indeed, doing some napkin math on the FLOPS provided by a reasonable GPU, the numbers worked out. It seemed possible.
Given that I had done GPU work before, it did not take long for me to hack together a simple implementation of the physics simulation using OpenCL. Immediately I was able to simulate several hundred physics objects in real time, and there seemed to be a path to simulating much larger systems.
At this point, I went ahead and invested all my effort in the GPU simulation and never looked back. Until a couple months ago.
A surprising discovery
Over time, I ported the OpenCL simulator to CUDA and then Metal. I found many optimizations along the way, and came to know the various GPU APIs quite intimately. I learned a huge amount about how the hardware actually works, and took the best advantage of it that I could to get the simulation running extremely fast.
The physics simulation code was 95% shared between the GPU APIs. Their common denominator is C, and so the GPU code is in straight C (not C++). I used some awful macro sorcery to get the same C code compiling as OpenCL C, CUDA C++, and Metal C++.
Because the simulation was just C, I always had the idea that eventually I could write a naive CPU backend that shared the same C code, so that users without a good GPU could at least run tiny instruments in a pinch. But the GPU C code was highly optimized for GPU-style parallelism, and depended a lot on threadgroup memory for fast inter-thread communication, etc., so I figured the naive CPU backend would perform really poorly.
A while back a user asked me why Anukari was glitching when he ran it at the same time as his VJ visualization software. I looked into it, and the VJ software was (reasonably) assuming that it owned the GPU, and Anukari’s GPU physics code was getting starved. I explained this to the user, and he understood, but I was not very satisfied with this situation.
I thought, hey, maybe it’s time to slam out the naive CPU backend? For a user like this it might be better to be able to use Anukari for very simple instruments, even if bigger instruments are not possible. So I spent a week or so hacking together the stupidest CPU backend possible, calling directly into the GPU C code.
I was extraordinarily surprised by how fast the CPU backend ran. It wasn’t nearly as fast as the GPU, but it was only 5x slower, whereas I was expecting it to be 100x slower. At this point I began to realize that something in my assumptions was really badly broken, and pivoted all my effort to understanding what was going on.
I profiled the CPU backend and immediately realized that there was a ton of room for optimization. The hottest loops were not using CPU SIMD instructions at all, which seemed like it might be a huge opportunity. I began to wonder if a CPU implementation might actually be able to perform as well or possibly even better than the GPU implementation.
Uneven GPU sharding
At this point it’s worth asking how the CPU implementation could possibly be competitive with the GPU version. In terms of raw 32-bit FLOPS, the GPU is the clear winner, and we’ve established that Anukari is FPU-bound. So what gives?
Fundamentally the problem is that Anukari’s physics simulation cannot utilize the full power of the GPU, due to the synchronization required for the physics data dependencies.
Anukari’s simulation is a simple numeric integration in which the time step is the duration of one audio sample. At each step, the net force (and other physics parameters) is computed for each object, and is applied to determine the acceleration, velocity, and next position of the object.
Since physics objects can affect one another, via springs or other connections, it is not possible to compute step N + 1 for an object until step N has been fully computed for all connected objects. For example, to calculate spring forces for a given mass, we need to know the current positions of any masses that are connected to it.
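To make that dependency concrete, here is a minimal C++ sketch of a single integration step, under the assumption of simple Hooke's-law springs and semi-implicit Euler integration. The names (Vec3, Mass, Spring, StepOnce) are illustrative only and not Anukari's actual code, and real springs would also include damping, but the shape is the same: accumulate forces from the current positions of connected objects, then integrate acceleration into velocity and position.

```cpp
#include <cmath>
#include <vector>

// Illustrative types only; Anukari's real data layout is different.
struct Vec3 { float x, y, z; };

static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(float s, Vec3 v) { return {s * v.x, s * v.y, s * v.z}; }

struct Mass {
  Vec3 position{};
  Vec3 velocity{};
  float inverseMass = 1.0f;
};

struct Spring {
  int a, b;           // Indices of the two connected masses.
  float restLength;
  float stiffness;
};

// One integration step, where dt is the duration of a single audio sample
// (e.g. 1/48000 s). Forces for step N+1 depend on positions from step N,
// which is exactly the data dependency described above.
void StepOnce(std::vector<Mass>& masses, const std::vector<Spring>& springs, float dt) {
  std::vector<Vec3> force(masses.size(), Vec3{});

  // Accumulate spring forces (Hooke's law) from the *current* positions.
  for (const Spring& s : springs) {
    Vec3 d = masses[s.b].position - masses[s.a].position;
    float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    if (len == 0.0f) continue;
    float magnitude = s.stiffness * (len - s.restLength);
    Vec3 f = (magnitude / len) * d;
    force[s.a] = force[s.a] + f;
    force[s.b] = force[s.b] - f;
  }

  // Semi-implicit Euler: acceleration -> velocity -> position.
  for (size_t i = 0; i < masses.size(); ++i) {
    Vec3 accel = masses[i].inverseMass * force[i];
    masses[i].velocity = masses[i].velocity + dt * accel;
    masses[i].position = masses[i].position + dt * masses[i].velocity;
  }
}
```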
The way Anukari’s GPU code handles this is as follows:
- Each physics object is mapped 1:1 to a thread on the GPU.
- Shared state (such as position) is stored in threadgroup memory, equivalent to L1 cache.
- Each thread computes step N for its object and then waits for a threadgroup barrier before proceeding to step N + 1, guaranteeing that all other threads have been stepped forward.
This design takes advantage of some incredibly useful GPU features. The threadgroup memory is basically a manually-mapped L1 cache, and provides incredibly fast access to shared state among all threads in the threadgroup. And the threadgroup barrier is an extremely efficient way to synchronize each integration step.
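As an illustration of that lockstep structure, here is a rough C++ analogy using std::barrier, with one CPU thread standing in for each GPU thread. The actual code runs as GPU kernels and uses the threadgroup barrier primitives of OpenCL, CUDA, and Metal rather than anything like this; StepObject is a hypothetical placeholder for the per-object integration.

```cpp
#include <barrier>
#include <thread>
#include <vector>

// CPU-thread analogy of the GPU design: one "thread" per physics object,
// with a barrier after every integration step so no object can read a
// neighbor's position from the wrong step. On the GPU the equivalent is a
// threadgroup barrier (barrier() in OpenCL, __syncthreads() in CUDA,
// threadgroup_barrier() in Metal).
void SimulateLockstep(int numObjects, int numSteps,
                      void (*StepObject)(int object, int step)) {
  std::barrier sync(numObjects);
  std::vector<std::jthread> threads;
  for (int object = 0; object < numObjects; ++object) {
    threads.emplace_back([&, object] {
      for (int step = 0; step < numSteps; ++step) {
        StepObject(object, step);  // Compute step N for this object...
        sync.arrive_and_wait();    // ...then wait for every other object.
      }
    });
  }
  // jthread joins on destruction, so all objects finish before returning.
}
```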
In case you’re wondering what happens if the threadgroup barrier is omitted, and threads are allowed to progress more than 1 physics step beyond their peers, here’s a video of the simulation with the barrier removed:
Having physics objects 1:1 with threads does impose a limitation on the size of the instrument that can be simulated, because GPU threadgroups have a hardware-limited number of threads (commonly 1,024 threads on recent hardware). Anukari runs up to 16 independent voices (i.e. multiple MIDI notes) in separate threadgroups because they don’t have physics interactions. This allows a total of 16,384 objects to be simulated in real time, but still only in physics-connected groups of 1,024 objects.
Given that these threadgroup features are so powerful, what’s the problem? The answer lies in uneven sharding.
In distributed systems design, it is often the case that you need to divide N work items among M machines to compute in parallel. For “embarrassingly parallel” problems, you can choose an owner machine for a work item by hashing the work item’s ID and taking the result modulo M. Any good hash function has a uniform distribution, so this assigns work items to machines uniformly. This diagram shows the runtime of several machines that have been assigned roughly equal amounts of work:
Or does it? If the work item IDs are unique, this works great. But what if the work item IDs contain duplicates? The duplicates will all have the same hash, and thus will be handled by the same machine. In web engineering, a classic example is processing web pages. Imagine that you use the domain as the ID field for hashing. Some domains may have millions of web pages, so if you hash on the domain, work will no longer be assigned equally to machines. It’s easily possible that the machine that receives the largest domain will have 10x as much work as any other machine. Now the overall wallclock runtime of the job is huge, because you’re stuck waiting for one machine to do a bunch of work alone while all the other machines are finished.
This diagram shows the runtime of several machines, where one has been assigned much more work than the others:
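For concreteness, here is a small, hypothetical sketch of that hash-modulo assignment. With unique IDs the per-shard counts come out roughly equal; with heavily duplicated IDs (such as hashing web pages by their domain), one shard absorbs all the duplicates and becomes the straggler that determines the job's wallclock time.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Assign each item to a shard by hashing its ID and taking it modulo the
// shard count, then report how many items each shard received.
std::vector<size_t> CountItemsPerShard(const std::vector<std::string>& itemIds,
                                       size_t numShards) {
  std::vector<size_t> load(numShards, 0);
  for (const std::string& id : itemIds) {
    size_t shard = std::hash<std::string>{}(id) % numShards;
    ++load[shard];  // Duplicate IDs always land on the same shard.
  }
  return load;
}
```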
Coming back to Anukari, since physics objects map 1:1 with threads, we do not have this particular kind of uneven sharding. But there’s a different kind, which is that the computation required to process different kinds of physics objects is not uniform.
As mentioned above, a threadgroup barrier is used to ensure that threads step forward in lockstep. This means that the wallclock duration of computing a thread is the time it takes to compute the integration for the slowest physics object.
In Anukari, the computation required for an “Anchor” is just collecting modulation signals from any attached modulators that affect its position, and interpolating a new position.
But a “Mic” object is far more involved. It has to collect modulation signals as well as information about connected bodies. Its rotation and gain can be modulated, which require slow sin/cos/log instructions. It has several other features to calculate, including a compressor, and even a sqrt for the isotropy feature.
So computationally, a Mic can easily be 10x as much work as an Anchor. Thus at each physics step, the Anchor threads spend 90% of their time idle, blocking on the threadgroup barrier while the Mic threads finish the step.
The diagram below shows the timeline for several threads, where one thread has much more work than the others. At each time step, most of the threads finish early and wait idle until the slowest thread is ready to begin the next cycle:
The problem is compounded by objects that need to collect signals from many other objects. For example, if a Mic is connected to 100 Body objects, it has to look up the position of all those 100 Bodies to accumulate the final audio sample. This Mic will therefore delay each step even further.
This is why Anukari can’t take advantage of the full FPU bandwidth of the GPU. It is limited to using a single thread per physics object, and due to the unequal sharding, most of those threads spend a lot of their time idle. Idle threads equate to unused FPU cycles.
Uneven sharding solutions?
There are ways to partially address Anukari’s uneven sharding on the GPU.
First, the computation for a physics object could be pipelined. So for a Mic, we could split the computation into three steps that run in parallel: 1. collect signals from connected bodies, 2. apply transformations, and 3. apply compression. The computations for a single Mic would run in 3 threads, each reading the output from the previous thread at each physics step. This would introduce a small amount of delay (2 audio samples) in the output signal, but this is not a big problem. It would reduce the unequal sharding problem to the duration of the longest of the 3 pipeline steps.
However, there are practical problems here. By having each Mic map to 3 threads, the total number of physics objects that Anukari can simulate is reduced (since the total number of threads is limited). Obviously it adds significant complexity to the implementation. And the collection step may still be very long if many objects are connected to the Mic.
The issue with a heavily connected object is difficult. The first idea I had was to rewrite the simulation as two alternating passes: 1. a pass to collect signals from connections, and 2. the per-object physics pass. The idea is that the first pass would spread connections uniformly over threads, so no thread would hotspot.
However, this does not work, because the signals collected from connected objects need to be accumulated, and the accumulation requires synchronization. For example, for an object connected to 100 other objects with springs, the net force must be accumulated. So if the springs were assigned to separate threads, those threads would have to synchronize read-update-writes to a shared force accumulator. This would require atomics, which are not really possible for float3 vectors (and even if you pull it off, it serializes part of the work anyway).
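As a rough illustration of why this is painful, here is a hypothetical C++ sketch of what a shared force accumulator would have to look like if springs were spread across threads (the real constraint is on the GPU, where the situation is no better):

```cpp
#include <atomic>

// Each component can be updated atomically on its own (fetch_add on
// std::atomic<float> requires C++20), but the three updates together are not
// one atomic read-update-write of the whole float3, and every spring thread
// now contends on the same accumulator, serializing the very work we were
// trying to spread across threads.
struct AtomicVec3 {
  std::atomic<float> x{0.0f}, y{0.0f}, z{0.0f};

  void Add(float dx, float dy, float dz) {
    x.fetch_add(dx);
    y.fetch_add(dy);
    z.fetch_add(dz);
  }
};
```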
There are other possibilities for Anukari to make better use of the GPU’s resources, but they get increasingly complex and difficult.
How the CPU competes with the GPU
Ultimately we’ve seen that the issue with Anukari’s GPU solution is that most of its threads sit idle most of the time waiting for hotspot threads, wasting precious FPU bandwidth.
This is why the CPU can compete with the GPU: despite the lower total FPU bandwidth of the CPU, Anukari’s simulation algorithm can much more easily utilize that bandwidth fully.
On the CPU, each independent MIDI voice is assigned to a separate physical thread, since separate voices do not have any data dependencies between them. Each thread then independently simulates N audio samples in a loop like “for each audio sample, for each object, step forward.” No synchronization is required because the CPU runs one integration step for all objects before moving on to the next step, ensuring lockstep simply by the lack of concurrency.
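Here is a minimal C++ sketch of that scheduling. Voice and StepObject are hypothetical stand-ins for Anukari's real types, and the actual implementation also uses hand-written SIMD, which this omits; the point is just the loop structure: one thread per voice, and within a voice, every object finishes step N before any object begins step N + 1.

```cpp
#include <thread>
#include <vector>

// Hypothetical per-voice state; the real struct holds the physics objects.
struct Voice {
  int numObjects = 0;
};

// Placeholder for one object's integration step (see the earlier sketch).
void StepObject(Voice& voice, int object) { (void)voice; (void)object; }

// One thread per MIDI voice; within a voice, samples and objects are stepped
// strictly in order, so lockstep comes for free from the lack of concurrency.
void SimulateBlock(std::vector<Voice>& voices, int numSamples) {
  std::vector<std::jthread> threads;
  for (Voice& voice : voices) {
    threads.emplace_back([&voice, numSamples] {
      for (int sample = 0; sample < numSamples; ++sample) {
        for (int object = 0; object < voice.numObjects; ++object) {
          StepObject(voice, object);  // Step N for every object before N + 1.
        }
      }
    });
  }
  // jthread joins on destruction, so all voices finish before returning.
}
```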
Thus on the CPU there is never any time when a thread goes idle, and the FPU can be fully utilized. Of course this is still subject to whether we can actually read data from memory fast enough to saturate the FPU, which I’ll talk about in part 2 of this post.
Later in part 3 of this post, I will reflect on whether I could have avoided putting so much effort into the GPU implementation before finding the better CPU solution.