The new warp alignment optimizer
As I mentioned in the previous post, I've been working on writing a better algorithm for optimizing the way that entities are aligned to GPU warps (or as Apple calls them, SIMD-groups). For the sake of conversation, let's assume that each GPU threadgroup is 1024 threads, and those are broken up into 32 warps, each of which has 32 threads. (These happen to be actual numbers from both my NVIDIA chip and my Apple M1.)
Each warp shares an instruction pointer. This is why Apple's name for them makes sense: each SIMD-group is kind of like a 32-data-wide SIMD unit. In practice these are pretty sophisticated SIMD processors, because they can do instruction masking, allowing each thread to take different branches. But the way this works for something like "if (x) y else z" is that the SIMD unit executes the instructions for BOTH y and z on every thread, with the y instructions masked out (having no effect) for threads that take the z branch, and the z instructions masked out for threads that take the y branch. This is not a huge deal if y and z are simple computations, but if each branch has dozens of instructions, you have to wait for both branches to execute serially, which is slow.
Note that this penalty is only paid if there are actually threads that take both branches. If all the threads take the same branch, no masking is needed and things are fast. This is the key thing: at runtime, putting computations with similar branch flows in the same warp is much faster than mixing computations with divergent branch flows.
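The cost of divergence can be sketched with a toy model (my own illustration, not Anukari code): a warp pays for every branch path that at least one of its threads takes.

```python
# Toy cost model for warp divergence (illustrative, not a GPU simulator).
# A warp executes in lockstep: if its threads disagree on a branch, the warp
# pays for BOTH sides; if they all agree, it pays for only one.

def warp_branch_cost(conds, cost_if, cost_else):
    takes_if = any(conds)          # at least one thread takes the if-branch
    takes_else = not all(conds)    # at least one thread takes the else-branch
    cost = 0
    if takes_if:
        cost += cost_if
    if takes_else:
        cost += cost_else
    return cost

# All 32 threads take the same branch: pay for one path.
print(warp_branch_cost([True] * 32, 40, 40))                 # 40
# Divergent warp: both paths execute serially, masked.
print(warp_branch_cost([True] * 16 + [False] * 16, 40, 40))  # 80
```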
For Anukari, the most obvious way to do this is to group entities by type. Put sensors in a warp with other sensors, put bodies in a warp with other bodies, etc. In total, including sub-types, Anukari has 11 entity types. This means that for small instruments, we can easily sort each entity type into its own warp and get huge speedups. This is really the main advantage of porting from OpenCL to CUDA: just like Apple, NVIDIA artificially limits OpenCL to just 8 32-thread warps (256 threads). If you want to use the full 32 32-thread warps (1024 threads), you have to port to the hardware's native language. And we really do want to, because with OpenCL's 8 warps, we're much more likely to have to double up two entity types in one warp. Having 32 warps gives us a ton of flexibility.
The Algorithm
So we have 11 entity types going into 32 buckets. This is easy until we consider that an instrument may have hundreds of one entity type, and zero of another, or that instruments might have 33 of one entity type, which doesn't fit into a single bucket, etc. What we have here is an optimization problem. It is related to the bin-packing problem, but it's more complicated than the vanilla bin-packing problem because there are additional constraints, like the fact that we can break groups of entities into sub-groups if needed, and the fact that it needs to run REALLY fast because this happens on the audio thread whenever we need to write entities to buffers.
I'm extremely happy with the solution I ended up with. First, we simplify the problem:
- Each entity type is grouped together into a contiguous unit, which might internally have padding but will never be separated by another entity type.
- We do not consider reordering entity types: there is a fixed order that we put them in and that's it. This order is hand-chosen so that the most expensive entities are not adjacent to one another, and thus are unlikely to end up merged into the same warp.
A quick definition: a warp's "occupancy" will be the number of distinct types of entities that have been laid out within that warp. So if a warp contains some sensors, and some LFOs, its occupancy would be 2.
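In code, the definition is a one-liner (type names here are just for illustration):

```python
def occupancy(entity_types_in_warp):
    # A warp's occupancy is the number of distinct entity types laid out in it.
    return len(set(entity_types_in_warp))

# A warp holding 20 sensors and 12 LFOs has occupancy 2.
print(occupancy(["sensor"] * 20 + ["lfo"] * 12))  # 2
```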
The algorithm then is as follows:
1. Pretend we have infinite warps, and generate a layout that would be optimal. Basically, assign each entity type enough warps such that the maximum occupancy is 1. (This means that some warps might be right-padded with no-op entities.)
2. If the current layout fits into the actual number of warps the hardware has, we are done.
3. If not, look for any cases where dead space can be removed without increasing any warp's occupancy. (On the first iteration, there won't be any.)
4. Increment the maximum allowable occupancy by 1, and merge together any adjacent warps that, after being merged, will not exceed this occupancy level.
5. Go back to step 2.
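The merge loop described above can be sketched in Python. This is a simplified illustration under my own assumptions: the real implementation runs on the audio thread (in C++), and it can also remove dead space by splitting entity groups, which this sketch omits.

```python
WARP_SIZE = 32

def optimize_layout(counts, warps_available):
    """Simplified sketch of the minimax warp-layout optimizer.

    counts: entity count per type, in the fixed hand-chosen order.
    Returns a list of warps, each a list of (type_index, n_entities).
    """
    # Sanity check: the instrument must fit the hardware at all.
    assert sum(counts) <= WARP_SIZE * warps_available

    # Step 1: the "infinite warps" layout -- every warp holds one type
    # (occupancy 1), right-padded with no-op slots up to the warp boundary.
    warps = []
    for t, c in enumerate(counts):
        while c > 0:
            n = min(c, WARP_SIZE)
            warps.append([(t, n)])
            c -= n

    # Steps 2-5: raise the allowable occupancy and greedily merge adjacent
    # warps until the layout fits the hardware.
    max_occupancy = 1
    while len(warps) > warps_available:
        max_occupancy += 1
        merged = [warps[0]]
        for w in warps[1:]:
            a = merged[-1]
            used = sum(n for _, n in a) + sum(n for _, n in w)
            types = {t for t, _ in a} | {t for t, _ in w}
            if used <= WARP_SIZE and len(types) <= max_occupancy:
                merged[-1] = a + w  # merge adjacent warps
            else:
                merged.append(w)
        warps = merged
    return warps

# Four types, 3 warps of hardware: type 0 overflows one warp, the rest
# get merged pairwise at occupancy 2.
layout = optimize_layout([40, 10, 5, 8], warps_available=3)
print(layout)  # [[(0, 32)], [(0, 8), (1, 10)], [(2, 5), (3, 8)]]
```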
That's it! This is a minimax optimizer: it tries to minimize the maximum occupancy of any warp. It does this via the maximum allowable occupancy watermark. It tries all possible merges that would stay within a given occupancy before trying the next higher occupancy.
There are a couple of tricks to make this efficient, but the main one is that at the start of the algorithm, we make a conservative guess as to what the maximum occupancy will have to be. This way, if the solution requires occupancy 11 (say there are 1000 bodies and 1 of each remaining entity type, so the last warp has to contain all 11 types), we don't have to waste time merging things for occupancy 2, 3, 4, ... 11. It turns out that it's quite easy to guess within 1-2 of the true occupancy most of the time. I wrote a fuzz test for the algorithm, and across 5,000 random entity distributions the worst case is 5 optimizer iterations, and that's rare. Anyway, it's plenty fast enough.
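The post doesn't spell out the guessing formula, but one plausible lower bound follows from a pigeonhole argument: completely full warps have no free slots, so every type's "tail" (its count mod 32, or a small count) must share whatever warps remain.

```python
import math

WARP_SIZE, WARPS = 32, 32

def guess_occupancy(counts):
    # Hypothetical lower-bound guess (my own formula, not necessarily the
    # one Anukari uses). Assumes the instrument fits the hardware, i.e.
    # there is at least one free warp whenever any tails exist.
    full_warps = sum(c // WARP_SIZE for c in counts)
    tails = sum(1 for c in counts if c % WARP_SIZE)
    free_warps = WARPS - full_warps
    if tails == 0:
        return 1
    # The tails must be distributed across the free warps.
    return max(1, math.ceil(tails / free_warps))

# The example from the post: 1000 bodies plus one each of the 10 other
# types. The bodies fill 31 warps, so all 11 tails land in the last warp.
print(guess_occupancy([1000] + [1] * 10))  # 11
```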
The solutions the optimizer produces are excellent. In cases where there's a perfect solution available, it always gets it, because that's what it tries first. And in typical cases where compromise is needed, it usually finds solutions that are as good as what I could come up with manually.
Results
That's all well and good, but does it work? Yes. Sadly I don't have a graph to share, because this doesn't help with my microbenchmarks. Those are all tiny instruments for which the old optimizer worked fine.
But for running huge complex instruments, it is an ENORMOUS speedup, often up to 2x faster with the new optimizer. For example, the large instrument in this demo video previously averaged about 90% of the latency budget, with very frequent buffer overruns (the red clip lines in the GPU meter). That instrument is now completely usable with no overruns at all, averaging maybe 40% of the latency budget. Other benchmark instruments show even better gains, including one that never went below 100% of the latency budget before and is now at about 40%.
This opens up a TON more possibilities in terms of complex instruments. I think at this point, at least on Windows with NVIDIA hardware, I am completely satisfied with the performance. Apple with Metal is almost there but still needs just a tiny bit more work for me to be satisfied.
The CUDA port, and more optimization
The big news is that I've ported Anukari to NVIDIA's CUDA GPU framework, and all of the golden tests are passing -- it appears to work perfectly at this point. And it's very fast. Anukari was already pretty fast on NVIDIA hardware, and for simple presets the difference isn't huge (maybe 5-10%), but for complex presets the difference is transformative.
This improvement mostly comes from the fact that NVIDIA (like Apple) chose to artificially limit the amount of GPU block/threadgroup parallelism on OpenCL to force developers to migrate to their bespoke APIs. Both NVIDIA and Apple allow a maximum parallelism of 256 threads within a block/threadgroup, whereas both actually support 1024 if you use their APIs.
The reason this helps Anukari is that for large presets containing more than 256 objects, OpenCL would have to process more than one object per thread, per audio sample. This roughly doubled the processing latency once the 257th object was added. And it went up again at 513 objects, where each thread had to loop over three objects, and so on.
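The per-thread loop count is just a ceiling division, which makes the latency steps at 257 and 513 objects easy to see, and shows why the native 1024-thread limit pushes those steps much further out:

```python
import math

def objects_per_thread(n_objects, threads):
    # Each thread loops over ceil(n / threads) objects per audio sample.
    return math.ceil(n_objects / threads)

# OpenCL's 256-thread cap vs. the native 1024-thread limit:
for n in (256, 257, 513):
    print(n, objects_per_thread(n, 256), objects_per_thread(n, 1024))
# At 256 objects both run 1 object per thread; 257 and 513 force 2 and 3
# loop iterations with 256 threads, while 1024 threads still handle one
# object per thread.
```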
Now much larger presets are possible with very good performance. On Windows, at least, I am finally pretty dang happy with Anukari's performance. MacOS is getting very close, and with a buffer size of 512 it's solid. But with smaller buffer sizes it's still not quite perfect.
Fortunately, I have two more big optimizations up my sleeve. The biggest one is that now that I've ported to both Metal and CUDA, with the larger thread counts, I will be able to do further optimization with regards to warp divergence. I wrote about this a bit last year when I originally implemented warp alignment, which at that time was the biggest optimization to date.
But now with greater parallelism, I have access to way more warps. And in the meantime, I have added quite a few new object types with highly-divergent code paths. So it's time to improve the warp alignment so that the objects running within each warp have much better branch homogeneity. I am expecting some very substantial performance improvements for highly-complex instruments that make use of several kinds of modulators and exciters.
The old algorithm was a kind of greedy brute-force search for the best warp layout. But that was with 4 entity types, and now there are 11. Brute force is no longer feasible within the necessary latency constraints. So I've devised a heuristic search algorithm. Basically it starts by creating what it thinks would be the optimal layout, and if that layout fits the hardware, it uses it. Otherwise, it will iterate over a priority-ordered list of possible compressions of the layout, taking the one that has the least performance drag. It will do this in a loop until it finds a layout that fits the hardware. This algorithm is guaranteed to find a solution, because the last possible compression it will try is "just delete some padding between warps," which in the worst case results in a layout with no warp alignment whatsoever.
Anyway, here's a demo using the latest CUDA version of the simulation, which allows two voice-instanced instruments to run in parallel (with OBS capturing the screen, which is very bad for GPU performance):
Waste makes haste...?
Captain's Log: Stardate 78324
I've been really digging into MacOS optimizations over the last few days. Being ALU-bound is quite a pain in the butt, because unlike being memory-bound, there are a lot fewer big changes I can make to speed things up. Mostly I've been working on instruction-level optimizations, none of which have had a big impact. I've gotten to where I don't see anything else that is really worth optimizing at this level. I've confirmed that by simply commenting out pieces of code and measuring the speedup from not running it, which gives me an upper bound on how much optimizing it could help. Reducing computation at this point is not going to be done at the instruction level.
I have a couple ideas for larger structural/algorithmic changes, but before I move on to this, I want to eliminate a few other issues that I've noticed with MacOS.
The biggest pain I've run into is the fact that MacOS gives the user very little (almost no) control over the power/performance state of the machine. I guess this is user-friendly, as Apple magically figures out how best to clock the CPU and GPU up and down to save power, but it turns out that Apple is doing this really badly for Anukari's use case.
From what I can tell, Apple's GPU clock throttling is based on something akin to a load average. This is a concept I first ran into on Unix, where the load average is, roughly speaking, the average number of threads that are awaiting execution at any moment. If this number is higher than the number of CPU cores, it means there's a backlog: there are regularly threads that want to run, but can't.
On the Apple GPU, it seems that the OS won't clock it up until there's a command queue with a high load average. Which makes sense for throughput-sensitive use cases, but doesn't work at all for a latency-sensitive use case. For example, for graphics, as long as the command queue isn't backing up (load average > 1), there's no need to up-clock the GPU: it is keeping up. But for Anukari, this heuristic doesn't work, because if it ever hits a load average even close to 1, that means that there's no wallclock time left over for e.g. the DAW to process the audio block, etc. It's already too late, and audio glitches are regularly occurring even at something like 0.8.
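Some back-of-the-envelope numbers make the problem concrete. The sample rate and buffer size below are my own assumptions (48 kHz, 512 samples); the post doesn't state them.

```python
# Why a GPU "load average" of 0.8 is already too late for real-time audio.
sample_rate = 48_000   # Hz (assumed)
buffer_size = 512      # samples per audio block (assumed)
budget_ms = buffer_size / sample_rate * 1000  # wall-clock budget per block

# A load average of 0.8 sounds safe for throughput, but it means the
# simulation alone eats 80% of the wall-clock budget, leaving only ~2 ms
# for the DAW, driver overhead, and everything else in the block.
gpu_ms = 0.8 * budget_ms
leftover_ms = budget_ms - gpu_ms
print(round(budget_ms, 1), round(leftover_ms, 1))  # 10.7 2.1
```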
This is a serious problem. I asked on the Apple Developer Forums (optimistic, I know), and though I did get a thoughtful response, it wasn't from an Apple developer, and it didn't help anyway. I ran numerous experiments with queuing more simulation kernels in advance, but none of these ideas helped, because ultimately I can't generate audio in advance: I need up-to-date MIDI input, etc.
Ultimately, the solution that I am going with is probably the stupidest code that I will have ever shipped to production in my career: Anukari dedicates one warp of a GPU threadgroup to useless computation, keeping it fully saturated simply to communicate to the OS that the GPU needs to be clocked up. So in this case, waste makes haste.
But while this is incredibly stupid, it is also incredibly effective. I got it working with the minimal amount of power usage by dedicating the smallest amount of compute possible to spinning (not even a whole threadgroup, just one warp). And it immediately gets the OS to clock up the GPU, which decreases the audio computation latency by about 40%.
My new golden test latency stats framework doesn't show as much of a gain as I see when using the plugin, because it was already running the GPU with smaller gaps, as it doesn't need to wait for real time to render the audio. But even in the tests, the performance improvement is dramatic. In this graph you can see the overall real time ratio for the tests. The huge drop at iteration 15 is where I implemented the GPU spin:
Here is a graph with all the tests broken out individually:
At this point, on my Macbook M1, the standalone app performance is very usable, even for quite complicated presets like the slinky-resocube.
But... of course nothing is ever simple, and somehow while these fantastic performance gains are very observable when running as a VST3 plugin in Ableton, they are not nearly as visible when running as an AU plugin in GarageBand. I don't know what is up here, but it needs to be the next performance issue that I address, because if I can get past this, I might have performance at a "good enough" level on MacOS to start spending my time elsewhere for a while.