
Getting more and more stable

Captain's Log: Stardate 78592.1

The buffer-clearing saga

Adding the new AnukariEffect plugin has ended up precipitating a lot of improvements to Anukari, because it pushed me into testing what happens when multiple instances of the plugin are running at the same time. Most of my testing is done in the standalone Anukari application. It loads extremely quickly, so it's nice for quickly iterating on a new UX change, etc. But in reality, it's likely that users will mostly use Anukari as a plugin, so obviously I need to give that configuration ample attention.

The last big issue I ran into with the plugin was that in GarageBand, when loading a song that had something like 6 instances of Anukari and AnukariEffect, sometimes one of the instances would mysteriously fail. The GPU code would initialize just fine, but the GPU call to process the first audio block would fail with the very helpful Metal API error: Internal Error (0000000e:Internal Error), unknown reason.

After some research, it turned out that to get a more detailed error from the Metal API, you have to explicitly enable it with MTLCommandBufferDescriptor::errorOptions, and then dig it out of the NSError.userInfo map in an obscure and esoteric manner. So I had my intern (ChatGPT) figure out how to do that, and finally I got a "more detailed" error message from the Metal API: IOGPUCommandQueueErrorDomain error 14.
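
For reference, the incantation looks roughly like this (an Objective-C++ sketch; `queue` stands in for your MTLCommandQueue, and the surrounding encode/commit code is elided):

    // Opt in to detailed command buffer errors via the descriptor.
    MTLCommandBufferDescriptor *desc = [[MTLCommandBufferDescriptor alloc] init];
    desc.errorOptions = MTLCommandBufferErrorOptionEncoderExecutionStatus;
    id<MTLCommandBuffer> buffer = [queue commandBufferWithDescriptor:desc];
    // ... encode kernels, commit ...
    [buffer waitUntilCompleted];
    if (buffer.error != nil) {
      // The per-encoder details are buried in userInfo under this key.
      NSArray<id<MTLCommandBufferEncoderInfo>> *infos =
          buffer.error.userInfo[MTLCommandBufferEncoderInfoErrorKey];
      for (id<MTLCommandBufferEncoderInfo> info in infos) {
        NSLog(@"encoder '%@' error state: %ld", info.label, (long)info.errorState);
      }
    }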

If you've followed my devlog for a while, it should come as no surprise that I am a bit cynical about Apple's developer documentation. So I was completely unsurprised to find that this error is not documented anywhere in Apple's official documents. Apple just doesn't do that sort of thing.

Anyway, I found various mentions of similar errors, with speculation that they were caused by invalid memory accesses, or by kernels that ran too long. I used the Metal API validation tools to check for bad memory accesses, and they didn't find anything. I figured they wouldn't, since I have some pretty abusive fuzz tests that I've run with Metal API validation enabled, and an invalid memory access almost certainly would have shown up before.

So I went with the working hypothesis that the kernel was running too long and hitting some kind of GPU watchdog timer. But this was a bit confusing, since the Anukari physics simulation kernel is, for obvious reasons, designed to be extremely fast. With some careful observation and manual bisection of various code features, I realized that it was definitely not the physics kernel, but rather it was the kernel that is used to clear the GPU-internal audio sample buffer.

Some background: Anukari supports audio delay lines, and so it needs to be able to store 1 second of audio history for each Microphone that might be tapped by a delay line. To avoid allocations during real-time audio synthesis, memory is allocated up-front for the maximum number of Microphones, which is 50. But also note that there can be 50 Microphones per voice instance, and there can be 16 voice instances. Long story short, across all microphones, voice instances, and channels, the buffer for 1 second of audio is about 300 MB, which is kind of huge.
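
To put numbers on it (my arithmetic, assuming stereo 32-bit float samples at a 48 kHz sample rate): 50 microphones × 16 voice instances × 2 channels × 48,000 samples × 4 bytes ≈ 307 MB, which lines up with that figure.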

It's obvious that clearing such a buffer needs to be done locally on the GPU, since transferring a bunch of zeros from the CPU to the GPU would be stupid and slow. So Anukari had a kernel that would clear the buffer at startup, or at other times when it was considered "dirty" due to various possible events (or if the user requested a physics reset).

Now imagine 6 instances of Anukari all being initialized in parallel, and each instance is trying to clear 300 MB of RAM -- that's multiple gigabytes of memory write bandwidth. And sometimes one of those kernels would get delayed or slowed enough to time out. The problem only gets worse with more instances.

Initially I considered a bunch of ideas for how to clear this memory in a more targeted way. We might clear only the memory for microphones that are actually in use. But then we have to track which microphones are live. And also, the way the memory is strided, it's not all that clear that this would help, because we'd still be touching a huge swath of memory.

I came up with a number of other schemes of increasing complexity, which was unsatisfying because complexity is basically my #1 enemy at the moment. Almost all the bugs I'm wrangling at this point have to do with things being so complex that there were corner-cases that I didn't handle.

At this point you might be asking yourself: why does all this memory need to be cleared, anyway? That's a good question, which I should have asked earlier. The simple answer is that if a new delay line is created, we want to make sure that the audio samples it reads are silent in the case that they haven't been written yet by their associated microphone. For example, at startup.

But then that raises the question: couldn't we just avoid reading those audio samples somehow? For example, by storing information about the oldest sample number for which the data in a given sample stream is valid, and consulting that low-watermark before reading the samples.

The answer is yes, we could do that instead. And in a massive face-palm moment, I realized that I had already implemented this timestamp for microphones. So in other words, the memory clearing was completely unnecessary, because the GPU code was already keeping track of the oldest valid audio sample for each stream. I think what happened is that I wrote the buffer-clearing code before the low-watermark code, and forgot to remove the buffer-clearing code. And then forgot that I wrote the low-watermark code.

Well, that's not quite the whole story. In addition to the 50 microphone streams, there are 2 streams to represent the stereo external audio input, which can also be tapped by delay lines (to inject audio into the system as an effect processor). This data did not have a low-watermark, and thus the clearing was important.

However, for external audio, the low-watermark is much simpler: it's just sample number 0. This is because external audio is copied into the GPU buffer on every block, and so it never has gaps. The Microphone streams can have gaps, because a Microphone can be deleted and re-added, etc. But for external audio, the GPU code just needs to check that it's not reading anything prior to sample 0, and after that it can always assume the data is valid.
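
As a rough sketch of what that read guard looks like (illustrative C++ names, with the history length assumed to be 1 second at 48 kHz -- this is not Anukari's actual GPU code):

    constexpr long kHistoryLength = 48000;  // 1 second of history at 48 kHz (assumed)

    // Returns the delay-line sample, or silence if it predates valid data.
    float ReadDelayTap(const float *stream, long sample, long oldestValid) {
      // oldestValid is the per-stream low-watermark for Microphone streams,
      // and simply 0 for the external audio input streams.
      if (sample < oldestValid) return 0.0f;   // never written: read as silence
      return stream[sample % kHistoryLength];  // ring-buffer history indexing
    }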

Thus ultimately the fix here was to just add 2 lines of GPU code to check the buffer access for external audio streams, and then to delete a couple hundred lines of CPU/GPU code responsible for clearing the internal buffer, marking it as dirty, etc. This resulted in a noticeable speedup for loading Anukari and completely solved the issue of unreliable initialization in the presence of multiple instances.

Pre-alpha release 0.0.13

With the last reliability bug (that I know of) solved, I was finally able to cut a new pre-alpha release this Friday. I'm super stoked about this release. It has a huge number of crash fixes, bug fixes, and usability enhancements. It also turned out to be the right time to add a few physics features that I felt were necessary before the full release. The details of what's in 0.0.13 are in the release notes and in older devlog entries, so I won't go into them here, but this release is looking pretty dang good.

The next two big things on my radar are AAX support and more factory presets. On the side I've been working to get the AAX certificates, etc., needed to release an AAX plugin, and I think that it should be pretty straightforward to get this working (famous last words). And for factory presets, I have about 50 right now but would like to release with a couple hundred. This is especially important now that I've added AnukariEffect, since only a couple of the current presets are audio effects -- most of them are instruments. So I'm kind of starting from scratch there. I think it's pretty vital to have a really great library of factory presets for both instruments and effects, and also, working on them is a great way to find issues with the plugin.

More workarounds for Apple

Captain's Log: Stardate 78573.2

Automatic Bypassing Workaround

While testing the new AnukariEffect plugin in various DAWs for compatibility, I found that it was doing some very strange stuff in GarageBand (and Logic Pro, which seems to share the same internals). I had noticed weird stuff in GarageBand before even with the instrument plugin, and had a TODO to do a deep dive, so I figured that now was as good a time as any to finally get the plugin working well with Apple's DAWs.

What I had seen in the past with the Anukari (instrument) plugin was that sometimes the physics simulation would inexplicably stop working. I had seen this at GarageBand startup, but also after it had been open for a while. I couldn't see any reason in Anukari's logs for the problem, and occasionally it would just start working again. But this was fairly rare and I hadn't had time to find a way to reproduce it.

But with the AnukariEffect plugin, this was happening constantly. Since it was easy to reproduce, I pretty quickly found out that GarageBand will simply stop calling into the plugin's ProcessBlock function, which is where audio processing happens, and in Anukari, is where the physics simulation occurs.

It turns out that GarageBand is extremely aggressive about this. It has some heuristics about when a plugin is no longer producing audio, and at that time it will stop calling into the plugin, to save CPU/power. For example, for an instrument, if it hasn't received MIDI input in a while it might be automatically bypassed. And for an effect, if the track is not playing or the effect is not receiving audio input, it will be automatically bypassed.

This is reasonable behavior, and other DAWs do it too. But in a VST3 plugin (as opposed to an AudioUnit), for example, the plugin can specify its number of "tail samples" as kInfiniteTail -- in other words, it can state "I might keep generating audio samples forever, even without input." VST3 plugins can also set their sub-type to Generator, which likewise communicates to the DAW that they might continue to generate audio without input. (Note that an AudioUnit can be a generator at the top level, but instruments/effects can't also be generators. Which is a pretty big oversight.)
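
For what it's worth, in a JUCE-based plugin (a sketch under that assumption, not necessarily how Anukari does it), signaling an infinite tail is a one-liner, and JUCE's VST3 wrapper translates it to kInfiniteTail for the host:

    // In your juce::AudioProcessor subclass (requires #include <limits>):
    double getTailLengthSeconds() const override {
      return std::numeric_limits<double>::infinity();
    }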

And other DAWs that do aggressive automatic bypassing, like Cubase or Nuendo, provide an option to disable the feature. But of course having a knob like this is anathema to Apple, especially in GarageBand, and thus it cannot be disabled.

Anyway, for an instrument or effect plugin running in an Apple DAW, aggressive automatic bypassing is just a fact of life. And if that plugin is a continuous physics simulation like Anukari, this is a huge problem, because the simulation part of the plugin will become unresponsive, and furthermore, weird discontinuous things may happen if it is bypassed and un-bypassed at inopportune moments.

So as usual for working with Apple, the solution to Apple's oversimplification of the problem is to push more complexity into the non-Apple software: Anukari can now detect that it has been automatically bypassed, and will seamlessly transfer ownership of the physics simulation to a background thread. When DAW processing resumes, it seamlessly transfers ownership back. This is optional (but highly recommended), so users who really need to save power can disable it.

This really is much more complicated than I'd like, partly because Apple doesn't provide any indication that the plugin is bypassed. From what I can tell there's no notification whatsoever, except that ProcessBlock stops getting called. So detecting this condition requires a keepalive timer and a background thread that monitors it. Once it detects that ProcessBlock hasn't run for too long, it begins running the simulation directly. Then when ProcessBlock resumes being called, it detects that the keepalive is fresh again and stops.

There are some very tricky details here to do all this reliably in the real-time audio thread without priority inversion issues with a mutex. The keepalive timer is an atomic, and the monitoring thread never acquires the mutex unless the keepalive is stale. This does mean that the audio thread has to acquire the mutex for each audio block, but because we use the atomic keepalive timer to guarantee that the mutex will never be contended, this is OK, because on all the platforms where Anukari will run, an uncontended mutex acquisition is simply an atomic CAS operation. (This is a great tip I learned from Fabian Renn-Giles in his excellent ADC23 talk.)

There is one moment where the audio thread's mutex acquisition could be contended, which is when the automatic bypass is being lifted. The monitoring thread may be holding it while running the simulation itself. This is not a big deal though, because the monitoring thread releases the mutex after simulating each small sample block. The audio thread will try to acquire the mutex, fail, and return a silent buffer. But in doing so it will update the keepalive timer, and next time it runs it will acquire the mutex without contention. The reason this dropped block is not a big deal is that we're coming back from being bypassed anyway -- this just adds a few samples of latency before audio starts. No problem.
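
Here's a compressed sketch of the scheme (all names and the staleness threshold are mine, for illustration):

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <mutex>

    std::atomic<int64_t> lastBlockNanos{0};  // keepalive, written every block
    std::mutex simMutex;                     // owns the simulation state

    static int64_t NowNanos() {
      return std::chrono::duration_cast<std::chrono::nanoseconds>(
          std::chrono::steady_clock::now().time_since_epoch()).count();
    }

    // Audio thread: called by the DAW for every audio block.
    void ProcessBlock() {
      lastBlockNanos.store(NowNanos(), std::memory_order_release);
      if (simMutex.try_lock()) {  // uncontended except while bypass is lifting
        // ... run the physics simulation, produce this block's audio ...
        simMutex.unlock();
      }
      // else: the monitor thread still owns the simulation; emit silence, and
      // the fresh keepalive above tells it to back off before our next block.
    }

    // Monitor thread: keeps the simulation alive while the DAW bypasses us.
    void MonitorLoop() {
      constexpr int64_t kStaleNanos = 100'000'000;  // 100 ms (assumed)
      for (;;) {
        if (NowNanos() - lastBlockNanos.load(std::memory_order_acquire) >
            kStaleNanos) {
          std::lock_guard<std::mutex> lock(simMutex);
          // ... simulate one small sample block, then release the mutex ...
        }
        // ... sleep briefly before checking the keepalive again ...
      }
    }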

There's one last detail: while all the above complexity keeps the physics simulation running, so the user can continue to interact with the plugin, there will not be any audio output while it's bypassed. This cannot be fixed, and could be a little confusing. So Anukari now displays a pulsating "BYPASSED" message on the master output level meter when it is in this state. And that message has a tooltip explaining how the DAW is doing potentially annoying things.

Less Waste Still Makes Haste

In my previous post Waste Makes Haste I wrote about how Anukari has to run a spin loop on a single GPU core to convince MacOS to actually clock up the GPU so that it performs well enough for Anukari to function.

That workaround continues to be extremely effective. However, while testing multiple Anukari instances, I realized that each instance was running a spin loop on the GPU, so e.g. 4 instances would run 4 spin loops. Running the one spin loop is pretty stupid, but it gets the job done and is well worth it. But running 4 spin loops is purely wasteful, since only one is required to get the GPU clocked up.

Fixing this requires coordination among all Anukari audio threads within the same process. Somehow a single audio thread needs to run the spin loop, and the others need to just do regular audio processing. But if the first thread running the loop is e.g. bypassed, another thread needs to pick up the work, and so on.

I ended up devising another overly-complicated solution here, which is to use another shared atomic keepalive timer. Each audio thread checks it periodically to see if it has expired, and if so, attempts a CAS to update it. If that CAS fails, it means some other thread got to it first. If the CAS succeeds, it means that this thread now owns the spin loop and needs to keep updating the keepalive. There are some other details, but this algorithm turned out to be mercifully easy to get right with just a couple of CAS operations and a nano timer. (And it doesn't even require that each thread sees only unique nano timestamps!)
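
A sketch of the election logic (illustrative names, with an assumed staleness window):

    #include <atomic>
    #include <cstdint>

    std::atomic<int64_t> spinKeepaliveNanos{0};      // shared within the process
    constexpr int64_t kSpinStaleNanos = 50'000'000;  // 50 ms (assumed)

    // Called periodically from each audio thread in the process. Returns true
    // if this thread just won ownership of the GPU spin loop.
    bool MaybeClaimSpinLoop(int64_t now) {
      int64_t last = spinKeepaliveNanos.load(std::memory_order_relaxed);
      if (now - last <= kSpinStaleNanos) return false;  // someone is tending it
      // Expired: race to claim it. Exactly one CAS can succeed, even if two
      // threads read the same 'last' -- no unique timestamps required.
      return spinKeepaliveNanos.compare_exchange_strong(
          last, now, std::memory_order_acq_rel);
    }

    // The current owner refreshes the keepalive while it runs the spin loop.
    void RefreshSpinKeepalive(int64_t now) {
      spinKeepaliveNanos.store(now, std::memory_order_release);
    }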

An alternative solution would be to have an entirely separate thread run the GPU spin loop, instead of having an audio thread be responsible for tending to it. This could also be a good solution. However it would require its own tricky details so that the spin loop would pause when all audio threads were bypassed. And also it would require initializing some Metal state that each audio thread already initializes anyway. I will probably keep the current solution unless it proves unreliable, in which case I'll move to this alternative.

Mouse Cursor Hell

The last workaround I spent time on this past week was making custom mouse cursors work well in GarageBand and Logic.

Since Anukari has a somewhat sophisticated 3D editor, custom mouse cursors are very useful for helping make it clear what is happening. So for example, when the user drags the right mouse button to rotate the camera, the mouse cursor changes to a little rotation icon for the duration of the drag, and then goes back to being a pointer when the button is released.

Or, well, it goes back to being a pointer in every DAW except GarageBand and Logic, because of course Apple is doing something fucking weird with the mouse cursor in their DAWs. Humorously, as I investigated this issue, I discovered that the mouse cursor often gets stuck in GarageBand/Logic even without plugins, and that users have been complaining about this for at least 10 years. One user in a forum post basically said, "don't worry about the busted mouse cursors so much, you just get used to it." So Apple has been ignoring an obvious mouse cursor bug for a decade. Sounds about right.

Anyway, I narrowed down the problem to the fact that changing the mouse cursor using [NSCursor set] inside of a mouseDown or mouseUp event sometimes doesn't work. Err, it does work, in the sense that the call succeeds, and if you call [NSCursor currentCursor] it will return the one you just set. But visually the cursor will not change.

I tried about a billion things, and ultimately ended up with a workaround that force-sets the mouse cursor inside the next mouseMove or mouseDrag event (described in more detail here on the JUCE forums). This is not perfect, but it's pretty good, and much better than no workaround at all.
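
The shape of the workaround looks something like this (an Objective-C++ sketch with invented names; the real logic lives in JUCE-adjacent event-handling code):

    static NSCursor *pendingCursor = nil;

    // Called when the app wants to change the cursor (e.g. on mouseUp).
    void RequestCursor(NSCursor *cursor) {
      pendingCursor = cursor;
      [cursor set];  // may be visually ignored inside mouseDown/mouseUp
    }

    // Called from the next mouseMove/mouseDrag event, where set takes effect.
    void OnMouseMoveOrDrag() {
      if (pendingCursor != nil) {
        [pendingCursor set];  // force-set; this one actually sticks
        pendingCursor = nil;
      }
    }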

Yikes... as you can tell I'm pretty sick of dealing with compatibility with Apple's DAWs. But I'm not done yet. There are two more Apple-specific issues that I'm aware of, which hopefully I can address over the next week.

The new warp alignment optimizer

As I mentioned in the previous post, I've been working on writing a better algorithm for optimizing the way that entities are aligned to GPU warps (or as Apple calls them, SIMD-groups). For the sake of conversation, let's assume that each GPU threadgroup is 1024 threads, and those are broken up into 32 warps, each of which has 32 threads. (These happen to be actual numbers from both my NVIDIA chip and my Apple M1.)

Each warp shares an instruction pointer. This is why Apple's name for them makes sense: each SIMD-group is kind of like a 32-data-wide SIMD unit. In practice these are some pretty sophisticated SIMD processors, because they can do instruction masking, allowing each thread to take different branches. But the way this works for something like "if (x) y else z" is that the SIMD unit executes the instructions for BOTH y and z on every thread: the y instructions are masked out (have no effect) for threads that take the z branch, and the z instructions are masked out for threads that take the y branch. This is not a huge deal if y and z are simple computations, but if each branch has dozens of instructions, you have to wait for each branch to be executed serially, which is slow.

Note that this penalty is only paid if there are actually threads that take both branches. If all the threads take the same branch, no masking is needed and things are fast. This is the key thing: at runtime, putting computations with similar branch flows in the same warp is much faster than mixing computations with divergent branch flows.
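
To make that concrete, here's an illustrative CUDA-style kernel (hypothetical names and stand-in work, not Anukari's actual code):

    constexpr int kSensor = 0;  // illustrative entity-type tag

    __device__ float SimulateSensor(int i) { return 0.0f; }  // stand-in work
    __device__ float SimulateBody(int i) { return 1.0f; }    // stand-in work

    __global__ void StepEntities(const int *entityType, float *out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (entityType[i] == kSensor) {
        out[i] = SimulateSensor(i);  // masked no-op for non-sensor threads
      } else {
        out[i] = SimulateBody(i);    // masked no-op for sensor threads
      }
      // A warp that is all sensors (or all bodies) executes only one branch;
      // a mixed warp pays for both branches serially.
    }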

For Anukari, the most obvious way to do this is to group entities by type. Put sensors in a warp with other sensors, put bodies in a warp with other bodies, etc. In total, including sub-types, Anukari has 11 entity types. This means that for small instruments, we can easily sort each entity type into its own warp, and get huge speedups. This is really the main advantage of porting from OpenCL to CUDA: just like Apple, NVIDIA artificially limits OpenCL to just 8 32-thread warps (256 threads). If you want to use the full 32 32-thread warps (1024 threads), you have to port to the hardware's native language. Which we really do want, because with OpenCL's 8 warps, we're much more likely to have to double up two entity types in one warp. Having 32 warps gives us a ton of flexibility.

The Algorithm

So we have 11 entity types going into 32 buckets. This is easy until we consider that an instrument may have hundreds of one entity type and zero of another, or that it might have 33 of one entity type, which doesn't fit into a single bucket, etc. What we have here is an optimization problem. It's related to bin-packing, but more complicated than the vanilla bin-packing problem, because there are additional constraints: groups of entities can be broken into sub-groups if needed, and the whole thing needs to run REALLY fast, because it happens on the audio thread whenever we need to write entities to buffers.

I'm extremely happy with the solution I ended up with. First, we simplify the problem:

  • Each entity type is grouped together into a contiguous unit, which might internally have padding but will never be separated by another entity type.
  • We do not consider reordering entity types: there is a fixed order that we put them in and that's it. This order is hand-chosen so that the most expensive entities are not adjacent to one another, and thus are unlikely to end up merged into the same warp.

A quick definition: a warp's "occupancy" will be the number of distinct types of entities that have been laid out within that warp. So if a warp contains some sensors, and some LFOs, its occupancy would be 2.

The algorithm then is as follows:

  1. Pretend we have infinite warps, and generate a layout that would be optimal. Basically, assign each entity type enough warps such that the maximum occupancy is 1. (This means that some warps might be right-padded with no-op entities.)
  2. If the current layout fits into the actual number of warps the hardware has, we are done.
  3. If not, look for any cases where dead space can be removed without increasing any warp's occupancy. (On the first iteration, there won't be any.)
  4. Increment the maximum allowable occupancy by 1, and merge together any adjacent warps that, after being merged, will not exceed this occupancy level.
  5. Go back to step 2.

That's it! This is a minimax optimizer: it tries to minimize the maximum occupancy of any warp. It does this via the maximum allowable occupancy watermark. It tries all possible merges that would stay within a given occupancy before trying the next higher occupancy.
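
Here's a simplified sketch of the merge loop (illustrative C++; it ignores the dead-space-removal pass and the initial occupancy guess described below, and assumes the entities fit within the hardware's warps at some occupancy):

    #include <algorithm>
    #include <set>
    #include <utility>
    #include <vector>

    constexpr int kWarpSize = 32;

    struct Warp {
      std::set<int> types;  // distinct entity types laid out in this warp
      int used = 0;         // threads occupied (the rest are no-op padding)
    };

    // Step 1: occupancy-1 layout over "infinite" warps, in the fixed order.
    std::vector<Warp> InitialLayout(
        const std::vector<std::pair<int, int>> &typeCounts) {
      std::vector<Warp> warps;
      for (auto [type, count] : typeCounts) {
        while (count > 0) {
          int n = std::min(count, kWarpSize);
          warps.push_back({{type}, n});  // right-padded with no-ops if n < 32
          count -= n;
        }
      }
      return warps;
    }

    // Steps 2-5: raise the occupancy watermark and merge adjacent warps
    // until the layout fits the hardware.
    std::vector<Warp> Optimize(std::vector<Warp> warps, int maxWarps) {
      int allowedOccupancy = 1;
      while ((int)warps.size() > maxWarps) {
        ++allowedOccupancy;
        std::vector<Warp> merged;
        for (const Warp &w : warps) {
          if (!merged.empty()) {
            Warp &prev = merged.back();
            std::set<int> combined = prev.types;
            combined.insert(w.types.begin(), w.types.end());
            if ((int)combined.size() <= allowedOccupancy &&
                prev.used + w.used <= kWarpSize) {
              prev.types = std::move(combined);
              prev.used += w.used;
              continue;  // merged into the previous warp
            }
          }
          merged.push_back(w);
        }
        warps = std::move(merged);
      }
      return warps;
    }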

There are a couple of tricks to make this efficient, but the main one is that at the start of the algorithm, we make a conservative guess as to what the maximum occupancy will have to be. This way, if the solution requires occupancy 11 (say there are 1000 bodies and 1 of each remaining entity type, so the last warp has to contain all 11 types), we don't have to waste time merging things for occupancy 2, 3, 4, ... 11. It turns out that it's quite easy to guess within 1-2 of the true occupancy most of the time. I wrote a fuzz test for the algorithm, and across 5,000 random entity distributions the worst case was 5 optimizer iterations, and that's rare. Anyway, it's plenty fast enough.

The solutions the optimizer produces are excellent. In cases where there's a perfect solution available, it always gets it, because that's what it tries first. And in typical cases where compromise is needed, it usually finds solutions that are as good as what I could come up with manually.

Results

That's all fine and good, but does it work? Yes. Sadly I don't have a graph to share, because the new optimizer doesn't help with my microbenchmarks -- those are all tiny instruments for which the old optimizer worked fine.

But for running huge complex instruments, it is an ENORMOUS speedup, often up to 2x faster with the new optimizer. For example, for the large instrument in this demo video, it previously was averaging about 90% of the latency budget, with very frequent buffer overruns (the red clip lines in the GPU meter). That instrument is now completely usable with no overruns at all, averaging maybe 40% of the latency budget. Other benchmark instruments show even better gains, with one that never went below 100% of the latency budget before now at about 40%.

This opens up a TON more possibilities in terms of complex instruments. I think at this point, at least on Windows with NVIDIA hardware, I am completely satisfied with the performance. Apple with Metal is almost there but still needs just a tiny bit more work for me to be satisfied.
