
Getting more and more stable

Captain's Log: Stardate 78592.1

The buffer-clearing saga

Adding the new AnukariEffect plugin has ended up precipitating a lot of improvements to Anukari, because it pushed me into testing what happens when multiple instances of the plugin are running at the same time. Most of my testing is done in the standalone Anukari application. It loads extremely quickly, so it's nice for rapidly iterating on a new UX change, etc. But in reality, it's likely that users will mostly use Anukari as a plugin, so obviously I need to give that configuration ample attention.

The last big issue I ran into with the plugin was that in GarageBand, when loading a song that had something like 6 instances of Anukari and AnukariEffect, sometimes one of the instances would mysteriously fail. The GPU code would initialize just fine, but the GPU call to process the first audio block would fail with the very helpful Metal API error: Internal Error (0000000e:Internal Error), unknown reason.

After some research, it turned out that to get a more detailed error from the Metal API, you have to explicitly enable it with MTLCommandBufferDescriptor::errorOptions, and then dig it out of the NSError.userInfo map in an obscure and esoteric manner. So I had my intern (ChatGPT) figure out how to do that, and I finally got a "more detailed" error message from the Metal API: IOGPUCommandQueueErrorDomain error 14.
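
For anyone hitting the same wall, here's a minimal Objective-C++ sketch of the kind of code involved. The variable names (like queue) and the logging are my own illustration, not Anukari's actual code:

```
// Enable extended error reporting before creating the command buffer.
MTLCommandBufferDescriptor* desc = [[MTLCommandBufferDescriptor alloc] init];
desc.errorOptions = MTLCommandBufferErrorOptionEncoderExecutionStatus;

id<MTLCommandBuffer> commandBuffer = [queue commandBufferWithDescriptor:desc];
// ... encode the kernel, commit, waitUntilCompleted ...

if (commandBuffer.status == MTLCommandBufferStatusError)
{
    NSError* error = commandBuffer.error;
    // The interesting details hide in userInfo rather than in the error itself.
    for (id<MTLCommandBufferEncoderInfo> info in
             error.userInfo[MTLCommandBufferEncoderInfoErrorKey])
        NSLog(@"encoder %@ error state %ld", info.label, (long) info.errorState);
}
```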

If you've followed my devlog for a while, it should come as no surprise that I am a bit cynical about Apple's developer documentation. So I was completely unsurprised to find that this error is not documented anywhere in Apple's official documents. Apple just doesn't do that sort of thing.

Anyway, I found various mentions of similar errors, with speculation that they were caused by invalid memory accesses, or by kernels that ran too long. I used the Metal API validation tools to check for invalid memory accesses, and they didn't find anything. I figured they wouldn't, since I have some pretty abusive fuzz tests that I've run with Metal API validation enabled, and an invalid memory access almost certainly would have shown up before.

So I went with the working hypothesis that the kernel was running too long and hitting some kind of GPU watchdog timer. But this was a bit confusing, since the Anukari physics simulation kernel is, for obvious reasons, designed to be extremely fast. With some careful observation and manual bisection of various code features, I realized that it was definitely not the physics kernel, but rather it was the kernel that is used to clear the GPU-internal audio sample buffer.

Some background: Anukari supports audio delay lines, and so it needs to be able to store 1 second of audio history for each Microphone that might be tapped by a delay line. To avoid allocations during real-time audio synthesis, memory is allocated up-front for the maximum number of Microphones, which is 50. But also note that there can be 50 microphones per voice instance, and there can be 16 voice instances. Long story short, the buffer that holds 1 second of audio across all microphones, voice instances, and channels comes to about 300 MB, which is kind of huge.
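
The back-of-the-envelope arithmetic, as a sketch (the 48 kHz sample rate and 32-bit float samples are my assumptions about the internal format):

```
#include <cstddef>

constexpr int    kMaxMicrophones    = 50;     // per voice instance
constexpr int    kMaxVoiceInstances = 16;
constexpr int    kChannels          = 2;
constexpr int    kSampleRate        = 48000;  // 1 second of history per stream
constexpr size_t kBytesPerSample    = sizeof(float);

constexpr size_t kDelayHistoryBytes =
    size_t(kMaxMicrophones) * kMaxVoiceInstances * kChannels *
    kSampleRate * kBytesPerSample;            // 307,200,000 bytes, i.e. ~300 MB
```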

It's obvious that clearing such a buffer needs to be done locally on the GPU, since transferring a bunch of zeros from the CPU to the GPU would be stupid and slow. So Anukari had a kernel that would clear the buffer at startup, or at other times when it was considered "dirty" due to various possible events (or if the user requested a physics reset).

Now imagine 6 instances of Anukari all being initialized in parallel, with each instance trying to clear 300 MB of RAM -- that's nearly 2 GB of memory writes all at once. And sometimes one of those kernels would get delayed or slowed enough to time out. The problem only gets worse with more instances.

Initially I considered a bunch of ideas for how to clear this memory in a more targeted way. We might clear only the memory for microphones that are actually in use. But then we have to track which microphones are live. And also, the way the memory is strided, it's not all that clear that this would help, because we'd still be touching a huge swath of memory.

I came up with a number of other schemes of increasing complexity, which was unsatisfying because complexity is basically my #1 enemy at the moment. Almost all the bugs I'm wrangling at this point have to do with things being so complex that there are corner cases I didn't handle.

At this point you might be asking yourself: why does all this memory need to be cleared, anyway? That's a good question, which I should have asked earlier. The simple answer is that if a new delay line is created, we want to make sure that the audio samples it reads are silent in the case that they haven't been written yet by their associated microphone. For example, at startup.

But then that raises the question: couldn't we just avoid reading those audio samples somehow? For example, by storing information about the oldest sample number for which the data in a given sample stream is valid, and consulting that low-watermark before reading the samples.

The answer is yes, we could do that instead. And in a massive face-palm moment, I realized that I had already implemented this timestamp for microphones. So in other words, the memory clearing was completely unnecessary, because the GPU code was already keeping track of the oldest valid audio sample for each stream. I think what happened is that I wrote the buffer-clearing code before the low-watermark code, and forgot to remove the buffer-clearing code. And then forgot that I wrote the low-watermark code.

Well, that's not quite the whole story. In addition to the 50 microphone streams, there are 2 streams to represent the stereo external audio input, which can also be tapped by delay lines (to inject audio into the system as an effect processor). This data did not have a low-watermark, and thus the clearing was important.

However, for external audio, a low-watermark is much simpler: it's just sample number 0. This is because external audio is copied into the GPU buffer on every block, and so it never has gaps. The Microphone streams can have gaps, because a Microphone can be deleted and re-added, etc. But for external audio, the GPU code just needs to check that it's not reading anything prior to sample 0, and after that it can always assume the data is valid.
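
As a sketch of the idea (plain C++ for illustration; the real check lives in the GPU kernel, and the names here are mine, not Anukari's):

```
#include <cstdint>

// Read a delayed sample from a 1-second ring buffer of audio history.
// lowWatermark is the oldest sample index that has actually been written;
// anything older is treated as silence instead of garbage memory.
float readDelayedSample(const float* stream, int64_t ringLength,
                        int64_t sampleIndex, int64_t lowWatermark)
{
    if (sampleIndex < lowWatermark)
        return 0.0f;                          // never written: silent
    return stream[sampleIndex % ringLength];  // valid history
}
// For Microphone streams the watermark is tracked per stream; for the
// external audio input streams it is simply 0.
```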

Thus ultimately the fix here was to just add 2 lines of GPU code to check the buffer access for external audio streams, and then to delete a couple hundred lines of CPU/GPU code responsible for clearing the internal buffer, marking it as dirty, etc. This resulted in a noticeable speedup for loading Anukari and completely solved the issue of unreliable initialization in the presence of multiple instances.

Pre-alpha release 0.0.13

With the last reliability bug (that I know of) solved, I was finally able to cut a new pre-alpha release this Friday. I'm super stoked about this release. It has a huge number of crash fixes, bug fixes, and usability enhancements. It also turned out to be the right time to add a few physics features that I felt were necessary before the full release. The details of what's in 0.0.13 are in the release notes and in older devlog entries, so I won't go into them here, but this release is looking pretty dang good.

The next two big things on my radar are AAX support and more factory presets. On the side I've been working to get the AAX certificates, etc., needed to release an AAX plugin, and I think that it should be pretty straightforward to get this working (famous last words). And for factory presets, I have about 50 right now but would like to release with a couple hundred. This is especially important now that I've added AnukariEffect, since only a couple of the current presets are audio effects -- most of them are instruments. So I'm kind of starting from scratch there. I think it's pretty vital to have a really great library of factory presets for both instruments and effects, and also, working on them is a great way to find issues with the plugin.

More workarounds for Apple

Captain's Log: Stardate 78573.2

Automatic Bypassing Workaround

While testing the new AnukariEffect plugin in various DAWs for compatibility, I found that it was doing some very strange stuff in GarageBand (and Logic Pro, which seems to share the same internals). I had noticed weird stuff in GarageBand before even with the instrument plugin, and had a TODO to do a deep dive, so I figured that now was as good a time as any to finally get the plugin working well with Apple's DAWs.

What I had seen in the past with the Anukari (instrument) plugin was that sometimes the physics simulation would inexplicably stop working. I had seen this at GarageBand startup, but also after it had been open for a while. I couldn't see any reason in Anukari's logs for the problem, and occasionally it would just start working again. But this was fairly rare and I hadn't had time to find a way to reproduce it.

But with the AnukariEffect plugin, this was happening constantly. Since it was easy to reproduce, I pretty quickly found out that GarageBand will simply stop calling into the plugin's ProcessBlock function, which is where audio processing happens and, in Anukari, where the physics simulation runs.

It turns out that GarageBand is extremely aggressive about this. It has some heuristics about when a plugin is no longer producing audio, and at that time it will stop calling into the plugin, to save CPU/power. For example, for an instrument, if it hasn't received MIDI input in a while it might be automatically bypassed. And for an effect, if the track is not playing or the effect is not receiving audio input, it will be automatically bypassed.

This is reasonable behavior, and other DAWs do it too. But in a VST3 plugin (as opposed to an AudioUnit), for example, the plugin can report its "tail samples" as kInfiniteTail -- in other words, it can state "I might keep generating audio samples forever even without input." VST3 plugins can also set their sub-type to Generator, which likewise communicates to the DAW that they might continue to generate audio even without input. (Note that an AudioUnit can be a generator at the top level, but instruments/effects can't also be generators. Which is a pretty big oversight.)
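
In JUCE terms (which Anukari appears to use, given the JUCE forum thread linked further down), reporting an infinite tail looks roughly like the sketch below. Whether Anukari does exactly this is my assumption, and MyPhysicsProcessor is a placeholder name:

```
#include <limits>

// As I understand it, JUCE's VST3 wrapper maps an infinite tail length to the
// VST3 kInfiniteTail constant, telling the host the plugin may keep producing
// audio without any input.
double MyPhysicsProcessor::getTailLengthSeconds() const
{
    return std::numeric_limits<double>::infinity();
}
```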

Other DAWs that do aggressive automatic bypassing, like Cubase or Nuendo, provide an option to disable the feature. But of course having a knob like this is anathema to Apple, especially in GarageBand, and thus it cannot be disabled.

Anyway, for an instrument or effect plugin running in an Apple DAW, aggressive automatic bypassing is just a fact of life. And if that plugin is a continuous physics simulation like Anukari, this is a huge problem, because the simulation part of the plugin will become unresponsive, and furthermore, weird discontinuous things may happen if it is bypassed and un-bypassed at inopportune moments.

So as usual for working with Apple, the solution to Apple's oversimplification of the problem is to push more complexity into the non-Apple software: Anukari can now detect that it has been automatically bypassed, and will seamlessly transfer ownership of the physics simulation to a background thread. When DAW processing resumes, it seamlessly transfers ownership back. This is optional (but highly recommended), so users who really need to save power can disable it.

This really is much more complicated than I'd like, partly because Apple doesn't provide any indication that the plugin is bypassed. From what I can tell there's no notification whatsoever, except that ProcessBlock stops getting called. So detecting this condition requires a keepalive timer and a background thread that monitors it. Once the background thread detects that ProcessBlock hasn't run in too long, it begins running the simulation directly. Then when ProcessBlock resumes being called, it detects that the keepalive is fresh again and stops.

There are some very tricky details here to do all this reliably in the real-time audio thread without priority inversion issues with a mutex. The keepalive timer is an atomic, and the monitoring thread never acquires the mutex unless the keepalive is stale. This does mean that the audio thread has to acquire the mutex for each audio block, but that's OK: the atomic keepalive timer guarantees that the mutex will never be contended, and on all the platforms where Anukari will run, an uncontended mutex acquisition is simply an atomic CAS operation. (This is a great tip I learned from Fabian Renn-Giles in his excellent ADC23 talk.)
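
Here is a condensed sketch of that pattern, with my own illustrative names and threshold rather than Anukari's actual code:

```
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>

std::atomic<int64_t> lastBlockNanos { 0 };  // keepalive, written by the audio thread
std::mutex simulationMutex;                 // guards ownership of the simulation

void runSimulationForOneBlock();            // the actual physics step (elsewhere)

static int64_t nowNanos()
{
    using namespace std::chrono;
    return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
}

// Real-time audio thread: called by the DAW for every audio block.
void processBlock()
{
    lastBlockNanos.store(nowNanos(), std::memory_order_relaxed);
    std::unique_lock<std::mutex> lock(simulationMutex, std::try_to_lock);
    if (!lock.owns_lock())
        return;                             // handoff moment: output one silent block
    runSimulationForOneBlock();             // normal case: uncontended lock (one CAS)
}

// Background monitor thread: takes over while the DAW has bypassed the plugin.
void monitorLoop()
{
    constexpr int64_t staleAfterNanos = 100'000'000;  // assumed threshold (100 ms)
    for (;;)
    {
        if (nowNanos() - lastBlockNanos.load(std::memory_order_relaxed) > staleAfterNanos)
        {
            std::lock_guard<std::mutex> lock(simulationMutex);
            runSimulationForOneBlock();     // keep the physics alive while bypassed
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```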

There is one moment where the audio thread's mutex acquisition could be contended, which is when the automatic bypass is being lifted. The monitoring thread may be holding it while running the simulation itself. This is not a big deal though, because the monitoring thread releases the mutex after simulating each small sample block. The audio thread will try to acquire the mutex, fail, and return a silent buffer. But in doing so it will update the keepalive timer, and next time it runs it will acquire the mutex without contention. The reason this dropped block is not a big deal is that we're coming back from being bypassed anyway -- this just adds a few samples of latency before audio starts. No problem.

There's one last detail, which is that while all the above complexity keeps the physics simulation running, so the user can continue to interact with the plugin, there will not be any audio output while it's bypassed. This cannot be fixed, and could be a little confusing. So Anukari now displays a pulsating "BYPASSED" message on the master output level meter when it is in this state. And that message has a tooltip explaining how the DAW is doing potentially annoying things.

Less Waste Still Makes Haste

In my previous post Waste Makes Haste I wrote about how Anukari has to run a spin loop on a single GPU core to convince MacOS to actually clock up the GPU so that it performs well enough for Anukari to function.

That workaround continues to be extremely effective. However, while testing multiple Anukari instances, I realized that each instance was running its own spin loop on the GPU, so e.g. 4 instances would run 4 spin loops. Running one spin loop is pretty stupid, but it gets the job done and is well worth it. Running 4 spin loops is purely wasteful, since only one is required to get the GPU clocked up.

Fixing this requires coordination among all Anukari audio threads within the same process. Somehow a single audio thread needs to run the spin loop, and the others need to just do regular audio processing. But if the first thread running the loop is e.g. bypassed, another thread needs to pick up the work, and so on.

I ended up devising another overly-complicated solution here, which is to use another shared atomic keepalive timer. Each audio thread checks it periodically to see if it has expired, and if so, attempts a CAS to update it. If that CAS fails, it means some other thread got to it first. If the CAS succeeds, it means that this thread now owns the spin loop and needs to keep updating the keepalive. There are some other details, but this algorithm turned out to be mercifully easy to get right with just a couple of CAS operations and a nano timer. (And it doesn't even require that each thread sees only unique nano timestamps!)
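
A sketch of that election logic, with illustrative names and an assumed expiry threshold (not Anukari's actual code):

```
#include <atomic>
#include <chrono>
#include <cstdint>

// Shared by every Anukari audio thread in the process.
std::atomic<int64_t> spinLoopKeepaliveNanos { 0 };

static int64_t nowNanos()
{
    using namespace std::chrono;
    return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
}

// Each audio thread calls this periodically. Returns true if this thread now
// owns (or still owns) the GPU spin loop and must keep refreshing the keepalive.
bool maybeAdoptSpinLoop(bool alreadyOwner)
{
    constexpr int64_t staleAfterNanos = 50'000'000;  // assumed expiry (50 ms)
    const int64_t now = nowNanos();

    if (alreadyOwner)
    {
        // The current owner just refreshes the keepalive.
        spinLoopKeepaliveNanos.store(now, std::memory_order_relaxed);
        return true;
    }

    int64_t last = spinLoopKeepaliveNanos.load(std::memory_order_relaxed);
    if (now - last <= staleAfterNanos)
        return false;                                // someone else has the GPU covered

    // Keepalive expired: exactly one thread wins this CAS and takes over.
    return spinLoopKeepaliveNanos.compare_exchange_strong(last, now,
                                                          std::memory_order_relaxed);
}
```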

An alternative solution would be to have an entirely separate thread run the GPU spin loop, instead of having an audio thread be responsible for tending to it. This could also be a good solution. However it would require its own tricky details so that the spin loop would pause when all audio threads were bypassed. And also it would require initializing some Metal state that each audio thread already initializes anyway. I will probably keep the current solution unless it proves unreliable, in which case I'll move to this alternative.

Mouse Cursor Hell

The last workaround I spent time on this past week was making custom mouse cursors work well in GarageBand and Logic.

Since Anukari has a somewhat sophisticated 3D editor, custom mouse cursors are very useful for helping make it clear what is happening. So for example, when the user drags the right mouse button to rotate the camera, the mouse cursor changes to a little rotation icon for the duration of the drag, and then goes back to being a pointer when the button is released.

Or, well, it goes back to being a pointer in every DAW except GarageBand and Logic, because of course Apple is doing something fucking weird with the mouse cursor in their DAWs. Humorously, as I investigated this issue, I discovered that the mouse cursor often gets stuck in GarageBand/Logic even without plugins, and that users have been complaining about this for at least 10 years. One user in a forum post basically said, "don't worry about the busted mouse cursors so much, you just get used to it." So Apple has been ignoring an obvious mouse cursor bug for a decade. Sounds about right.

Anyway, I narrowed down the problem to the fact that changing the mouse cursor using [NSCursor set] inside of a mouseDown or mouseUp event sometimes doesn't work. Err, it does work, in the sense that the call succeeds, and if you call [NSCursor currentCursor] it will return the one you just set. But visually the cursor will not change.

I tried about a billion things, and ultimately ended up with a workaround that force-sets the mouse cursor inside the next mouseMove or mouseDrag event (described in more detail here on the JUCE forums). This is not perfect, but it's pretty good, and much better than no workaround at all.
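
The workaround boils down to something like this Objective-C++ sketch of the idea. The flag and names are mine, and the real code lives inside JUCE's event handling rather than free functions:

```
#import <AppKit/AppKit.h>
#include <atomic>

// Set when [NSCursor set] was called during mouseDown/mouseUp, where
// GarageBand/Logic sometimes ignore it visually.
std::atomic<bool> cursorNeedsForcedSet { false };
NSCursor* desiredCursor = nil;

void requestCursor(NSCursor* cursor)
{
    desiredCursor = cursor;
    [cursor set];                   // may not take effect visually in GarageBand/Logic
    cursorNeedsForcedSet = true;    // re-apply on the next move/drag event
}

// Called from the next mouseMove / mouseDrag event, where setting the
// cursor does take effect.
void onMouseMoveOrDrag()
{
    if (cursorNeedsForcedSet.exchange(false) && desiredCursor != nil)
        [desiredCursor set];
}
```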

Yikes... as you can tell I'm pretty sick of dealing with compatibility with Apple's DAWs. But I'm not done yet. There are two more Apple-specific issues that I'm aware of, which hopefully I can address over the next week.

CPack considered harmful

Captain's Log: Stardate 78553.4

Stability

Since my last devlog update, one big thing I did was to just run the chaos monkey 24x7, fixing crashes it caused until it couldn't crash Anukari any longer. It found some highly interesting issues, including a divide-by-zero bug in the graphics code that has probably existed since about the 3rd week of development on Anukari. In each case where it found a crash, I tried as much as possible to generalize my fixes to cover problems more broadly, and this strategy seems to have paid off: at the moment the chaos monkey hasn't crashed Anukari in about 48 hours of running.

AnukariEffect

In between solving new crashes from the chaos monkey, I continued to work on launch-blockers, one of which was finally creating a second version of the plugin that allows it to be used as an effects module in a signal chain (rather than as an instrument). This is a bit annoying, because the VST3 plugin actually is perfectly capable of being used in either context, since it dynamically determines how many audio inputs it has, etc. But most DAWs simply don't support plugins that can be used either way.
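
For illustration, in JUCE a plugin can accept both the instrument-style (no main input) and effect-style (stereo input) layouts with something like the sketch below. Whether Anukari's actual bus logic looks like this is my assumption, and MyPhysicsProcessor is a placeholder name:

```
// Hedged sketch: accept stereo output with either no main input (instrument)
// or a stereo main input (effect). Assumes the usual JUCE plugin headers.
bool MyPhysicsProcessor::isBusesLayoutSupported (const BusesLayout& layouts) const
{
    const auto in  = layouts.getMainInputChannelSet();
    const auto out = layouts.getMainOutputChannelSet();

    return out == juce::AudioChannelSet::stereo()
        && (in.isDisabled() || in == juce::AudioChannelSet::stereo());
}
```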

So now there is a second AnukariEffect plugin. This works great, and it's really nice to be able to just drop AnukariEffect into a track as an effect without having to do complicated sidechain stuff to use it that way. I made a couple of initial effect presets, and already it's producing some extremely cool sounds. I'm very excited about this.

A bunch of small work remains to make AnukariEffect nice to use. For example, the GUI needs to have some subtle changes, like adding a wet/dry slider, hiding controls that have to do with irrelevant MIDI inputs, etc. Also, because it doesn't receive MIDI input, AnukariEffect will only do singleton instruments and not voice-instanced instruments, so I need to put some thought into how to handle edge cases like what to do when the user loads a voice-instanced instrument in AnukariEffect. I think it will likely get converted to a singleton with a warning message to the user. But I need to experiment a bit to find what feels right.

CPack

The introduction of a second VST3 (and AU) plugin necessitated some changes to the installers. Also, separately I am working on getting an AAX plugin up and running. So I realized that now is the time to really get the installers working correctly, allowing the user to e.g. install only VST3 and not AAX, etc.

I had originally used CMake's CPack for generating the installers, using the INNOSETUP generator for Windows and the productbuild generator for MacOS. This seemed really convenient, because I didn't want to learn how to use Inno Setup / productbuild directly, and it looked like CPack could generate the installers without me having to get into the weeds.

In the end, I really wish I had never tried using CPack's generators for this. They are horrible. Basically the problem is that both Inno Setup and productbuild have fairly rich configuration languages, and CPack's generators expose perhaps 10% of the features of each one in a completely haphazard way. So it seems convenient, but then the second you need to configure something about the installer that the CPack authors did not expose as an option, you're completely hosed. Originally I tried to work around the CPack limitations with horrible hacks, such as a bash script that wrapped pkgbuild and took some special actions. But this was a complicated mess and didn't work well.

So for both Windows and MacOS, I decided to just bite the bullet and learn how to use Inno Setup and productbuild/pkgbuild directly. And as it turns out, in both cases, it is much simpler to just go straight to the nuts and bolts without the CPack generators. It resulted in less config code overall, with less indirection, no hacks, and I was able to configure the installers exactly how I wanted.

Frankly at this point I can't see any argument for why anyone would want to use CPack. It's substantially more complicated/obfuscated/indirect, it limits you to an eclectic subset of each installer's features, and it truly was harder to learn how to configure CPack than to just figure out Inno Setup and productbuild/pkgbuild. The documentation for the installer tools is way better than CPack's, and to use CPack you kind of have to understand the installer tools anyway.

So the end result of rewriting the Windows and MacOS installers without CPack is that they both work how I want now, and will be a lot easier to maintain as I continue to get closer to release, adding the AAX plugin and so on. I'm very happy that installers are now a "solved problem" -- one more box checked for the launch.
