
Apple performance progress

Captain's Log: Stardate 78825.2

Since my last update, I've spent all my time taking advantage of the insights the engineers on the Apple Metal team gave me about how to convince the OS to raise its performance state so that Anukari runs super smoothly.

A big piece of news is that I bit the bullet and bought a used MacBook Pro M4 Max so that I can iterate on these changes much more quickly, and verify that the same tuning parameters work well on both my wimpy M1 Pro and my beastly M4 Max.

Things are working MUCH better now on the M4 Max. At least for me, I can easily run 6 plugin instances in Logic Pro with no crackling at all (512-sample buffer, 48 kHz). That said, there's still a weird issue that I've seen on my M1 Pro where sometimes a plugin instance will get "stuck" performing badly, and I don't know why. Simply changing the simulation backend to OpenCL and back to Metal fixes it, so it's something stateful in the simulator. More work to do!

Secret NDA stuff

One thing I've been doing is applying the details that Apple shared under NDA to tickle the OS performance-state heuristics in just the right way. The nice thing is that, knowing a tiny bit more about the heuristics, I can be much more efficient and effective in how I tickle them. This means the spin loop kernel uses fewer GPU resources to get the same result, and thus interferes less with the audio kernel and uses less power. I think this is as much detail as I can go into here, but honestly it's not that interesting, just necessary.

Spin loop serialization

Separately from the NDA stuff, I ran into another issue that was likely partly to blame for the horrible performance on the most powerful machines.

The Apple API docs for MTLCommandBuffer/commit say, "The GPU starts the command buffer after it starts any command buffers that are ahead of it in the same command queue." Notice that it doesn't say that it waits for any command buffers ahead of it to finish. From what I can tell, Metal is actually quite eager to run these buffers in parallel if it thinks it's safe to do so from a data-dependency perspective, and of course if there are cores available.

The spin loop kernel runs for a very short duration, and thus there's a CPU thread that's constantly enqueuing new command buffers with blocks of spin kernel invocations.

On my M1 Pro, it seems that the spin kernels that were queued up in these buffers ran in serial, or at least mostly in serial. But on my new M4 Max, it appears that the Metal API is extremely aggressive about running them in parallel, to the extent that it would slow down other GPU work during brief bursts of "run a zillion spin kernels at once."

The solution was very simple: use an MTLEvent to guarantee that the queued spin kernels run in serial.
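For readers curious what that looks like, here's a rough metal-cpp-style sketch (not buildable outside the Apple SDK, and EncodeSpinKernels is a hypothetical stand-in for the real encoding code). The event threads a signal/wait chain through the queue, so each command buffer's kernels can't start until the previous buffer has finished:

```cpp
// Rough metal-cpp sketch; names of my helpers are hypothetical.
MTL::Event* event = device->newEvent();
uint64_t serial = 0;

void EnqueueSpinBlock(MTL::CommandQueue* queue) {
  MTL::CommandBuffer* buffer = queue->commandBuffer();
  // Wait until the previous buffer has signaled its completion value.
  buffer->encodeWait(event, serial);
  EncodeSpinKernels(buffer);  // hypothetical: encode the block of spin kernels
  // Signal the next value when this buffer's work completes.
  buffer->encodeSignalEvent(event, ++serial);
  buffer->commit();
}
```

Because the wait is encoded on the GPU timeline, the CPU thread can keep enqueuing buffers ahead without ever blocking on their completion.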

Getting the spin loop out of the audio thread

The original design for the spin loop kernel was that one of the audio threads would take ownership of it, and that thread was responsible for feeding in the buffers of spin kernels to run. There's never any need to wait for the results of a spin kernel, so the audio thread was only paying the cost of encoding command buffers.

I built it this way out of sheer convenience — the audio thread had an MTLDevice and MTLLibrary handy, so it was easy to put the spin loop there. Also, errors in the spin loop were exposed to the rest of the simulator machinery, which allows for retries, etc.

But that's not a great design. First, even if the encoding overhead is small, it's still overhead that's simply not necessary to force upon the audio thread. And because MTLCommandQueue will block if it's full, bugs here could have disastrous results.

So finally I moved the spin kernel tender out of the audio thread and into its own thread. The audio threads collectively use a reference-counting scheme to make sure a single spin loop CPU thread is running while any number of simulation threads are running. The spin loop thread is responsible for its own retries. This removes any risk of added latency inside the audio threads.
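As a sketch of that hand-off (all names here are hypothetical, not Anukari's actual code): the audio threads bump a shared reference count, the first one in starts the dedicated spin thread, and the last one out stops it, so exactly one spin thread runs while any simulation is active.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical sketch of ref-counted ownership of a single spin thread.
static std::atomic<int> g_active_simulations{0};
static std::atomic<bool> g_spin_running{false};
static std::atomic<long> g_spin_iterations{0};
static std::thread g_spin_thread;

void AcquireSpinLoop() {
  // First audio thread in: start the dedicated spin thread.
  if (g_active_simulations.fetch_add(1, std::memory_order_acq_rel) == 0) {
    g_spin_running.store(true, std::memory_order_release);
    g_spin_thread = std::thread([] {
      while (g_spin_running.load(std::memory_order_acquire)) {
        // Stand-in for "encode and commit a block of spin kernels,"
        // including any retries on failure.
        g_spin_iterations.fetch_add(1, std::memory_order_relaxed);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }
}

void ReleaseSpinLoop() {
  // Last audio thread out: stop and join the spin thread.
  if (g_active_simulations.fetch_sub(1, std::memory_order_acq_rel) == 1) {
    g_spin_running.store(false, std::memory_order_release);
    g_spin_thread.join();
  }
}
```

The audio threads only ever touch two atomic counters here; all the command-buffer work (and its retries) lives on the spin thread, which is the point.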

Inter-process spin loop deduplication

While I was rewriting the spin loop infrastructure, I decided to fix another significant issue that's been there since the beginning.

Anukari has always made sure that only a single spin loop was being run at a time by using static variables to manage it. So for example, if the user has 5 instances of Anukari running in their DAW, only one spin loop will be running.

But this has problems. For example, because Anukari and AnukariEffect are separate DLLs, they have separate static storage. That means that if a user has both plugins running in their DAW, two spin loops would run, because the separate DLLs don't communicate. This is bad, because any additional spin loops are wasting resources that could be used for the simulations!

Even worse, though, are DAWs that do things like run plugins in a separate process. Depending on how this is configured, it might mean that many spin loops are running at once, because each Anukari instance is in a separate process.

In the past I had looked at ways to do some kind of really simple IPC to coordinate across processes, and also across DLLs within the same process. But none of the approaches I found were workable:

  • Apple's XPC: It's super heavyweight, requiring Anukari to start a spin loop service that all the plugins communicate with.
  • POSIX semaphores: This would work except for the tiny detail that they aren't automatically cleaned up when a process crashes, so a crash could leave things in a bad state that isn't recoverable without a reboot.
  • File locks (F_SETLK, O_EXLOCK): Almost perfect, except that locks are per-process, not per-file-descriptor. So this would not solve the issue of the two DLLs needing to coordinate: they would both be able to acquire the lock concurrently, since they're in the same process.
  • File descriptor locks (F_OFD_SETLK): Perfect. Except that macOS doesn't support them.

Today, though, I found a simple approach and implemented it, and it's working perfectly: in the 0.9.5 release of Anukari, there is guaranteed to only ever be a single spin loop thread globally, across all processes.

It turns out the simplest way to do this was to mmap() a shared memory segment from shm_open(), and then to do the inter-process coordination using atomic variables inside that segment. I already had the atomic-based coordination scheme implemented from my original spin system, so I just transplanted that to operate on the shared memory segment, and voila, things work perfectly. The atomic-based coordination is described in the "Less Waste Still Makes Haste" section of this post.

The nice thing with this scheme is that because the coordination is based on the spin loop CPU thread updating an atomic keepalive timestamp, cleanup is not required if a process crashes or exits weirdly. The keepalive will simply go stale and another thread will take ownership, using atomic instructions to guarantee that only one thread can become the new owner.
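A minimal sketch of the idea, assuming POSIX shm_open/mmap and a simplified two-field ownership record (all names here are hypothetical, and the real scheme has more state): ownership is claimed by CAS-ing the stale keepalive timestamp, so only one contender can win, and a crashed owner needs no cleanup because its heartbeat simply goes stale.

```cpp
#include <atomic>
#include <cstdint>
#include <ctime>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical ownership record living in the shared memory segment.
// A fresh segment is zero-filled, so the keepalive starts out "stale."
struct SpinOwnership {
  std::atomic<uint64_t> keepalive_ns;  // last heartbeat from the owner
  std::atomic<uint64_t> owner_id;      // token of the current owner
};

static uint64_t NowNs() {
  timespec ts{};
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return uint64_t(ts.tv_sec) * 1000000000ull + ts.tv_nsec;
}

SpinOwnership* MapSharedOwnership(const char* name) {
  // O_CREAT is idempotent: every process/DLL maps the same segment.
  int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return nullptr;
  if (ftruncate(fd, sizeof(SpinOwnership)) != 0) { close(fd); return nullptr; }
  void* mem = mmap(nullptr, sizeof(SpinOwnership), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
  close(fd);
  return mem == MAP_FAILED ? nullptr : static_cast<SpinOwnership*>(mem);
}

// Returns true if we are (or just became) the owner. Ownership can only
// be taken when the keepalive is stale; the CAS guarantees one winner.
bool TryAcquire(SpinOwnership* s, uint64_t my_id, uint64_t stale_ns) {
  uint64_t now = NowNs();
  uint64_t last = s->keepalive_ns.load(std::memory_order_acquire);
  if (s->owner_id.load(std::memory_order_acquire) == my_id ||
      now - last > stale_ns) {
    uint64_t expected = last;
    if (s->keepalive_ns.compare_exchange_strong(expected, now,
                                                std::memory_order_acq_rel)) {
      s->owner_id.store(my_id, std::memory_order_release);
      return true;
    }
  }
  return false;
}
```

The owner keeps calling TryAcquire periodically as its heartbeat; everyone else calls it and backs off while it returns false. Note that this relies on std::atomic being address-free (lock-free) for 64-bit values, which holds on Apple Silicon.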

Had a super productive conversation with an Apple Metal engineer

Well, my An Appeal to Apple post got way more traction than I had ever hoped for, especially on Hacker News, but also on LinkedIn and generally with friends who went out of their way to help me out. So first, thank you everyone, for helping me make contact with Apple. It worked!

I just got off a call with an engineer on the Apple Metal team, and the conversation was extremely friendly and helpful.

Unfortunately I can't share much in the way of crazy technical detail, because it turns out that when I created my Apple Developer account, I signed a telephone book-sized NDA with Apple. Obviously I enjoy writing about technical stuff, so it's a bit sad that I can't do so here, but at the same time the fact that I had pre-signed an NDA allowed the Metal engineer to open up and share some extremely helpful information. So overall I can't complain.

While I can't share any technical details, I can say that Apple was very sympathetic to the Anukari cause. The engineer provided some suggestions and hints that I can use right now to maybe — just maybe — get things working in the short term. But just as importantly they already have long-term plans to make more kinds of weirdo use cases like Anukari work better, while still being super power efficient. And they'll be able to use Anukari as a test-bed for some of that work.

I now have an open line of communication with the right folks at Apple, which is great. As I work on short-term performance improvements based on the hints they provided, I plan to run some ideas by them in terms of what I might be able to write about in detail without spilling any beans that need to remain un-spilled.

Thanks again, everyone!

Getting more and more stable

Captain's Log: Stardate 78592.1

The buffer-clearing saga

Adding the new AnukariEffect plugin has ended up precipitating a lot of improvements to Anukari, because it pushed me into testing what happens when multiple instances of the plugin are running at the same time. Most of my testing is done in the standalone Anukari application. It loads extremely quickly, so it's nice for iterating on a new UX change, etc. But in reality, it's likely that users will mostly use Anukari as a plugin, so obviously I need to give that configuration ample attention.

The last big issue I ran into with the plugin was that in GarageBand, when loading a song with something like 6 instances of Anukari and AnukariEffect, sometimes one of the instances would mysteriously fail. The GPU code would initialize just fine, but the GPU call to process the first audio block would fail with the very helpful Metal API error: Internal Error (0000000e:Internal Error), unknown reason.

After some research, it turned out that to get a more detailed error from the Metal API, you have to explicitly enable it with MTLCommandBufferDescriptor::errorOptions, and then dig it out of the NSError.userInfo map in an obscure and esoteric manner. So I had my intern (ChatGPT) figure out how to do that and finally I got a "more detailed" error message from the Metal API: IOGPUCommandQueueErrorDomain error 14.
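For reference, the opt-in looks roughly like this in metal-cpp terms (a sketch, not buildable outside the Apple SDK; LogEncoderInfo is a hypothetical helper):

```cpp
// Opt in to extended per-encoder error reporting before creating the buffer.
MTL::CommandBufferDescriptor* desc =
    MTL::CommandBufferDescriptor::alloc()->init();
desc->setErrorOptions(MTL::CommandBufferErrorOptionEncoderExecutionStatus);
MTL::CommandBuffer* buffer = queue->commandBuffer(desc);
// ... encode and commit as usual ...
buffer->waitUntilCompleted();
if (NS::Error* error = buffer->error()) {
  // The per-encoder details are buried in the NSError's userInfo under
  // MTLCommandBufferEncoderInfoErrorKey, as an array of
  // MTLCommandBufferEncoderInfo objects.
  LogEncoderInfo(error);  // hypothetical helper that digs them out
}
```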

If you've followed my devlog for a while, it should come as no surprise that I am a bit cynical about Apple's developer documentation. So I was completely unsurprised to find that this error is not documented anywhere in Apple's official documents. Apple just doesn't do that sort of thing.

Anyway, I found various mentions of similar errors, with speculation that they were caused by invalid memory accesses, or by kernels that ran too long. I used the Metal API validation tools to check for any weird memory access, and they didn't find anything. I figured they wouldn't, since I have some pretty abusive fuzz tests that I've run with Metal API validation enabled, and an invalid memory access almost certainly would have shown up before.

So I went with the working hypothesis that the kernel was running too long and hitting some kind of GPU watchdog timer. But this was a bit confusing, since the Anukari physics simulation kernel is, for obvious reasons, designed to be extremely fast. With some careful observation and manual bisection of various code features, I realized that it was definitely not the physics kernel, but rather it was the kernel that is used to clear the GPU-internal audio sample buffer.

Some background: Anukari supports audio delay lines, and so it needs to be able to store 1 second of audio history for each Microphone that might be tapped by a delay line. To avoid allocations during real-time audio synthesis, memory is allocated up-front for the maximum number of Microphones, which is 50. But also note that there can be 50 microphones per voice instance, and there can be 16 voice instances. Long story short, the per-microphone, per-instance, per-channel buffer for 1 second of audio is about 300 MB, which is kind of huge.
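To sanity-check that figure, assuming 32-bit float samples and the 48 kHz sample rate mentioned above (both assumptions on my part about the internal format):

```cpp
// Back-of-envelope for the delay-line history buffer.
constexpr long kMicrophones = 50;     // max per voice instance
constexpr long kVoiceInstances = 16;  // max voices
constexpr long kChannels = 2;         // stereo
constexpr long kSampleRate = 48000;   // 1 second of history
constexpr long kBytesPerSample = 4;   // assuming float32

// 50 * 16 * 2 * 48000 * 4 = 307,200,000 bytes, i.e. roughly 300 MB.
constexpr long kBufferBytes =
    kMicrophones * kVoiceInstances * kChannels * kSampleRate * kBytesPerSample;
```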

It's obvious that clearing such a buffer needs to be done locally on the GPU, since transferring a bunch of zeros from the CPU to the GPU would be stupid and slow. So Anukari had a kernel that would clear the buffer at startup, or at other times when it was considered "dirty" due to various possible events (or if the user requested a physics reset).

Now imagine 6 instances of Anukari all being initialized in parallel, and each instance is trying to clear 300 MB of RAM -- that's multiple gigabytes of memory write bandwidth. And sometimes one of those kernels would get delayed or slowed enough to time out. The problem only gets worse with more instances.

Initially I considered a bunch of ideas for how to clear this memory in a more targeted way. We might clear only the memory for microphones that are actually in use. But then we have to track which microphones are live. And also, the way the memory is strided, it's not all that clear that this would help, because we'd still be touching a huge swath of memory.

I came up with a number of other schemes of increasing complexity, which was unsatisfying because complexity is basically my #1 enemy at the moment. Almost all the bugs I'm wrangling at this point have to do with things being so complex that there were corner-cases that I didn't handle.

At this point you might be asking yourself: why does all this memory need to be cleared, anyway? That's a good question, which I should have asked earlier. The simple answer is that if a new delay line is created, we want to make sure that the audio samples it reads are silent in the case that they haven't been written yet by their associated microphone. For example, at startup.

But then that raises the question: couldn't we just avoid reading those audio samples somehow? For example, by storing information about the oldest sample number for which the data in a given sample stream is valid, and consulting that low-watermark before reading the samples.

The answer is yes, we could do that instead. And in a massive face-palm moment, I realized that I had already implemented this timestamp for microphones. So in other words, the memory clearing was completely unnecessary, because the GPU code was already keeping track of the oldest valid audio sample for each stream. I think what happened is that I wrote the buffer-clearing code before the low-watermark code, and forgot to remove the buffer-clearing code. And then forgot that I wrote the low-watermark code.

Well, that's not quite the whole story. In addition to the 50 microphone streams, there are 2 streams to represent the stereo external audio input, which can also be tapped by delay lines (to inject audio into the system as an effect processor). This data did not have a low-watermark, and thus the clearing was important.

However for external audio, a low-watermark is much simpler: it's just sample number 0. This is because external audio is copied into the GPU buffer on every block, and so it never has gaps. The Microphone streams can have gaps, because a Microphone can be deleted and re-added, etc. But for external audio, the GPU code just needs to check that it's not reading anything prior to sample 0, and after that it can always assume the data is valid.
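Conceptually, the read guard amounts to something like this (all names hypothetical; the real code is a GPU kernel, and this is a CPU-side sketch of the same logic):

```cpp
#include <cstdint>
#include <vector>

// Sketch of a sample stream with a low-watermark. For Microphone
// streams the watermark moves as mics are added/removed; for the
// external audio streams it's simply sample 0.
struct SampleStream {
  std::vector<float> ring;      // 1 second of audio history
  int64_t oldest_valid_sample;  // low-watermark
  int64_t newest_sample;        // most recently written sample number
};

float ReadDelayed(const SampleStream& s, int64_t sample_number) {
  // Anything outside [oldest_valid, newest] was never written (or has
  // been overwritten): treat it as silence instead of reading garbage.
  if (sample_number < s.oldest_valid_sample ||
      sample_number > s.newest_sample) {
    return 0.0f;
  }
  return s.ring[sample_number % int64_t(s.ring.size())];
}
```

With this check in place, the buffer's initial contents are irrelevant, which is exactly why the clearing kernel could be deleted.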

Thus ultimately the fix here was to just add 2 lines of GPU code to check the buffer access for external audio streams, and then to delete a couple hundred lines of CPU/GPU code responsible for clearing the internal buffer, marking it as dirty, etc. This resulted in a noticeable speedup for loading Anukari and completely solved the issue of unreliable initialization in the presence of multiple instances.

Pre-alpha release 0.0.13

With the last reliability bug (that I know of) solved, I was finally able to cut a new pre-alpha release this Friday. I'm super stoked about this release. It has a huge number of crash fixes, bug fixes, and usability enhancements. It also turned out to be the right time to add a few physics features that I felt were necessary before the full release. The details of what's in 0.0.13 are in the release notes and in older devlog entries, so I won't go into them here, but this release is looking pretty dang good.

The next two big things on my radar are AAX support and more factory presets. On the side I've been working to get the AAX certificates, etc., needed to release an AAX plugin, and I think that it should be pretty straightforward to get this working (famous last words). And for factory presets, I have about 50 right now but would like to release with a couple hundred. This is especially important now that I've added AnukariEffect, since only a couple of the current presets are audio effects -- most of them are instruments. So I'm kind of starting from scratch there. I think it's pretty vital to have a really great library of factory presets for both instruments and effects, and also, working on them is a great way to find issues with the plugin.

© 2025 Anukari LLC, All Rights Reserved