Way more detail than you ever wanted to know about the development of the Anukari 3D Physics Synthesizer

Apple performance progress

Captain's Log: Stardate 78825.2

Since my last update, I've spent all my time applying the insights that the engineers on the Apple Metal team gave me about how to convince the OS to raise the performance state so that Anukari runs super smoothly.

A big piece of news is that I bit the bullet and bought a used MacBook Pro M4 Max so that I can iterate on these changes much more quickly, and verify that the same tuning parameters work well on both my wimpy M1 Pro and my beastly M4 Max.

Things are working MUCH better now on the M4 Max. At least for me, I can easily run 6 plugin instances in Logic Pro with no crackling at all (512 buffer, 48 kHz). That said, there's still a weird issue that I've seen on my M1 Pro where sometimes a plugin instance will get "stuck" performing badly, and I don't know why. Simply changing the simulation backend to OpenCL and back to Metal fixes it, so it's something stateful in the simulator. More work to do!

Secret NDA stuff

One thing I've been doing is applying the details that Apple shared under NDA to tickle the OS performance state heuristics in just the right way. The nice thing here is that, knowing a tiny bit more about the heuristics, I'm able to be much more efficient and effective with how I tickle them. This means that the spin loop kernel uses fewer GPU resources to get the same result, and thus it interferes less with the audio kernel and uses less power. I think this is as much detail as I can go into here, but honestly it's not that interesting, just necessary.

Spin loop serialization

Separately from the NDA stuff, I ran into another issue that was likely partly to blame for the horrible performance on the most powerful machines.

The Apple API docs for MTLCommandBuffer/commit say, "The GPU starts the command buffer after it starts any command buffers that are ahead of it in the same command queue." Notice that it doesn't say that it waits for any command buffers ahead of it to finish. From what I can tell, Metal is actually quite eager to run these buffers in parallel if it thinks it's safe to do so from a data dependency perspective, and of course if there are cores available.

The spin loop kernel runs for a very short duration, and thus there's a CPU thread that's constantly enqueuing new command buffers with blocks of spin kernel invocations.

On my M1 Pro, it seems that the spin kernels that were queued up in these buffers ran in serial, or at least mostly in serial. But on my new M4 Max, it appears that the Metal API is extremely aggressive about running them in parallel, to the extent that it would slow down other GPU work during brief bursts of "run a zillion spin kernels at once."

The solution was simple: use an MTLEvent to guarantee that the queued spin kernels run serially.
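
In code, the change amounts to chaining the buffers with an event. Here's a minimal Swift sketch, not Anukari's actual code; the pipeline and dispatch sizes are placeholders:

```swift
import Metal

// Minimal sketch: force the spin-loop command buffers on a queue to execute
// one after another by chaining them with an MTLEvent. The pipeline and the
// dispatch size are placeholders, not Anukari's real values.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let event = device.makeEvent()!
var ticket: UInt64 = 0

func enqueueSpinBlock(spinPipeline: MTLComputePipelineState) {
    let cb = queue.makeCommandBuffer()!

    // Don't start this buffer's work until the previous one has signaled.
    cb.encodeWaitForEvent(event, value: ticket)

    let encoder = cb.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(spinPipeline)
    encoder.dispatchThreadgroups(
        MTLSize(width: 1, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: 32, height: 1, depth: 1))
    encoder.endEncoding()

    // Signal the next ticket value when this buffer completes, which releases
    // the buffer queued up behind it.
    ticket += 1
    cb.encodeSignalEvent(event, value: ticket)
    cb.commit()
}
```

Since each buffer waits on the value signaled by its predecessor, the spin kernels can no longer fan out across the GPU, no matter how aggressively the scheduler wants to parallelize them.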

Getting the spin loop out of the audio thread

The original design for the spin loop kernel was that one of the audio threads would take ownership of it, and that thread was responsible for feeding in the buffers of spin kernels to run. There's never any need to wait for the results of a spin kernel, so the audio thread was only paying the cost of encoding command buffers.

I built it this way out of sheer convenience — the audio thread had an MTLDevice and MTLLibrary handy, so it was easy to put the spin loop there. Also, errors in the spin loop were exposed to the rest of the simulator machinery, which allowed for retries, etc.

But that's not a great design. First, even if the encoding overhead is small, it's still overhead that simply doesn't need to be forced upon the audio thread. Second, because MTLCommandQueue will block if it's full, bugs here could have disastrous results.

So finally I moved the spin kernel tender out of the audio thread and into its own thread. The audio threads collectively use a reference-counting scheme to make sure a single spin loop CPU thread is running while any number of simulation threads are running. The spin loop thread is responsible for its own retries. This removes any risk of added latency inside the audio threads.
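
The ownership scheme is conceptually just a reference count guarding one worker thread. A rough Swift sketch of the idea (lock-based for brevity; the real audio-thread side has to avoid locks and allocation, and the names here are invented):

```swift
import Foundation

// Rough sketch of the ownership scheme: the first simulation thread to show up
// starts the spin-loop thread, and the last one to leave tells it to stop.
// Lock-based for brevity; the real audio-thread side must avoid locks.
final class SpinLoopOwner {
    private let lock = NSLock()
    private var refCount = 0
    private var shouldRun = false

    func acquire() {
        lock.lock(); defer { lock.unlock() }
        refCount += 1
        guard refCount == 1 else { return }
        shouldRun = true
        let thread = Thread { [weak self] in
            while self?.stillRunning() == true {
                // Enqueue the next block of spin kernels, handle errors/retries.
            }
        }
        thread.name = "AnukariSpinLoop"
        thread.start()
    }

    func release() {
        lock.lock(); defer { lock.unlock() }
        refCount -= 1
        if refCount == 0 { shouldRun = false }  // the thread sees this and exits
    }

    private func stillRunning() -> Bool {
        lock.lock(); defer { lock.unlock() }
        return shouldRun
    }
}
```

Each simulation thread calls acquire() when it starts processing and release() when it stops, so exactly one spin-loop thread exists whenever at least one simulation is live, and the audio threads never touch the spin loop's command queue themselves.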

Inter-process spin loop deduplication

While I was rewriting the spin loop infrastructure, I decided to fix another significant issue that's been there since the beginning.

Anukari has always made sure that only a single spin loop was being run at a time by using static variables to manage it. So for example, if the user has 5 instances of Anukari running in their DAW, only one spin loop will be running.

But this has problems. For example, because Anukari and AnukariEffect are separate DLLs, they have separate static storage. That means that if a user has both plugins running in their DAW, two spin loops would run, because the separate DLLs don't communicate. This is bad, because any additional spin loops are wasting resources that could be used for the simulations!

Even worse, though, are DAWs that do things like run plugins in a separate process. Depending on how this is configured, it might mean that many spin loops are running at once, because each Anukari instance is in a separate process.

In the past I had looked at ways to do some kind of really simple IPC to coordinate across processes, and also across DLLs within the same process. But none of the approaches I found were workable:

  • Apple's XPC: It's super heavyweight, requiring Anukari to start a spin loop service that all the plugins communicate with.
  • POSIX semaphores: This would work except for the tiny detail that they aren't automatically cleaned up when a process crashes, so a crash could leave things in a bad state that's not recoverable without a reboot.
  • File locks (F_SETLK, O_EXLOCK): Almost perfect, except that locks are per-process, not per-file-descriptor. So this would not solve the issue with the two DLLs needing to coordinate: they would both be able to acquire the lock concurrently since they're in the same process.
  • File descriptor locks (F_OFD_SETLK): Perfect. Except that macOS doesn't support them.

Today, though, I found a simple approach and implemented it, and it's working perfectly. In the 0.9.5 release of Anukari, there is guaranteed to be only a single spin loop thread globally, across all processes.

It turns out the simplest way to do this was to mmap() a shared memory segment from shm_open(), and then to do the inter-process coordination using atomic variables inside that segment. I already had the atomic-based coordination scheme implemented from my original spin system, so I just transplanted that to operate on the shared memory segment, and voila, things work perfectly. The atomic-based coordination is described in the "Less Waste Still Makes Haste" section of this post.

The nice thing with this scheme is that because the coordination is based on the spin loop CPU thread updating an atomic keepalive timestamp, cleanup is not required if a process crashes or exits weirdly. The keepalive will simply go stale and another thread will take ownership, using atomic instructions to guarantee that only one thread can become the new owner.
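
To make the takeover logic concrete, here is a simplified Swift sketch of the keepalive scheme, using the swift-atomics package. In the real implementation the atomic 64-bit timestamp lives inside the shm_open()/mmap() shared segment so that every process sees the same word; a ManagedAtomic stands in for it here, and the names and staleness threshold are made up:

```swift
import Atomics      // swift-atomics package
import Foundation

// Simplified sketch of the keepalive/takeover logic. In Anukari the atomic
// timestamp lives in the mmap()'d shared memory segment; here a ManagedAtomic
// stands in for it. The staleness threshold is an arbitrary example value.
let keepaliveMs = ManagedAtomic<UInt64>(0)
let staleAfterMs: UInt64 = 250

func nowMs() -> UInt64 {
    DispatchTime.now().uptimeNanoseconds / 1_000_000
}

// Any thread that wants a spin loop to exist calls this periodically.
// Returns true if this caller just took ownership.
func tryBecomeOwner() -> Bool {
    let last = keepaliveMs.load(ordering: .relaxed)
    guard nowMs() &- last > staleAfterMs else { return false }  // owner is alive
    // At most one thread wins this compare-and-swap and becomes the new owner.
    let (won, _) = keepaliveMs.compareExchange(expected: last,
                                               desired: nowMs(),
                                               ordering: .acquiringAndReleasing)
    return won
}

// The current owner's spin-loop thread refreshes the timestamp as it runs.
func refreshKeepalive() {
    keepaliveMs.store(nowMs(), ordering: .releasing)
}
```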

Had a super productive conversation with an Apple Metal engineer

Well, my An Appeal to Apple post got way more traction than I had ever hoped for, especially on Hacker News, but also on LinkedIn, and generally with friends who went out of their way to help me out. So first, thank you everyone, for helping me make contact with Apple. It worked!

I just got off a call with an engineer on the Apple Metal team, and the conversation was extremely friendly and helpful.

Unfortunately I can't share much in the way of crazy technical detail, because it turns out that when I created my Apple Developer account, I signed a telephone book-sized NDA with Apple. Obviously I enjoy writing about technical stuff, so it's a bit sad that I can't do so here, but at the same time the fact that I had pre-signed an NDA allowed the Metal engineer to open up and share some extremely helpful information. So overall I can't complain.

While I can't share any technical details, I can say that Apple was very sympathetic to the Anukari cause. The engineer provided some suggestions and hints that I can use right now to maybe — just maybe — get things working in the short term. But just as importantly they already have long-term plans to make more kinds of weirdo use cases like Anukari work better, while still being super power efficient. And they'll be able to use Anukari as a test-bed for some of that work.

I now have an open line of communication with the right folks at Apple, which is great. As I work on short-term performance improvements based on the hints they provided, I plan to run some ideas by them in terms of what I might be able to write about in detail without spilling any beans that need to remain un-spilled.

Thanks again, everyone!

An Appeal to Apple from Anukari

[EDIT]: Apple got in touch! Details here.

TL;DR: To make Anukari’s performance reliable across all Apple silicon macOS devices, I need to talk to someone on the Apple Metal team. It would be great if someone can connect me with the right person inside Apple, or direct them to my feedback request FB17475838 as well as this devlog entry.

This is going to be a VERY LONG HIGHLY TECHNICAL post, so either buckle your seatbelt or leave while you still can.

Background

The Anukari 3D Physics Synthesizer simulates a large spring-mass model in real-time for audio generation. To support a nontrivial number of physics objects, it requires a GPU for the simulation. The physics code is ALU-bound, not memory-bound. All mutable state in the simulation is stored in the GPU’s threadgroup memory, which is roughly equivalent to a manually-allocated L1 cache, so it is extremely fast.

The typical use case for Anukari is running it as an AudioUnit (AU) or VST3 plugin inside a host application like Pro Tools or Ableton, also called a Digital Audio Workstation (DAW). The DAW invokes Anukari for each audio buffer block, which is a request to generate/process N samples of audio. For each block, Anukari invokes the physics simulation GPU kernel, waits for the result, and returns.
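
For context, the per-block flow is roughly the following. This is a bare-bones Swift sketch with placeholder pipeline and buffer names, not Anukari's actual code:

```swift
import Metal

// Bare-bones sketch of one audio block: encode the physics kernel, submit it,
// and block until it finishes, because the DAW is waiting for the samples.
// Pipeline and buffer names are placeholders.
func processBlock(queue: MTLCommandQueue,
                  physicsPipeline: MTLComputePipelineState,
                  inputBuffer: MTLBuffer,
                  outputBuffer: MTLBuffer) {
    let cb = queue.makeCommandBuffer()!
    let encoder = cb.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(physicsPipeline)
    encoder.setBuffer(inputBuffer, offset: 0, index: 0)
    encoder.setBuffer(outputBuffer, offset: 0, index: 1)
    // The kernel internally steps the simulation once per audio sample; the
    // dispatch overhead is paid once per block, not once per sample.
    encoder.dispatchThreadgroups(
        MTLSize(width: 1, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
    encoder.endEncoding()
    cb.commit()
    cb.waitUntilCompleted()  // hard real-time deadline: return samples to the DAW
    // Copy outputBuffer.contents() into the host's audio buffer here.
}
```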

The audio buffer block system is important because GPU kernel scheduling has a certain amount of latency overhead, and for real-time audio we have fixed time constraints. By amortizing the GPU scheduling latency over, say, 512 audio samples (roughly 10.7 ms of wall-clock budget at 48 kHz), it becomes negligible. But the runtime of the kernel itself is still very important.

Basic Problem

Apple’s macOS is obviously extremely clever about power management, and Apple silicon hardware is built to support the OS in achieving high power efficiency.

As with all modern hardware, the clock rate for Apple silicon chips can be slowed down to reduce power consumption. When the OS detects that the processing demand for a given chip is low (or non-existent), it can decrease the clock rate for that chip. This is awesome.

The problem is that due to the way Anukari runs inside a DAW and interacts with the GPU, the heuristics that macOS uses to determine whether there is sufficient demand upon the GPU to increase its clock rate do not work.

Consider the chart below. The CPU does some preparatory work, there’s a small gap which represents kernel invocation latency, and then the GPU does a large block of work. Finally there’s another small gap representing the real-time headroom.

(An aside: chalkboards are way better than whiteboards, unless you enjoy getting high on noxious fumes, in which case whiteboards are the way to go.)

I don’t have any real knowledge of macOS’s heuristics for deciding when to increase the GPU clock speed, but I might reasonably guess that it relies on something like the load average. In the diagram above, the GPU load average might be only 60%, because between audio buffer blocks it is idle. Perhaps this does not meet the threshold for increasing the GPU clock rate.

But this is terrible for Anukari, because to meet real-time constraints, it needs the absolute lowest latency possible, which requires the highest GPU clock rate. I’m not sure how low the Apple GPU clock rate can go, but it definitely goes low enough to make Anukari unusable.

To be clear, it’s pretty understandable that macOS handles this situation poorly, because the GPU is mostly used for throughput workloads like graphics or ML. Audio on the GPU is really new, and there are only a couple of companies doing it right now.

Are you sure the clock rate is the problem?

Oh yes. Thankfully, Apple’s first-party Instruments tools that come with Xcode have a handy Metal profiler. Among other things, this is how I first learned that Anukari is ALU-bound.

The Metal profiler has an incredibly useful feature: it allows you to choose the Metal “Performance State” while profiling the application. This is not configurable outside of the profiler. This is how I first figured out that the GPU clock rate was the issue: Anukari works perfectly under the Maximum performance state, and abysmally under the Minimum performance state.

Wait, Anukari mostly works great on macOS. How is that possible?

Given the explanation above, this is a great question. Most macOS users find that Anukari works great. For example, I intentionally bought a base-model MacBook M1 for development, so that if Anukari worked well for me, I’d know it worked well for people with beefier hardware.

So how is it that Anukari works great on macOS for most people? Well, as Steve Jobs said, “It’s more fun to be a pirate than to join the Navy.”

Without being able to rely on macOS to do the right thing, I thought like a pirate and came up with a workaround: in parallel with the audio computation on the GPU, Anukari runs a second workload on the GPU that is designed to create a high load average and trick macOS into clocking up the GPU. This workload is tuned to use as little of the GPU as possible, while still creating a big enough artificial load to trigger the clock heuristics.

In other words, it runs a spin loop to heat up the GPU. On my MacBook M1, this completely solves the issue. Anukari runs completely reliably. I called this strategy “waste makes haste” and it is documented in detail on my devlog here.
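
To give a flavor of what this looks like, here is a toy version of the idea: a tiny kernel that does nothing but dependent ALU work, fed to the GPU from a dedicated CPU thread. This is a sketch, not Anukari's tuned kernel; the iteration count and dispatch size are arbitrary:

```swift
import Metal

// Toy "waste makes haste" spin kernel: a small amount of pure ALU work whose
// only job is to keep the GPU's load average high enough that macOS clocks
// the GPU up. The real kernel is carefully tuned; these numbers are arbitrary.
let spinSource = """
#include <metal_stdlib>
using namespace metal;

kernel void spin(device atomic_uint* sink [[buffer(0)]],
                 uint tid [[thread_position_in_grid]]) {
    float x = float(tid) + 1.0f;
    for (uint i = 0; i < 20000; ++i) {
        x = fma(x, 1.0000001f, 1e-7f);   // dependent FMAs: pure ALU work
    }
    if (x == 0.0f) {                      // never true; defeats dead-code elimination
        atomic_fetch_add_explicit(sink, 1u, memory_order_relaxed);
    }
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: spinSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "spin")!)
let queue = device.makeCommandQueue()!
let sink = device.makeBuffer(length: MemoryLayout<UInt32>.size, options: .storageModeShared)!

// A dedicated CPU thread keeps a steady trickle of these on its own queue,
// in parallel with the audio work.
func enqueueOneSpin() {
    let cb = queue.makeCommandBuffer()!
    let enc = cb.makeComputeCommandEncoder()!
    enc.setComputePipelineState(pipeline)
    enc.setBuffer(sink, offset: 0, index: 0)
    enc.dispatchThreadgroups(
        MTLSize(width: 1, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: 32, height: 1, depth: 1))
    enc.endEncoding()
    cb.commit()
}
```

The whole trick is tuning how much work the kernel does and how often it's enqueued: enough to trip the clock heuristics, and no more.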

To be clear, the spin loop is an unholy abomination and I hate that it’s necessary. But it is absolutely required for Anukari to work well for macOS users. And for most macOS users, it works great.

So with the “waste makes haste” strategy, what’s the problem?

Like I said, on my M1 things work perfectly. But then I released the Anukari Beta and some macOS users are having problems. What’s different?

First: I’m not certain. But I have a couple of hypotheses.

Weirdly, it appears that most users with performance issues are using Pro or Max Apple hardware. These have additional GPU chiplets. I am completely speculating here, but Apple’s hardware is amazing, so it stands to reason that each GPU chiplet's clock rate can be changed independently. Do you see where this is going?

If macOS is really smart, it should see that Anukari is running two independent GPU workloads: the physics kernel, and the spin kernel. Why not run those on separate GPU chiplets? That’s smart. And then, if I’m right that the GPU chiplets have independent clock rates, well, Anukari is hosed because the stupid spin workload will get a fast-clock GPU and the audio workload will get a slow-clock GPU.

I could be wrong. Another possibility is that the GPU has a single clock rate, and my spin workload is too conservative to convince the much more powerful GPU to clock up. Maybe my spin kernel can heat up an M1 GPU but not an M4 Pro GPU, because the M4 Pro is faster.

What do you think macOS should do to fix this?

Apple engineers will obviously know better than I do here, but I’ll present a couple of obvious possibilities.

Solution 1: On macOS, audio processing is done on a thread (or group of threads) called an Audio Workgroup. These are explained in Apple’s documentation here. Within an Audio Workgroup, the OS understands that the threads have real-time constraints, and prioritizes those threads appropriately. This is actually a fantastic innovation, because before Audio Workgroups, it wasn’t really possible to do real-time audio processing safely across multiple threads without problems like priority inversion, etc.

The Audio Workgroup concept could be extended to cover processing on the GPU. Any MTLCommandQueue managed by an Audio Workgroup thread could be treated as real-time and the GPU clock could be adjusted accordingly.

Solution 2: The Metal API could simply provide an option on MTLCommandQueue to indicate that it is real-time sensitive, and the clock for the GPU chiplet handling that queue could be adjusted accordingly.

Solution 3: Someone could point out that I'm an idiot and there's already some way to get what I want, and this whole post was a waste of my time. This would be wonderful.

Wouldn’t Apple’s new Game Mode help?

Game Mode certainly seems similar to what Anukari needs. However, Game Mode operates at the process level, and Anukari is mostly used as a plugin inside other processes, which don’t opt into Game Mode and which Anukari has no control over. Also, Anukari usually isn’t fullscreen, which Game Mode requires.

What about Windows performance?

Not a problem at all. I don’t know if it’s because Windows gives users more control over their system’s performance state, or if e.g. NVIDIA drivers are less careful about power consumption, or what. But the spin loop is not necessary on Windows.

It's not a great look for Apple that a Windows PC with a pretty wimpy GPU can run Anukari just fine, while the most expensive Mac M4 Max stutters, because obviously Apple's hardware is incredible and just needs to be let off the leash a bit.

Why don’t you just pipeline the GPU code so that it saturates the GPU?

For a throughput workload, this is exactly what I’d do. But Anukari is not throughput-sensitive; it is latency-sensitive.

The idea with pipelining is that maybe Anukari could schedule multiple physics simulation kernels in advance, so that the GPU could be processing the current audio sample block at the same time that the CPU is preparing the next block for the GPU. (This would also alleviate the kernel invocation latency overhead, but that’s not as important.)

But anyone who understands pipelining knows that it increases throughput at the cost of latency. Anukari processes audio in real-time, so each kernel invocation needs access to the real-time audio input data (e.g. from the microphone). Thus Anukari can’t do something like speculative execution where it processes the next audio block early, because it wouldn’t have the input data required to do so.

Why don’t you run the spin kernel in the same MTLCommandQueue as the physics kernel?

This solution might fix the problem where the spin and physics kernels end up running on separate GPU chiplets (if that is indeed the issue).

I have tried this, actually. The reason it doesn’t work is again because Anukari is latency-sensitive. What happens is that sometimes the spin kernel runs a little too long and cuts into the time for running the physics kernel. I experimented with small spin kernels, using volatile unified memory to allow the CPU to write an “exit kernel early” flag. Even with these hijinx, sometimes the spin kernel cut into physics kernel time.

Why not run a bunch of copies of the physics kernel and return the fastest result?

In distributed systems, such as databases, this is known as “request hedging,” and it can reduce tail latency and latency variance quite a bit. And in the case of Anukari, the extra copies, or hedges, would conceivably serve the purpose of providing load on the GPU to get the OS to increase the performance state.

We’ll call this GPU kernel hedging, and it would look something like the chart below:

In this example, for each audio block, 3 copies of the physics kernel are launched. For the first audio block, the result from GPU1 would be used because it is the first kernel to finish. For the second audio block, the result from GPU2 would be used. In both cases, the audio data would be returned to the host application before the other copies of the kernel finish.

Unfortunately there are a couple of reasons that this won’t work for Anukari.

The main issue has to do with what happens when one of the physics kernels takes more than one full audio block cycle to complete. In the diagram above, this happens with GPU3: it has not completed by the time the second audio block is requested. In the diagram, all the kernels started executing at the same time, but in reality it’s possible that the slowest kernels will be ones that the OS took a long time to schedule. It’s entirely possible that one of the hedged kernel streams will fall behind by several audio blocks.

So for a future audio block, somehow we will need to “catch up” any kernel streams that fell behind. This requires some kind of fast-forward, since we can’t wait for them to catch themselves up. To do that, we can simply copy the internal state from another kernel stream that is already caught up.

The problem is that this copy is expensive. The largest internal state that Anukari stores is the audio buffer used for delay lines. It stores 1 second of historical audio for every mic. This buffer’s size is 48,000 audio samples * 50 mics * 2 channels (L/R) * 16 voices * 4 bytes per float = 307 MB. And of course it’s larger at higher sample rates.

To do this efficiently we’d need to keep precise records of what parts of the sample buffer are “dirty” for each hedged kernel stream, and copy only the dirty parts. In practice, though, this will not be efficient because the memory layout for this buffer is heavily optimized to be fast for the particular read workload that the physics kernel performs. This means that even if the fast-forward copy was precise about copying only the minimum amount of data, it would be scattered about across the entire buffer. This copy will be slow. Furthermore, the internal delay line audio buffer is just one of several buffers that would need to be synchronized.

Another issue is that whenever the user changes the model, those changes would need to be fanned out to all the hedged kernels.

The final problem here is that the physics kernel has a substantially larger GPU footprint than the “waste makes haste” spin kernel, which is designed to have the smallest footprint that can reliably warm up the GPU. So kernel hedging would impose more wasted load on the GPU, reducing the number of Anukari instances that can be run in parallel, or even worse, causing competition between hedge kernels that leads to them all running more slowly.

Why not just make the GPU code more efficient?

Because the simulation is ALU-bound, there’s not much performance to be gained from the typical optimization low-hanging fruit: improving memory access patterns. To speed up Anukari’s physics kernel, the only thing that really helps is optimizing arithmetic throughput. For example, Anukari uses FP16 math where possible to better-saturate Apple’s ALUs. Instructions have been reordered using micro-benchmarks. All physics state is in L1 memory. Loads are reordered for vectorization. The list goes on.

Furthermore, Anukari heavily exploits the fact that threads within Apple’s SIMD-groups (roughly) share an instruction pointer. Different physics objects have highly divergent code branch paths, so simulating two types of objects within a SIMD-group is slow due to the requirement for instruction masking while the top-level branch paths are executed serially. To avoid this, Anukari dynamically optimizes the memory layout of physics objects to minimize the number of object types executed within each SIMD-group. This optimization is described here in great detail, and is a massive performance win.
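
The core of that layout optimization is easy to picture: arrange the objects so that, as much as possible, each 32-wide SIMD-group only ever sees a single object type. A toy Swift sketch of the idea (the types and fields are invented for illustration; the real optimizer also has to respect connectivity and threadgroup boundaries):

```swift
// Toy illustration of the SIMD-group layout idea: group physics objects by
// type before uploading them to the GPU, so each 32-wide SIMD-group executes
// (mostly) one top-level branch path. Types and fields are invented here.
enum ObjectType: Int, Comparable {
    case mass, spring, mic, exciter
    static func < (a: ObjectType, b: ObjectType) -> Bool { a.rawValue < b.rawValue }
}

struct PhysicsObject {
    var type: ObjectType
    // ... per-object parameters ...
}

func layoutForSIMDGroups(_ objects: [PhysicsObject]) -> [PhysicsObject] {
    // Sorting by type is the simplest version of the idea: after this, a run
    // of 32 consecutive objects is almost always a single type, so the
    // SIMD-group executing them doesn't pay for divergent branches.
    objects.sorted { $0.type < $1.type }
}
```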

What I'm trying to say is that I have bent over backwards to squeeze every last drop of performance out of Apple's hardware. It's been fun to do!

There are further arithmetic optimizations to be done, but they will be fairly marginal. We’re talking single-digit percentage point speedups. Anukari’s GPU code is already VERY FAST. If you’re curious, there’s more info on Anukari’s optimizations here.

Why use the GPU at all?

On powerful machines, Anukari can simulate 768–1024 physics objects. Each object can be arbitrarily connected to other objects, meaning that they influence one another. Each object has to be stepped forward in an implicit Euler integration at the audio sample rate, typically 48,000 samples per second. Each object has somewhere between 3 and 10 parameters that affect its behavior. Some of the behaviors involve expensive math, like vector rotation, exp(), log(), etc. Oh, and on top of all that, Anukari runs up to 16 entire copies of the physics simulation in parallel for polyphony!
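
To put rough numbers on that: 1,024 objects * 16 voices * 48,000 integration steps per second is roughly 790 million object-updates every second, before counting the tens of arithmetic operations inside each update for spring forces, rotations, exp(), log(), and so on. That's tens of billions of branch-heavy operations per second, with a hard deadline every few milliseconds.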

This is simply not even close to feasible on the CPU! Trust me, I tried. It’s not even a little bit close. The GPU just has a ridiculous number of ALUs, it gives me explicit control over the L1 cache layout, and concurrency constructs like threadgroup_barrier allow the physics integration steps to be done massively in parallel without consistency issues or expensive CPU mutexes.

I’ll say it again: Anukari does not exist without GPU processing.

Why should Apple care about what Anukari needs?

Maybe they shouldn’t, I don’t know. Anukari is a tiny startup. It’s a niche product. It’s doing something weird.

On the other hand, the people who like Anukari really like it. Mick Gordon, the composer for IMO the greatest DOOM games of all time, randomly showed up and blew everyone away with an incredible demo using Anukari. Anukari is receiving praise from people who really love synthesizers, like CDM. Comments have appeared on random internet threads that I didn't start, like, "it's the most creative plugin I've tried in the last 10 years."

But mostly, Anukari is using Apple’s hardware in what I consider an incredibly cool way, allowing users to do something that they’ve never been able to do before. And Apple’s hardware is completely up to the task. But it just needs a little push in the right direction.

Why don’t you use GPU Audio’s APIs?

I don’t really want to include this last question, but it is here to address the fact that the CEO of GPU Audio, Alexander Talashov, likes to drive by threads about Anukari and suggest that if Anukari used his APIs it would solve our problems. I would be so happy if this were true.

Alexander is a great guy, and his product (GPU Audio) is a great product. I’ve met him in person, and he’s incredibly passionate about making the GPU accessible for DSP. Audio folks should check it out, it’s really cool. I wish GPU Audio huge success, and support their product.

But… GPU Audio has nothing useful for Anukari. Fundamentally, the problem is that Anukari is not anything like a traditional DSP application. It’s a numerical differential equation integrator, far more similar to a video game physics engine than to a DSP application. Sure, there are bits and bobs of DSP inside the physics engine; for example, Anukari’s mics in the physics world do have compression, and that’s done inline with the physics calculations on the GPU. But the vast majority of the computation is Euler integration.

It needs to be understood that I’m programming the GPU at the bare Metal layer (yes, pun intended), and am taking advantage of a number of hardware features and ridiculous domain-specific optimizations that are quite necessary to make this work. And all I need is for Apple to reliably turn up the clock rate on the GPU.
