Huge macOS performance improvements
Captain's Log: Stardate 78871.1
Yesterday I pushed out a 0.9.6 pre-release version of Anukari that has some extremely promising performance improvements for macOS. On my local test machines, it's working flawlessly, but I'm hesitant to declare victory until I hear back from a few more users.
A big thank you to Apple
I am not an Apple shill. Some of my past devlog entries show some, shall we say, "mild frustration" with Apple.
But I want to give credit where credit is due, and Apple has been incredibly over-the-top helpful with Anukari's performance problems.
The specific engineers I've been speaking with have been slammed with prep work for WWDC, and I know what that's like from working at Google when I/O is coming up. Yet they still have spent an inordinate amount of time talking to me, answering questions, etc. So: Apple folks, you know who you are, and thank you!
The especially good news here is that Apple isn't just spending this time with me to make Anukari work well, but is using this as an opportunity to improve the Metal APIs for all similar latency-sensitive use cases. I got lucky that I was in the right place at the right time, and Apple saw an opportunity to work super closely with an outside developer who cared a LOT about latency.
Lower-latency kernel launches and waits
Based on Apple's help up to this point, the GPU performance state problems I was having are no longer an issue. But there were still some cases where I was not satisfied with Anukari's macOS performance.
Using Metal's GPU timestamps, I came to the conclusion that the actual runtime of the kernel is pretty stable now: the GPU is in a consistent performance state, and the GPUEndTime - GPUStartTime duration is also consistent.
However, looking at the end-to-end duration from encoding the MTLCommandBuffer, to telling the kernel to launch (MTLCommandBuffer commit), to receiving the results back on the CPU (MTLCommandBuffer waitUntilCompleted), it was a bit longer than I'd like, and more importantly it was really inconsistent. There was a lot of jitter.
I spoke with an Apple engineer, and they suggested that I try using MTLSharedEvent both to trigger the launch of the kernel and to wait for the result. Basically the idea was to do the encoding work for the next command buffer on the CPU while waiting for the previous one to finish. The next buffer would include a command to wait for an event before starting the kernel, and a command to signal another event after it finished.
The part about doing the encoding work while waiting for the previous buffer is a no-brainer. The encoding work takes around 50us and all of that can be saved by doing it in parallel with the GPU work. I had considered this before, but at that time 50us was not worth the effort. Now, though, I've cut out so much other slack that it was worth it. There was a bit of complexity having to do with parameters for the kernel that are unknown at the time of encoding -- these had to move to buffers that get written later when the parameters become known. But overall this was pretty straightforward.
However, the part about using MTLSharedEvent waitUntilSignaledValue for the CPU to block on kernel completion didn't seem as obviously beneficial. Using MTLCommandBuffer waitUntilCompleted seemed like basically the same thing to me. But implementing this was even easier than the command double-buffering, so of course I tried it out. And I'm glad I did, because as Apple predicted, it had much lower latency. Clearly the OS/firmware services these two blocking calls in different ways, and for whatever behind-the-scenes reason, the MTLSharedEvent version works way better.
So I would definitely recommend to anyone trying to achieve super low-latency kernel launches: use MTLSharedEvent both to start the kernel and to wait for it to finish, and use double-buffering to prepare each command buffer on the CPU while the previous one is running on the GPU. It makes a big difference. I am now seeing < 50us of scheduling/waiting overhead.
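To make the pattern concrete, here is a minimal sketch of what this looks like using the metal-cpp C++ bindings. This is not Anukari's actual code: the struct, function names, buffer indices, dispatch sizes, and the timeout value are all illustrative. Each in-flight command buffer gets a pair of shared events, one that the kernel waits on (so the buffer can be committed early and the late-bound parameters written afterward) and one it signals on completion (so the CPU can block with waitUntilSignaledValue instead of waitUntilCompleted).

```cpp
#include <cstring>
#include <Metal/Metal.hpp>

// One set of sync objects per in-flight command buffer (two total for
// double-buffering).
struct SlotEvents {
    MTL::SharedEvent* start;  // CPU signals this to release the kernel
    MTL::SharedEvent* done;   // GPU signals this when the kernel has finished
    uint64_t value = 0;       // monotonically increasing per dispatch
};

// Encode and commit the *next* block while the previous one is still running.
void EncodeNextBlock(MTL::CommandQueue* queue, MTL::ComputePipelineState* pso,
                     MTL::Buffer* params, MTL::Buffer* io, SlotEvents& ev) {
    ++ev.value;
    MTL::CommandBuffer* cmd = queue->commandBuffer();
    // The kernel must not start until the CPU has written the late-bound
    // parameters (block size, etc.) and signaled `start`.
    cmd->encodeWait(ev.start, ev.value);
    MTL::ComputeCommandEncoder* enc = cmd->computeCommandEncoder();
    enc->setComputePipelineState(pso);
    enc->setBuffer(params, 0, 0);
    enc->setBuffer(io, 0, 1);
    enc->dispatchThreadgroups(MTL::Size(64, 1, 1), MTL::Size(256, 1, 1));
    enc->endEncoding();
    cmd->encodeSignalEvent(ev.done, ev.value);
    cmd->commit();  // committed early, but held back by the wait on `start`
}

// Called when the audio block arrives and the parameters are finally known.
void RunAudioBlock(MTL::Buffer* params, SlotEvents& ev,
                   const void* lateParams, size_t lateSize) {
    std::memcpy(params->contents(), lateParams, lateSize);  // shared storage
    ev.start->setSignaledValue(ev.value);  // release the pre-committed kernel
    // Blocking here had far lower latency and jitter than waitUntilCompleted.
    ev.done->waitUntilSignaledValue(ev.value, /*timeoutMS=*/10);
}
```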
But nothing is that simple
After making all these changes to improve Anukari's macOS performance, I went back to the CUDA backend on Windows and found that I had severely slowed it down. About 90% of the GPU code is shared between Metal/CUDA/OpenCL, so when I change things it often affects performance on all three backends in different ways. It's a bit like putting a fitted sheet on a bed -- you get one corner on, but then when you pull the next corner on, the previous corner slips back off.
(As an aside, this is the reason Anukari does not yet support Linux. It's a lot of work to deal with the fitted sheet issue for two platforms already. Linux support will come after Windows and macOS are super performant and stable.)
After some git bisecting, I found that the CUDA performance regression was caused by moving some of the kernel's parameters to a device memory buffer. This was part of moving the Metal implementation to encode the next command in parallel with the current one: parameters like the audio block size that aren't known until the next audio block is being processed can't be kernel parameters, but rather have to be written to device memory later, after the kernel has been committed but before the MTLSharedEvent is signaled to start kernel execution.
The kernel parameters in question are tiny, perhaps 64 bytes in total. On macOS this buffer is marked as immutable, and I saw zero performance degradation from having the kernel read it from device memory as opposed to receiving it as arguments.
However on CUDA, there was a huge performance loss from this change. Reading these 64 bytes of parameters from __constant__ device memory in every kernel thread caused the overall kernel to run 20-30% slower, as compared to passing the data directly as kernel parameters.
I ruled out the extra cuMemcpyHtoDAsync call for the new memory segment as the cause of the increased latency. Careful timing showed that the extra time really was in the kernel execution itself.
I don't know why this is so much slower, but I have a hypothesis, which is that when the constant data is passed via a kernel parameter, rather than __constant__ device memory, CUDA is capable of doing some kind of inlining that is otherwise not possible. For example, maybe it can rewrite some of the kernel instructions to use immediate operands instead of registers. Or possibly it is actually unrolling some loops at runtime. (Something to experiment with.)
Anyway the solution here was simply to bifurcate the GPU backends a little bit and have the CUDA backend pass the arguments directly to the kernel, rather than using a __constant__ buffer. This works just fine for now, but I may eventually need to find an alternative if I want to apply the same background encoding concepts that I've been successful with on Metal to the CUDA backend.
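For the curious, here is a rough sketch of the two variants on the CUDA side. The struct, fields, and kernel names are made up for illustration, and the real simulation kernel is obviously far larger.

```cpp
// CUDA C++ sketch of the two ways the ~64 bytes of per-block parameters can
// reach the kernel. Names are illustrative, not Anukari's real ones.
struct BlockParams {
    int   blockSize;
    float sampleRate;
    // ...a handful of other small fields, ~64 bytes total
};

// Variant A: parameters live in __constant__ device memory and get written
// with cuMemcpyHtoDAsync after encoding. On this workload, having every
// thread read them from constant memory made the kernel 20-30% slower.
__constant__ BlockParams gParams;

__global__ void SimulateFromConstantMem(float* audioOut) {
    const int n = gParams.blockSize;
    for (int i = 0; i < n; ++i) { /* ...simulation work... */ }
}

// Variant B: parameters passed by value as kernel arguments, which is what
// the CUDA backend went back to. The hypothesis is that this lets the
// driver/JIT specialize the code (immediate operands, loop unrolling).
__global__ void SimulateFromArgs(BlockParams params, float* audioOut) {
    const int n = params.blockSize;
    for (int i = 0; i < n; ++i) { /* ...same simulation work... */ }
}
```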
For now, things are fast again and I'm going to move on to some other work before taking a further pass at improving the CUDA backend. My findings here do suggest that there may be other CUDA optimizations possible by trying to move more data into direct kernel parameters instead of __constant__ buffers. Given the performance difference, I will definitely experiment with this.
And hey, some new features!
While performance is still my main focus, I did take a couple breaks from that to implement some new features.
The one that I'm by far the most excited about is that Anukari's modulation system now has a target called MIDI Note Trigger, which allows any object that can be triggered via MIDI notes to be triggered through modulation as well (or instead).
The way it works is that when a modulation signal goes from below 0.5 to above 0.5, that's a note on event. And when it drops from above 0.5 to below 0.5, that's a note off.
This is extremely simple, but it opens up a gigantic range of possibilities. LFOs can now trigger envelopes. Envelope triggers can now be used to delay note on events to create arpeggiator effects. Mallets can be triggered at insane speeds. Envelope followers can be used to trigger oscillators when an audio input signal gets loud enough. The list goes on.
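Under the hood this is nothing more than threshold-crossing detection on the modulation signal. A minimal sketch of the idea (illustrative, not Anukari's actual implementation):

```cpp
// Turns a continuous modulation signal into MIDI-style note on/off events by
// watching for crossings of the 0.5 threshold.
class MidiNoteTrigger {
public:
    // Called once per modulation update with the current modulation value.
    void Process(float modValue) {
        const bool above = modValue > 0.5f;
        if (above && !wasAbove_) OnNoteOn();   // crossed 0.5 going up
        if (!above && wasAbove_) OnNoteOff();  // crossed 0.5 going down
        wasAbove_ = above;
    }

private:
    void OnNoteOn()  { /* trigger the target as if a MIDI note-on arrived */ }
    void OnNoteOff() { /* release the target as if a MIDI note-off arrived */ }
    bool wasAbove_ = false;
};
```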
Apple performance progress
Captain's Log: Stardate 78825.2
Since my last update, I've spent all my time trying to take advantage of the insights the engineers on the Apple Metal team gave me about how to convince the OS to increase the performance state so that Anukari will run super smoothly.
A big piece of news is that I bit the bullet and bought a used MacBook Pro M4 Max so that I can iterate on testing these changes much more quickly, and verify that the same tuning parameters work well on both my wimpy M1 Pro and my beastly M4 Max.
Things are working MUCH better now on the M4 Max. At least for me, I can easily run 6 plugin instances in Logic Pro with no crackling at all (512-sample buffer, 48 kHz). That said, there's still a weird issue I've seen on my M1 Pro where sometimes a plugin instance will get "stuck" performing badly, and I don't know why. Simply changing the simulation backend to OpenCL and back to Metal fixes it, so it's something stateful in the simulator. More work to do!
Secret NDA stuff
One thing I've been doing is applying the details that Apple shared under NDA to tickle the OS performance state heuristics in just the right way. The nice thing here is that, knowing a tiny bit more about the heuristics, I'm able to be much more efficient and effective with how I tickle them. This means that the spin loop kernel uses fewer GPU resources to get the same result, and thus it interferes less with the audio kernel and uses less power. I think this is as much detail as I can go into here, but honestly it's not that interesting, just necessary.
Spin loop serialization
Separately from the NDA stuff, I ran into another issue that was likely partly to blame for the horrible performance on the most powerful machines.
The Apple API docs for MTLCommandBuffer/commit say, "The GPU starts the command buffer after it starts any command buffers that are ahead of it in the same command queue." Notice that it doesn't say that it waits for any command buffers ahead of it to finish. From what I can tell, Metal actually is quite eager to run these buffers in parallel if it thinks it's safe to do so from a data-dependency perspective, and of course if there are cores available.
The spin loop kernel runs for a very short duration, and thus there's a CPU thread that's constantly enqueuing new command buffers with blocks of spin kernel invocations.
On my M1 Pro, it seems that the spin kernels that were queued up in these buffers ran in serial, or at least mostly in serial. But on my new M4 Max, it appears that the Metal API is extremely aggressive about running them in parallel, to the extent that it would slow down other GPU work during brief bursts of "run a zillion spin kernels at once."
The solution was very simple: use an MTLEvent to guarantee that the queued spin kernels run in serial.
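In code this amounts to threading a single event value through the queued buffers. A rough metal-cpp sketch, with made-up function names and dispatch sizes:

```cpp
#include <Metal/Metal.hpp>

// Chain spin-kernel command buffers through an MTLEvent so that buffer N+1
// cannot start until buffer N has signaled. `chain` comes from
// device->newEvent() and `chainValue` starts at 0.
void EnqueueSpinBlock(MTL::CommandQueue* queue,
                      MTL::ComputePipelineState* spinPso,
                      MTL::Event* chain, uint64_t& chainValue) {
    MTL::CommandBuffer* cmd = queue->commandBuffer();
    cmd->encodeWait(chain, chainValue);           // wait for the previous block
    MTL::ComputeCommandEncoder* enc = cmd->computeCommandEncoder();
    enc->setComputePipelineState(spinPso);
    enc->dispatchThreadgroups(MTL::Size(1, 1, 1), MTL::Size(32, 1, 1));
    enc->endEncoding();
    cmd->encodeSignalEvent(chain, ++chainValue);  // let the next block start
    cmd->commit();
}
```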
Getting the spin loop out of the audio thread
The original design for the spin loop kernel was that one of the audio threads would take ownership of it, and that thread was responsible for feeding in the buffers of spin kernels to run. There's never any need to wait for the results of a spin kernel, so the audio thread was only paying the cost of encoding command buffers.
I built it this way out of sheer convenience: the audio thread had an MTLDevice and MTLLibrary handy, so it was easy to put the spin loop there. Also, errors in the spin loop were exposed to the rest of the simulator machinery, which allows for retries, etc.
But that's not a great design. First, even if the encoding overhead is small, it's still overhead that's simply not necessary to force upon the audio thread. And because MTLCommandQueue will block if it's full, bugs here could have disastrous results.
So finally I moved the spin kernel tender out of the audio thread and into its own thread. The audio threads collectively use a reference-counting scheme to make sure a single spin loop CPU thread is running while any number of simulation threads are running. The spin loop thread is responsible for its own retries. This removes any risk of added latency inside the audio threads.
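Roughly, the ownership scheme looks like the sketch below. This is a simplified illustration with made-up names; the real thread also owns the error handling and retries mentioned above.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// The first simulation instance to start brings the spin-loop thread up; the
// last one to stop tears it down. AddRef/Release are called on instance
// start/stop, not from the real-time audio callback.
class SpinLoopManager {
public:
    static void AddRef() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (refCount_++ == 0) {
            running_.store(true);
            thread_ = std::thread([] { SpinLoop(); });
        }
    }

    static void Release() {
        std::thread toJoin;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (--refCount_ == 0) {
                running_.store(false);
                toJoin = std::move(thread_);
            }
        }
        if (toJoin.joinable()) toJoin.join();
    }

private:
    static void SpinLoop() {
        while (running_.load()) {
            // Enqueue the next block of spin kernels, handling errors and
            // retries here instead of on the audio thread.
        }
    }

    static inline std::mutex mutex_;
    static inline int refCount_ = 0;
    static inline std::atomic<bool> running_{false};
    static inline std::thread thread_;
};
```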
Inter-process spin loop deduplication
While I was rewriting the spin loop infrastructure, I decided to fix another significant issue that's been there since the beginning.
Anukari has always made sure that only a single spin loop was being run at a time by using static variables to manage it. So for example, if the user has 5 instances of Anukari running in their DAW, only one spin loop will be running.
But this has problems. For example, because Anukari and AnukariEffect are separate DLLs, they have separate static storage. That means that if a user has both plugins running in their DAW, two spin loops will run, because the separate DLLs don't communicate. This is bad, because any additional spin loops are wasting resources that could be used for the simulations!
Even worse, though, are DAWs that do things like run plugins in a separate process. Depending on how this is configured, it might mean that many spin loops are running at once, because each Anukari instance is in a separate process.
In the past I had looked at ways to do some kind of really simple IPC to coordinate across processes, and also across DLLs within the same process. But none of the approaches I found were workable:
- Apple's XPC: It's super heavyweight, requiring Anukari to start a spin loop service that all the plugins communicate with.
- POSIX semaphores: This would work except for the tiny detail that they aren't automatically cleaned up when a process crashes, so a crash could leave things in a bad state that's not recoverable without a reboot.
- File locks (F_SETLK, O_EXLOCK): Almost perfect, except that locks are per-process, not per-file-descriptor. So this would not solve the issue of the two DLLs needing to coordinate: they would both be able to acquire the lock concurrently since they're in the same process.
- File descriptor locks (F_OFD_SETLK): Perfect. Except that macOS doesn't support them.
Today, though, I found a simple approach and implemented it, and it's working perfectly: in the 0.9.5 release of Anukari, there is guaranteed to be only a single spin loop thread globally, across all processes.
It turns out the simplest way to do this was to mmap() a shared memory segment from shm_open(), and then to do the inter-process coordination using atomic variables inside that segment. I already had the atomic-based coordination scheme implemented from my original spin system, so I just transplanted that to operate on the shared memory segment, and voila, things work perfectly. The atomic-based coordination is described in the "Less Waste Still Makes Haste" section of this post.
The nice thing with this scheme is that because the coordination is based on the spin loop CPU thread updating an atomic keepalive timestamp, cleanup is not required if a process crashes or exits weirdly. The keepalive will simply go stale and another thread will take ownership, using atomic instructions to guarantee that only one thread can become the new owner.
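For reference, here is roughly what the mechanism looks like. This is a simplified sketch: the segment name, the staleness window, and the field layout are illustrative, and the real code also handles the owner's periodic heartbeat and handoff.

```cpp
#include <atomic>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Shared between all Anukari processes/DLLs via shm_open + mmap. A fresh
// segment is zero-filled by ftruncate, so the keepalive starts out stale.
struct SharedState {
    std::atomic<uint64_t> keepaliveNanos;  // last heartbeat from the owner
};

SharedState* MapSharedState() {
    int fd = shm_open("/anukari_spinloop", O_CREAT | O_RDWR, 0666);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, sizeof(SharedState)) != 0) { close(fd); return nullptr; }
    void* mem = mmap(nullptr, sizeof(SharedState), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    return mem == MAP_FAILED ? nullptr : static_cast<SharedState*>(mem);
}

// Returns true if this thread should become the spin loop owner. The owner
// then keeps storing the current time into keepaliveNanos as its heartbeat.
bool TryClaimOwnership(SharedState* s, uint64_t nowNanos) {
    uint64_t last = s->keepaliveNanos.load(std::memory_order_acquire);
    if (nowNanos - last < 250'000'000) return false;  // owner still alive
    // Only one claimant can swap the stale timestamp for the current time.
    return s->keepaliveNanos.compare_exchange_strong(
        last, nowNanos, std::memory_order_acq_rel);
}
```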
Had a super productive conversation with an Apple Metal engineer
Well, my An Appeal to Apple post got way more traction than I had ever hoped for, especially on Hacker News, but also on LinkedIn, and just generally with friends who went out of their way to help me out. So first, thank you everyone, for helping me make contact with Apple. It worked!
I just got off a call with an engineer on the Apple Metal team, and the conversation was extremely friendly and helpful.
Unfortunately I can't share much in the way of crazy technical detail, because it turns out that when I created my Apple Developer account, I signed a telephone book-sized NDA with Apple. Obviously I enjoy writing about technical stuff, so it's a bit sad that I can't do so here, but at the same time the fact that I had pre-signed an NDA allowed the Metal engineer to open up and share some extremely helpful information. So overall I can't complain.
While I can't share any technical details, I can say that Apple was very sympathetic to the Anukari cause. The engineer provided some suggestions and hints that I can use right now to maybe — just maybe — get things working in the short term. But just as importantly they already have long-term plans to make more kinds of weirdo use cases like Anukari work better, while still being super power efficient. And they'll be able to use Anukari as a test-bed for some of that work.
I now have an open line of communication with the right folks at Apple, which is great. As I work on short-term performance improvements based on the hints they provided, I plan to run some ideas by them in terms of what I might be able to write about in detail without spilling any beans that need to remain un-spilled.
Thanks again, everyone!