
RAM savings in the 0.9.26 release

Captain's Log: Stardate 79645.8

I finally got the 0.9.26 release out yesterday (download, release notes), and I'm really happy with everything I managed to get into it.

Anukari's RAM use had been bothering me for some time. Funnily enough, I only became acutely aware of it after I moved the audio processing from the GPU to the CPU.

Delay Line Buffer Optimization

Anukari supports delay lines of up to 1 second in duration. Delay lines run from the virtual microphones to audio input exciters. So each mic needs to store 1 second of audio sample history. Anukari supports up to 50 mics, and to complicate things further, in polyphonic mode, there may be up to 16 instances of each mic. So storing 1 second of 32-bit precision audio data at 48 kHz for 16 instances of 50 mics requires ~154 MB of RAM.
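(The arithmetic: 48,000 samples/sec × 4 bytes/sample × 50 mics × 16 voices × 1 second = 153,600,000 bytes, i.e. roughly 154 MB.)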

Since the physics simulation is running in a realtime audio thread, it is highly undesirable to do dynamic memory allocation. Generally speaking, memory allocation is a blocking operation, and often will acquire a mutex, and even if it doesn't, it might take a while if memory is fragmented, etc. So for any memory blocks used by the audio thread, at startup Anukari allocates buffers that are of the maximum required size. This eliminates the need for dynamic allocation at runtime, but insofar as those buffers go unused, it wastes memory.

When Anukari's physics simulation ran on the GPU, the 154 MB delay buffer was allocated in GPU memory. This was actually kind of handy, because most audio applications don't use much GPU memory, so in some ways this memory was "free." Of course that doesn't apply to Apple's unified memory architecture. But for many users, this RAM usage was not really visible.

When I moved the simulation to the CPU, 154 MB of RAM usage that was sometimes hidden before became quite visible. Especially for users running many instances of Anukari in parallel, this can add up pretty quickly.

For the 0.9.26 release I decided it was time to do something about this. Most Anukari presets have fewer than 50 mics, don't use all 16 voices of polyphony, and don't have delay lines attached to every single mic. In fact many presets don't even have a single delay line, so this 154 MB of RAM is completely wasted.

Now, could I get away with just dynamically allocating this memory in the audio thread? Probably. The allocation would only happen when the user adds a delay line to an existing preset, increases the polyphony setting, etc. If a small audio glitch happened at that moment it probably wouldn't be super noticeable.

But I've taken a very hard line on this issue. My goal is for Anukari's audio to be absolutely rock solid with no compromises. I'm not claiming I've fully accomplished that, but it's the north star. So that means no allocation on the audio thread, no taking the easy way out.

The solution I came up with is as follows:

  1. There is a background allocator thread.
  2. Delay buffer pointers are atomics.
  3. The audio thread can describe the allocations it needs in a struct that gets pushed into a lock-free SPSC queue that the allocator thread reads.
  4. The allocator thread looks at each new allocation spec and, if necessary, allocates the memory and updates the atomic delay buffer pointers.

This allows the memory allocation to happen in a non-realtime thread without ever blocking the audio thread.
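To make this concrete, here's a rough sketch of the scheme in C++. Everything here is made up for illustration (names, sizes, and details differ from the actual Anukari code):

```cpp
#include <atomic>
#include <cstddef>

// A lock-free single-producer/single-consumer ring queue. The audio thread
// pushes allocation requests; the allocator thread pops them. Neither side
// ever blocks.
template <typename T, size_t N>
class SpscQueue {
 public:
  bool Push(const T& item) {  // called only by the audio thread
    const size_t head = head_.load(std::memory_order_relaxed);
    const size_t next = (head + 1) % N;
    if (next == tail_.load(std::memory_order_acquire)) return false;  // full
    items_[head] = item;
    head_.store(next, std::memory_order_release);
    return true;
  }
  bool Pop(T* out) {  // called only by the allocator thread
    const size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
    *out = items_[tail];
    tail_.store((tail + 1) % N, std::memory_order_release);
    return true;
  }
 private:
  T items_[N];
  std::atomic<size_t> head_{0}, tail_{0};
};

// Step 3: the audio thread describes what it needs without allocating.
struct AllocationSpec {
  int delay_index;     // which delay buffer
  size_t num_samples;  // capacity actually needed
};

std::atomic<float*> delay_buffers[50 * 16];  // step 2: atomic pointers
SpscQueue<AllocationSpec, 256> requests;

// Step 4: the background allocator thread services requests at its leisure.
void AllocatorThreadStep() {
  AllocationSpec spec;
  while (requests.Pop(&spec)) {
    float* fresh = new float[spec.num_samples]();  // fine: not realtime
    delay_buffers[spec.delay_index].store(fresh, std::memory_order_release);
    // The old pointer must be retired, not freed immediately -- see below.
  }
}
```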

There's an incredibly important wrinkle though: when a buffer is no longer needed, we need to free it, or else we'll leak memory. This is subtle, because we must guarantee that the audio thread will never again access a given buffer before we free it; otherwise an invalid memory access will occur and we'll probably crash.

The trick I used to handle deallocation is as follows:

  1. When the allocator thread reallocates or deallocates a buffer, it updates the atomic pointer, and then pushes the old pointer to a local "to-free" queue, in a struct that also contains an epoch integer.
  2. The audio thread and allocator thread share an atomic epoch counter, which the audio thread increments each time it finishes processing an audio block.
  3. The audio thread is allowed to safely access stale delay buffer pointers until it increments the epoch value.
  4. The allocator thread periodically checks the next item in the to-free queue. If the item's epoch is less than the atomic epoch counter, it is safe to free it.
  5. Thus, as long as the audio thread never caches a buffer pointer past the point when it increments the epoch counter, it will never access freed memory.
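Sketched in the same made-up style as above, the reclamation side looks roughly like this:

```cpp
#include <atomic>
#include <cstdint>
#include <deque>

std::atomic<uint64_t> epoch{0};  // step 2: incremented per audio block

struct Retired {
  float* ptr;
  uint64_t retired_at;  // epoch at which the pointer was swapped out
};
std::deque<Retired> to_free;  // step 1: local to the allocator thread

// Allocator thread: publish the new buffer, retire the old one.
void ReplaceBuffer(std::atomic<float*>& slot, float* fresh) {
  float* old = slot.exchange(fresh, std::memory_order_acq_rel);
  if (old != nullptr) {
    to_free.push_back({old, epoch.load(std::memory_order_acquire)});
  }
}

// Allocator thread, periodically (step 4): anything retired before the
// audio thread's current epoch can no longer be referenced, so free it.
void Reclaim() {
  const uint64_t now = epoch.load(std::memory_order_acquire);
  while (!to_free.empty() && to_free.front().retired_at < now) {
    delete[] to_free.front().ptr;
    to_free.pop_front();
  }
}

// Audio thread, at the end of every audio block (step 2):
//   epoch.fetch_add(1, std::memory_order_release);
```

The key property is that a pointer is only freed after the audio thread has completed at least one full block since the pointer was swapped out, so there is never a window where the audio thread can reach freed memory.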

This may sound a bit complicated, but it's all less than a hundred lines of code. I am very thankful that I have a bunch of thorough fuzz tests to give me confidence that it works. :)

Note that this solution can't immediately be generalized to all memory allocations. Because it is asynchronous, and the allocator thread is not realtime, there will be a period where the audio thread has requested memory but does not yet have it. So this technique only applies to cases where the audio thread can operate without the memory for that duration.

This is fine for delay lines. It just means that after the user adds a new delay line, it may take a moment before the delay line's audio signal begins arriving at its destination. In practice, this duration is imperceptibly small, and doing things this way guarantees realtime safety so there will never be an audio glitch when adding a delay line.

Other RAM Savings

The delay line buffer allocation was the most interesting/difficult RAM optimization, but I did many other optimizations that were a lot simpler.

First, I tightened up the static size for several buffers and queues. There were several instances where I had sized a buffer way too conservatively, and was able to save a few tens of megabytes of RAM this way. Again, here I was relying on my comprehensive fuzz tests (and unit tests) to make sure that I didn't break anything by resizing these buffers.

The largest and most embarrassing savings of all, though, came from evicting assets from Anukari's 3D renderer cache. This was essentially trivial to do and obviously safe, so it's a bit of a facepalm that I hadn't done it before.

A bit of background: I have spent many hours obsessing over how quickly Anukari loads, both from a cold start as well as when the GUI is closed and re-opened when running as a DAW plugin.

One bottleneck in the cold start was the parsing and initialization of the 3D assets. This includes decoding the files, decompressing them when necessary, building collision data structures, stuff like that. I did some testing and found that parallelizing this work massively reduced the GUI's cold start latency.

The way I parallelized this is as follows:

  1. The GUI enumerates all the assets it needs.
  2. The asset cache then spawns a bunch of threads and loads the assets in parallel, writing the final results to a dictionary cache.
  3. The GUI reads the assets from the cache as they become available.
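In pseudo-C++, the pattern looks something like this (hypothetical names, simplified from the real thing):

```cpp
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct Asset { /* decoded geometry, textures, collision data, ... */ };

// Stand-in for the heavy work: decoding, decompression, collision structures.
Asset LoadAsset(const std::string& path) { return Asset{}; }

std::map<std::string, Asset> asset_cache;  // step 2's dictionary cache
std::mutex cache_mutex;

void LoadAssetsInParallel(const std::vector<std::string>& paths) {
  std::vector<std::future<void>> workers;
  workers.reserve(paths.size());
  for (const auto& path : paths) {
    workers.push_back(std::async(std::launch::async, [path] {
      Asset asset = LoadAsset(path);  // expensive part runs concurrently
      std::lock_guard<std::mutex> lock(cache_mutex);
      asset_cache.emplace(path, std::move(asset));
    }));
  }
  // The GUI polls asset_cache and picks up entries as they appear (step 3).
}
```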

The memory-wasting problem was that after the GUI finished reading all of the asset data, it never told the asset cache that it was finished. So the asset cache just stuck around for the lifetime of the 3D renderer, wasting memory for no reason.

This was an easy fix! The GUI already knew when it was done with the cache, so I simply had to add a line of code to clear it. This saved 100+ MB of memory for simple skyboxes and 3D assets. But for ultra high-resolution skyboxes and more detailed 3D models, it saves way more memory, potentially a few hundred MB. Pretty nice!

Audio quality improvements

Captain's Log: Stardate 79463

My last couple of posts were about annoying website engineering stuff that I would have preferred to not spend time on. Fortunately while annoying, that wasn't a lot of work, and most of my time has still gone to working on the Anukari software itself.

A couple of weeks ago I released version 0.9.23, which was focused on audio quality improvements (full release notes here). There are also some pretty significant performance improvements, for example instruments with lots of microphones now perform much better, as I rewrote the mic simulation code in pure SIMD using all the tricks I learned with other entity types.

Now that performance is looking really good, I'm really happy that I've had the opportunity to work on the audio quality again. There's more performance work I can do, and I will at some point, but for now I am going to prioritize making the plugin sound better, by improving the existing physics simulation and by adding more audio features.

Master Limiter

One big thing in this release is that I replaced the master limiter, which could cause slight crackling on some presets and, in general, flattened out the sound in an unpleasant way.

The limiter has a bit of history. Originally there was no limiter, no circuit breaker, and no automatic physics explosion detection. So when the physics system exploded due to crazy parameters, Anukari could make incredibly loud chaotic sounds. My wife and I referred to this as "Evan opened another gate to Hell in his office."

My first solution was the circuit breaker, which monitors the master RMS level and automatically pauses the simulation if a configurable limit is exceeded. This is really helpful when building presets, as it freezes the simulation before things get too chaotic, which allows you to undo whatever change you made that caused things to go haywire, and then go about your work.
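The core of a circuit breaker like this is very little code. A minimal sketch, assuming an exponentially-smoothed RMS estimate (the real implementation surely differs in its details):

```cpp
#include <cmath>

class CircuitBreaker {
 public:
  explicit CircuitBreaker(float rms_limit) : limit_(rms_limit) {}

  // Feed one audio block; returns true if the simulation should pause.
  bool Process(const float* samples, int count) {
    for (int i = 0; i < count; ++i) {
      mean_square_ = kAlpha * mean_square_ +
                     (1.0f - kAlpha) * samples[i] * samples[i];
    }
    return std::sqrt(mean_square_) > limit_;
  }

 private:
  static constexpr float kAlpha = 0.999f;  // smoothing factor (illustrative)
  float mean_square_ = 0.0f;
  const float limit_;
};
```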

Despite the circuit breaker, it was still possible to make really loud noises by accident. For example, it's possible to create an instrument that generates a loud sound just below the circuit breaker's trip threshold. And sometimes you don't want the circuit breaker on at all: while performing, you probably don't want it to automatically pause the simulation.

So I added the master limiter, using the basic JUCE class as I expected it to be temporary. This seemed to work fine, guaranteeing that nobody's ears were melted by gateways to Hell.

Later when I added voice instancing, the physics explosion problem became more of an issue. Due to the way that Anukari uses time dilation to create higher pitches, every instrument will ultimately have a highest note that it can play without exploding, because the physics time step gets too large. So if you play a scale up the keyboard, you'll eventually hit a note that can't be simulated. The circuit breaker could catch this, but that's an awful user experience, since the whole simulation is paused.

Here I added automatic per-voice physics explosion detection. The most reliable signal I found was to monitor the maximum squared velocity of any object: if it exceeds a given threshold, that voice instance is automatically returned to its resting state. So if you play a note that's too high, it just won't do anything, or at worst you might get a light click and then silence. The higher notes in an unsupported range simply don't make sound, and everything else keeps working.
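The detection itself is cheap; conceptually it's just a max-reduce over the voice's velocities (hypothetical names, and the actual threshold logic may differ):

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };

// Has any mass in this voice exceeded the squared-velocity threshold?
bool VoiceExploded(const std::vector<Vec3>& velocities, float max_speed_sq) {
  float worst = 0.0f;
  for (const Vec3& v : velocities) {
    worst = std::max(worst, v.x * v.x + v.y * v.y + v.z * v.z);
  }
  return worst > max_speed_sq;
}

// In the simulation loop, per voice per block (pseudocode):
//   if (VoiceExploded(voice.velocities, kExplosionThreshold))
//     ResetVoiceToRestState(voice);
```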

I should also mention that at some point after I added the master limiter, I added compression for the microphones. This also massively reduced the possibility of producing gates to Hell, as even if they happen, the compressors will likely reduce the gain substantially and it won't be so bad.

Getting back to the master limiter, for a while I had noticed some very light crackling that I couldn't explain on some presets, such as SFX/Broken Reactor. It only happened with several voices playing loudly, but it was audible. Originally I assumed it was a problem with my compressor implementation, but I disabled the compressors and it still crackled. Ultimately I just kept disabling features until the crackle went away, and lo and behold, it was the JUCE Limiter class that was causing crackles.

Of course when I looked at the limiter code, I found a comment I wrote a year or two ago saying that the limiter crackled when the limit was set above 0 dBFS. I guess I thought I had fixed this by clamping the limit to a maximum of 0 dBFS, but I hadn't listened hard enough to realize that artifacts were possible below that as well.

The funny thing was: with the limiter disabled, some presets sounded way better. Not due to the absence of artifacts, since those were limited to a few weird presets, but because the dynamic range was much higher, which is one of the things I've always enjoyed about the sounds Anukari can make. Especially with percussive or metallic sounds, it's so important to have a lot of dynamic range.

JUCE's Limiter class runs two compressors with fixed parameters in series, followed by a hard clipper, with adjustable threshold and release parameters. It turns out that it shapes the sound pretty significantly even when the signal is well below the hard limit.

Given that JUCE's Limiter sounded really bad for my use case, in addition to the crackling, I decided not to spend any time trying to fix it. I got rid of any kind of shaping limiter entirely, and instead went with a simple hard limit at +6 dBFS. Okay, not entirely hard, there's a polynomial taper, but it's pretty hard. I chose this threshold because it leaves enough headroom that clipping is easy to avoid, while still protecting your eardrums if the system goes haywire.
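A hard limit with a polynomial taper can be as simple as a quadratic section that bends the transfer curve into the ceiling with no sharp corner. Here's an illustrative sketch (the knee position and exact curve are assumptions, not Anukari's actual values):

```cpp
#include <cmath>

float HardLimit(float x) {
  constexpr float kCeiling = 2.0f;  // +6 dBFS as linear amplitude
  constexpr float kKnee = 1.5f;     // taper starts here (illustrative)
  constexpr float kDepth = kCeiling - kKnee;
  const float mag = std::fabs(x);
  const float sign = (x < 0.0f) ? -1.0f : 1.0f;
  if (mag <= kKnee) return x;  // untouched below the knee
  if (mag >= kKnee + 2.0f * kDepth) return sign * kCeiling;  // hard ceiling
  // Quadratic taper: slope 1 at the knee, slope 0 at the ceiling.
  const float t = mag - kKnee;
  return sign * (kKnee + t - (t * t) / (4.0f * kDepth));
}
```

Since the signal is untouched below the knee, the dynamic range is preserved; the taper only smooths the approach to the ceiling.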

Voila, no more crackling, and way more dynamic range. This was a huge improvement.

Preset LUFS

After getting rid of the master limiter, I ran into a big issue, which is that many of the presets were much louder. In other words, they were relying on the master limiter to control their loudness. No wonder the dynamic range was squashed!

This meant that I had to go through and re-level all of the 200+ factory presets. This is something I wanted to do for a long time; the presets I made and the ones Jason made had pretty different loudness, and especially the ones I made were kind of all over the place.

To get this right, I installed the Youlean Loudness Meter 2 plugin to measure Anukari's integrated LUFS. This gave me an objective loudness metric. I targeted -15.0 LUFS for each preset under "typical" playing circumstances. The "typical" playing is a bit arbitrary, but I wrote some MIDI clips that I felt were reasonable for various kinds of presets. Big 4-note chords for pads and mallet instruments, fast lines for melodic instruments, single repeated notes for percussion, stuff like that.

While the LUFS metric was incredibly helpful, especially given how much ear fatigue I built up after many hours of leveling presets, I still relied on my ear to make the final judgement. Especially for instruments with very short note duration, integrated LUFS was not a great metric, and I was looking more at instantaneous LUFS and also simply listening.

It ended up taking two full passes over the presets to get the levels to a point where I was happy with them. But it was really worthwhile! Now you can cycle through the presets quickly, playing a couple notes on each one, and the volume level is far more consistent than before. You never have a preset jump out being twice as loud as the previous one. It feels much more professional.

The presets in general ended up being a bit quieter than before, so I also added a master output level knob. This should help especially in the standalone app when you want all presets to be a bit louder, and don't want to have to fiddle with the per-preset gain.

In addition, because I spent a lot of time cycling through presets, I made it so that when changing presets there's a very brief fade out/in. It wasn't a big deal, but if a preset was making noise when you cycled to the next one, there was a definite click. Now there's some softening to avoid any click. And I added this click-suppression in a couple other places, such as when the simulation is paused. It's a small thing but really feels good.

No More Ringing at Rest

Another issue that had long plagued Anukari was that some instruments would make a weird ringing sound when they were at rest. Basically, there was a digital noise floor. For most instruments, this was only audible if you cranked up the gain. But for instruments with extremely stiff springs, or lots of microphones, it was very audible. The worst offender was Mallet-Metallic/4 Ding Chromatic. It is one of my favorite presets, but it was really noisy.

Over the years I made several attempts to fix this, each time failing. I ran quite a few experiments on different formulations for the damping equations, since the ringing indicated that the system was somehow retaining energy. I did reduce the noise floor a bit with some very subtle changes to the damping integration, but never could get it to go away entirely.

For performance reasons Anukari uses single-precision (32-bit) floating point arithmetic for all the physics calculations. I always wondered whether using double-precision (64-bit) would help, but back in the GPU days this was not really an option, because many GPU implementations do not support doubles, and the ones that do are not necessarily very fast. In OpenCL, double support is optional and mostly not offered.

But a deeper problem with doubles on the GPU was that the physics state had to be stored in threadgroup memory, which is extremely limited. Doubling the size of the shared physics state structure would cut the number of entities that could be simulated in half, making many presets unusable.

Anyway, the new CPU physics implementation does not have the limitation of storing everything in the tiny GPU threadgroup memory. It's true that doubles still use twice as much memory as floats, which may hurt performance by moving more data, and of course the SIMD operations have half the width of the float versions. But I figured... why not give it a shot?

I hacked together the worst AI slop prototype of double support, being careful to use double precision for only the absolute minimal set of physics operations that might affect the ringing issue, and voila, the ringing was completely gone. It was always simply due to the lack of precision in 32-bit floats. This makes a lot of sense: with stiff enough springs and high enough gain, the closest position that a 32-bit float can represent to the true lowest-energy state may contain enough error to matter. At each step, a small force would be calculated to push things toward equilibrium, but the system would just orbit around equilibrium at the limit of the available floating-point precision. (Of course 64-bit doubles behave the same way, but the error is way, way too small to be audible even with extremely high gain.)
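The effect is easy to demonstrate. In this toy example, adjacent 32-bit floats near a coordinate of 1000 are about 6e-5 apart, so a smaller corrective displacement simply vanishes, while in double precision it survives:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  float pos = 1000.0f;
  float correction = 1e-6f;  // a tiny corrective displacement
  std::printf("%.9g\n", (pos + correction) - pos);  // prints 0: update lost
  std::printf("%.9g\n", std::nextafter(pos, 2000.0f) - pos);  // ~6.1e-05
  double dpos = 1000.0;
  std::printf("%.9g\n", (dpos + 1e-6) - dpos);  // ~1e-06: update preserved
}
```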

Using doubles is slower than floats, for sure. But there are a couple things that made this change possible.

First, the slowest part of the simulation is the random-access lookups to read the positions of the masses that springs are connected to, in order to calculate the spring forces. These lookups (and force writes) did not get appreciably slower! This may be surprising, but the reason is pretty simple. All the processors that Anukari runs on use 64-byte cache lines. The position of a mass is a three-dimensional vector, which is really four dimensions for alignment reasons. So for 32-bit floats that's 16 bytes, and for 64-bit doubles it's 32 bytes. Notice that at both precisions, the vector fits within a single cache line. Because the lookups and writes are random access, and the memory being accessed is often larger than L1 cache, full cache lines are being read and written in both cases, and the size of the float makes no difference.
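In struct terms, the layout described above looks like this:

```cpp
// The fourth lane is alignment padding.
struct alignas(16) Float3  { float  x, y, z, pad; };  // 16 bytes
struct alignas(32) Double3 { double x, y, z, pad; };  // 32 bytes
static_assert(sizeof(Float3) == 16, "a quarter of a 64-byte cache line");
static_assert(sizeof(Double3) == 32, "half of a 64-byte cache line");
```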

Second, while the SIMD computation bandwidth is cut in half for the 64-bit operations, in many cases the latency of the computations is eclipsed by the memory latency. The code is written carefully to ensure that computation and memory access are pipelined to the maximum extent. So in the situations where the memory access was the dominating factor, adding extra computational instructions didn't actually increase the runtime.

That said, even with a lot of optimization and luck, 64-bit floats are slower, so the third factor is that I did a bunch of small optimizations to other parts of the code to speed it up enough to pay back the runtime penalty of the 64-bit operations. In the end I was able to make it net neutral in terms of speed, with the huge audio quality improvement from doubles.

I am extremely pleased that this is no longer an issue!

Anukari on the CPU (part 3: in retrospect)

Captain's Log: Stardate 79350.3

In part 1 of this series of posts, I explained the shortcomings in Anukari’s GPU implementation, and in part 2, I covered how the new CPU implementation manages to outperform the GPU version.

Prior to this change, I had invested a significant amount of time and effort on the GPU implementation (as the back catalogue of this devlog demonstrates). Obviously I would have preferred to find the improved CPU solution earlier, before putting so much effort into the GPU solution. In this 3rd and final installment I will reflect upon what went well, and what I could have done better.

The (unachieved) goal of 50,000 physics objects

As I discussed in part 1, my first prototype used the CPU. It was with this prototype that I “proved” to myself that to simulate the target number of physics objects that I wanted (50,000), the CPU was not an option.

I still think that this is true. Actually I’m even more confident, having now written a heavily-optimized CPU simulator. But where I went wrong was in my assumption about it being important to simulate 50,000 objects.

In my initial testing, I found that larger physics systems produced some really interesting results. I can’t quite recall how big of a system I had tested, but surely it was only a few hundred masses, maybe a thousand objects counting the springs.

From this I got excited and extrapolated out a bit too far. If 1,000 objects sound cool, 50,000 must sound even cooler, right?

Now, given that I'm currently writing about the avoidance of unfounded assumptions, I'd better not rule out the idea that 50,000 objects is really great. Maybe it is! But I never proved that. What I did know was that simulating 1,000 objects had excellent results.

Maybe I could have saved myself a lot of grief if I had questioned this assumption before building out all the GPU code.

On the other hand, I do want to give myself credit for being ambitious by going for the 50,000 goal. So I am not sure that I’ll criticize myself too much for going ahead with the GPU implementation. Overall I’d rather err on the side of being too ambitious than the opposite.

I think my real self-critique is not that I embarked on the GPU road, but that I stayed on it too long. When I first realized that I was not going to achieve my goal of 50,000 objects, that would have been a great time to take a step back and reevaluate whether the GPU was necessary.

At Google we often loosely used the rule of thumb that a 10x increase in system workload was about the time when you had to start thinking about redesigning a system (rather than incrementally optimizing it).

So changing my design goal by a factor of 50x should have been an obvious signal that maybe a different, simpler design was worth evaluating!

That said, at this point in the project I was not yet aware of how much of a headache using the GPU was going to be. So maybe I would have charged forth with the GPU anyway.

With these reflections out of the way, the reality of course is that I did eventually reevaluate my options and found a better solution. I do wish I had done this earlier, but I am proud of how quickly I changed horses once I began to realize that the one I was on was not optimal. I did not allow myself to succumb to the sunk cost fallacy, which is something I find personally challenging. So I’m happy about that.

And overall, of course, I’m pleased to have found a solution that works significantly better for my users, which is what’s really important to me.

On the drawbacks of using the CPU

The new CPU simulation is way faster and simpler than the GPU implementation, but that doesn’t mean there aren’t disadvantages.

One drawback to the change is that it might make me look foolish. I’m not too worried about this, though. I’d rather eat some crow and have a plugin that works really well than the alternative.

Another drawback is that the GPU support was potentially good from a marketing perspective. Using the GPU is unusual, and stands out as something interesting, like “alien technology.” Again, I am not going to lose any sleep over this. I didn’t write GPU code to make the plugin marketable. I used the GPU because at the time I thought it was the most effective way to make the plugin work really well.

The engineering drawbacks are more interesting. One advantage of the GPU is that it is mostly untapped processing power (for audio). So a user with a super CPU-hungry audio production setup might be able to run a GPU-based plugin, taking advantage of that extra processing capacity that would normally go unused.

This is in fact a great reason for plugin manufacturers to exploit the GPU. But of course this only makes sense if a plugin can be made to work really well on the GPU. I do not doubt that this is true for other plugins, but as I have written about at great length, for now Anukari runs much better on the CPU.

The way I see it, a plugin that works great but uses more CPU resources is always better than a plugin that works poorly and glitches but consumes less CPU. For me, Anukari’s usability trumps everything else. If users can’t reliably run interesting presets without glitching, nothing else matters.

That said, I do feel sad about anyone whom I've disappointed with this change. It's completely understandable that someone may have been excited by the GPU support (I sure was!), for any number of reasons. I do not enjoy letting anyone down.

But it’s not like I didn’t try to make the GPU support work! I invested many hundreds, if not thousands of hours into that approach. This devlog attests to that.

I also have to think about those users whom I’d disappoint if I didn’t improve Anukari’s performance. Many people’s machines could not run Anukari at all, and now they can. Many users who had glitching in Logic Pro before now find that the plugin runs flawlessly. VJs that run GPU-intensive visualization software can now run Anukari. The #1 complaint I’ve had since the start of the Beta is performance, and that is way less of an issue now.

Benefits of using the CPU aside from performance

Recall that for GPU support I was maintaining three backends: CUDA, Metal, and OpenCL. (Arguably I should also have supported AMD’s ROCm, but it’s a huge mess. That’s a whole other story.)

Now I’m just maintaining the one CPU backend. Granted, there is separate hand-written AVX and NEON code to support the two platforms, but this is limited to the hottest loops, and is not a big deal. AVX and NEON are far more similar to one another than the various GPU hardware and APIs are.

Overall the new simulator is vastly simpler than the old one. There’s much less indirection, since instead of orchestrating GPU kernels (including setting up buffers, managing data copies, etc) the code simply does the work directly.

This means that adding new features to the physics simulation is going to be substantially easier. This is exciting, because I have a gigantic spreadsheet of new features I’d like to add.

There’s also the fact that I just got all the future opportunity cost from dealing with GPU issues back. In other words, instead of spending hundreds more hours dealing with GPU compatibility issues, those hours will now go into new physics features, UX improvements, etc.

Also from a reliability perspective, the CPU code is far easier to test, profile, and debug. I can throw in log statements wherever I like. I can use the same instrumentation, profiler, and debugger as for the rest of the app. Unit tests are much simpler to write without having to call into the GPU code. And manual testing prior to releases no longer requires going through quite so big of a stack of laptops.

Keep in mind that I was previously spending time on horrific GPU issues like this AMD driver bug, which was specific to just one AMD graphics chip. Now I have that time back.

I am really looking forward to making Anukari an even more interesting and useful sound design tool now that my time has been freed up to do so.

How tests helped make the GPU to CPU change possible

One thing that helped a lot in rewriting the physics simulation was the existing body of tests. Since the prior GPU implementation supported multiple backends, all the tests were already backend-agnostic. These tests are extremely thorough, so while writing the CPU backend, I was basically doing test-driven development.

I can’t imagine how much harder this would have been without the tests. They caught way more issues than I can count. And once I had them all passing, it gave me enormous confidence that the new simulation was working correctly.

Far and away the most useful tests were my golden tests. At the moment I have close to 150 tests that each load an Anukari preset, feed in some MIDI/audio data, and generate a short audio clip. This clip is fingerprinted and compared to a “golden” clip. Most of the golden tests isolate individual physics features to prove that they work, but there are a few tests that run large, complex presets to prove that interactions across features work as well.
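To give a sense of the shape of these tests, here's a heavily simplified sketch with invented helper names (written googletest-style; the real harness is more involved):

```cpp
#include <gtest/gtest.h>
#include <string>

// Hypothetical stand-ins for the real test helpers.
struct AudioClip { /* rendered samples */ };
AudioClip RenderPreset(const std::string& preset_path,
                       const std::string& midi_path, double seconds);
std::string Fingerprint(const AudioClip& clip);
std::string ReadGoldenFingerprint(const std::string& path);

TEST(GoldenTest, SpringDamping) {
  AudioClip clip = RenderPreset("testdata/spring_damping.anukari",
                                "testdata/single_note.mid", /*seconds=*/2.0);
  EXPECT_EQ(Fingerprint(clip),
            ReadGoldenFingerprint("goldens/spring_damping.fp"));
}
```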

Once the golden tests were passing, I was completely sure that I had implemented all the physics features correctly.

Correct audio output is obviously important, but stability is just as important. The CPU code is heavily-optimized and does a lot of low-level memory access stuff where it’s easy to screw up and crash by accessing the wrong memory.

For this, I mostly relied on fuzz testing and the chaos monkey.

For the fuzz tests, I have a testing API that can generate random mutations to an Anukari preset. The API is, in principle, capable of generating any Anukari preset. When I’m adding a new physics parameter, one of the very first things on my checklist is to update the fuzz API to be aware of it.

I use this API in a number of fuzz tests. The basic pattern is a loop that makes a random preset mutation, and then checks an invariant. Doing this tens of thousands of times tends to catch all kinds of bugs, especially because the random presets it generates are just awful, twisted messes. When randomly setting parameter values, the fuzz API has a slight preference for picking values at the extremes of what Anukari supports, so it ends up generating super weird presets that a human would never create.
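Schematically, one fuzz loop looks like this (invented names again):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-ins for the fuzzing API described above.
struct Preset { /* entities, springs, parameters, ... */ };
struct AudioClip { std::vector<float> samples; };
Preset MakeEmptyPreset();
void ApplyRandomMutation(Preset* preset);  // biased toward extreme values
AudioClip Simulate(const Preset& preset, double seconds);

void FuzzSimulator(int iterations) {
  Preset preset = MakeEmptyPreset();
  for (int i = 0; i < iterations; ++i) {
    ApplyRandomMutation(&preset);  // one random preset mutation
    AudioClip clip = Simulate(preset, /*seconds=*/0.1);
    for (float sample : clip.samples) {
      assert(std::isfinite(sample));  // invariant: output stays finite
    }
  }
}
```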

One of the fuzz tests generates presets in this way and then simulates them, verifying that the simulation’s output is reasonable (i.e. no NaN/inf, etc). And of course these tests verify that no inputs can cause the simulator to crash. Running the fuzz test loop for a long enough time with no crashes gave me huge confidence that I had worked out all of the crash bugs in the new simulator.

The chaos monkey is sort of the last line of defense. It opens the standalone Anukari app and generates random mouse and keyboard inputs. This has mostly been useful for catching GUI bugs and crashes, but it also verifies that the simulator does not crash in real-world conditions.

It’s hard to say, but I think that without all these tests, my CPU rewrite attempt may have simply failed. At the very least it would have taken 5x as long. The golden tests were especially important. I can’t imagine manually setting up presets and checking that 100+ individual physics features worked correctly, over and over.

The value of generative AI

I wrote above about how I wish I would have tried a SIMD-optimized CPU solution for Anukari’s physics simulation earlier in the project.

One subtle issue with that idea, though, is that 2 years ago I wasn’t really using GenAI for programming. I had played with it a bit, of course, but it wasn’t a core part of my workflow.

I’m not sure how successful my SIMD attempts would have been without GenAI. I’m no stranger to assembly code, but I am far from fluent. We might say that I speak a little “broken assembly.” The main issue is that I write assembly very slowly, as I have to frequently reference the documentation, and I’m not always aware of all the instructions that are available.

So if I had attempted the SIMD approach a couple years ago, it would have gone way more slowly. I could not have experimented with multiple approaches without a lot more time.

Starting about a year ago, GenAI became a core part of my programming workflow. I don’t find that “vibe coding” works for me, but GenAI is an amazing research assistant.

By far the highest leverage I’ve found with GenAI is when I am learning a new API or technology. Having the LLM spit out some example code for the problem I’m solving is incredibly valuable. I always end up rewriting the code the way I want it, but it saves a ridiculous amount of time in terms of figuring out what API functions I need to call, what headers those come from, the data types involved, etc.

For writing the SIMD code, GenAI was a massive superpower. Instead of constantly fumbling around in the documentation to figure out what instructions to use, I asked GenAI and it immediately pointed me in the right direction.

I mostly wrote AVX code first, and then ported it to NEON. This is probably the closest I came to vibe coding: in many instances, when I asked GenAI to translate AVX to NEON, it produced perfect working code on the first try, at least for simple snippets.
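To give a feel for how mechanical the translation can be, here's a trivial (non-Anukari) example of the same loop in both instruction sets:

```cpp
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>
// AVX: 8 floats per iteration (remainder handling omitted for brevity).
void Add(const float* a, const float* b, float* out, size_t n) {
  for (size_t i = 0; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
  }
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
// NEON: identical structure, 4 floats per iteration.
void Add(const float* a, const float* b, float* out, size_t n) {
  for (size_t i = 0; i + 4 <= n; i += 4) {
    vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
  }
}
#endif
```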

Don’t get me wrong, GenAI also produced a ton of absolute garbage code that wasn’t even close to working. But that’s not really a problem for me, since I don’t use it for vibe coding. I just laugh at the silly answer, pick out the parts that are useful to me, and keep going.

Not only did GenAI save me a lot of time in scouring documentation to find what I needed, but also I ended up running experiments I otherwise would not have. This is one of those situations where a quantitative effect (writing code faster) turns into a qualitative effect (feeling empowered to run more interesting experiments).

The “superpower” aspect was really the fact that GenAI emboldened me. Instead of worrying about whether it was worth the time to try an optimization idea, I just went ahead with it, knowing that my research assistant would have my back.

Appendix: Further challenges with using the GPU for audio

One last detail to mention is that running Anukari's simulation on the GPU also incurs overhead from scheduling GPU kernels, waiting for their results, etc. While it's nice to avoid this overhead, on modern machines it's actually extremely tiny. Especially with Apple's Metal 4 API and unified memory, this overhead is not very consequential, except at the very smallest audio buffer sizes.

However even if the fast-path overhead of GPU scheduling is good, there are still problems. Notably, GPU scheduling is not preemptive. Once a workload is scheduled, it will run to completion. This means that if another process (say, VJ visualization software) is running a heavy GPU workload, it can interfere with Anukari’s kernel scheduling, leading to audio glitches. To some extent this can be ameliorated using persistent kernels, but all kernels have deadlines and thus need to be rescheduled periodically, opening the door to glitches.

The fact that GPU tasks are not preemptive is a fundamental issue in the world of realtime audio. It’s not unsolvable (e.g. permanent task persistence could be a solution), but it is tricky. It’s also worth noting that the macOS Core Audio approach of using workgroups for audio threads does not apply to the GPU, so the OS has no way of knowing that a GPU task is realtime. This is an OS-level feature that Apple could add to make GPUs more audio-friendly. But given how little the GPU is used for audio, it seems very unlikely that Apple will invest in this.

I don’t think that true preemption on the GPU is something that would ever be realistic. Between the huge amount of register memory and threadgroup memory that GPUs have, switching a threadgroup execution unit between workloads would be way too expensive. We’re talking about copying half a megabyte of state to main memory for a context switch.

This means that even if GPU APIs supported kernel priority (which they mostly don’t), it still could not solve the audio glitch issue, because if the GPU was already fully utilized with running tasks, even a realtime-priority task could not preempt them. The priority would only mean that the task would be the first to start once the existing workload was finished.

Probably the only true solution for flawless audio on the GPU would be for the OS to provide support for reserved GPU cores, alongside realtime priority for low-latency signaling between the CPU and GPU. This would allow realtime applications to run persistent kernels on the reserved cores without any issues with other workloads “sneaking in.”

I believe that Apple might be able to figure out a solution to this and pull it off. NVIDIA also. Definitely not AMD, though, their drivers are hopeless to begin with, even for the simplest use cases. Also I would not expect Intel’s integrated graphics to do this. So even in a perfect world where GPU vendors decide to make audio support a first-class priority, in my opinion its usefulness will be limited to the two top vendors, and users with Radeon or Intel graphics will not benefit.

(Maybe at some point I will write more about these challenges, and the solutions I came up with, but for now I am mostly happy to just move to the CPU and forget all these complex headaches.)
