RAM savings in the 0.9.26 release
Captain's Log: Stardate 79645.8
I finally got the 0.9.26 release out yesterday (download, release notes), and I'm really happy with everything I managed to get into it.
Anukari's RAM use had been bothering me for some time. Funnily enough, I became especially aware of it after I moved the audio processing from the GPU to the CPU.
Delay Line Buffer Optimization
Anukari supports delay lines of up to 1 second in duration. Delay lines run from the virtual microphones to audio input exciters. So each mic needs to store 1 second of audio sample history. Anukari supports up to 50 mics, and to complicate things further, in polyphonic mode, there may be up to 16 instances of each mic. So storing 1 second of 32-bit precision audio data at 48 kHz for 16 instances of 50 mics requires ~154 MB of RAM.
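To make the arithmetic concrete, here's a back-of-the-envelope sizing in C++. The constant names are mine for illustration, not Anukari's:

```cpp
#include <cstddef>

// Worst-case delay buffer sizing, using the numbers from the text.
constexpr std::size_t kSampleRate  = 48000;  // samples per second
constexpr std::size_t kSampleBytes = 4;      // 32-bit float samples
constexpr std::size_t kMaxMics     = 50;
constexpr std::size_t kMaxVoices   = 16;     // polyphonic instances per mic
constexpr std::size_t kHistorySecs = 1;      // max delay line duration

constexpr std::size_t kTotalBytes =
    kSampleRate * kSampleBytes * kHistorySecs * kMaxMics * kMaxVoices;

static_assert(kTotalBytes == 153'600'000, "~154 MB of delay history");
```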
Since the physics simulation runs on a realtime audio thread, dynamic memory allocation there is highly undesirable. Generally speaking, memory allocation is a blocking operation: it often acquires a mutex, and even when it doesn't, it can take a long time if memory is fragmented. So for any memory blocks used by the audio thread, Anukari allocates buffers of the maximum required size at startup. This eliminates the need for dynamic allocation at runtime, but insofar as those buffers go unused, it wastes memory.
When Anukari's physics simulation ran on the GPU, the 154 MB delay buffer was allocated in GPU memory. This was actually kind of handy, because most audio applications don't use much GPU memory, so in some ways this memory was "free." Of course that doesn't apply to Apple's unified memory architecture. But for many users, this RAM usage was not really visible.
When I moved the simulation to the CPU, 154 MB of RAM usage that was sometimes hidden before became quite visible. Especially for users running many instances of Anukari in parallel, this can add up pretty quickly.
For the 0.9.26 release I decided it was time to do something about this. Most Anukari presets have fewer than 50 mics, don't use all 16 voices of polyphony, and don't have delay lines attached to every single mic. In fact many presets don't even have a single delay line, so this 154 MB of RAM is completely wasted.
Now, could I get away with just dynamically allocating this memory in the audio thread? Probably. The allocation would only happen when the user adds a delay line to an existing preset, increases the polyphony setting, etc. If a small audio glitch happened at that moment it probably wouldn't be super noticeable.
But I've taken a very hard line on this issue. My goal is for Anukari's audio to be absolutely rock solid with no compromises. I'm not claiming I've fully accomplished that, but it's the north star. So that means no allocation on the audio thread, no taking the easy way out.
The solution I came up with is as follows:
- There is a background allocator thread.
- Delay buffer pointers are atomics.
- The audio thread can describe the allocations it needs in a struct that gets pushed into a lock-free SPSC queue that the allocator thread reads.
- The allocator thread reads each new allocation spec and, if needed, allocates the requested memory and updates the atomic delay buffer pointers.
This allows the memory allocation to happen in a non-realtime thread without ever blocking the audio thread.
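The request path above can be sketched roughly like this, assuming a fixed-capacity SPSC ring buffer and one atomic pointer per delay buffer. All names are illustrative, not Anukari's actual code:

```cpp
#include <atomic>
#include <cstddef>

// What the audio thread asks for when it needs a (re)allocation.
struct AllocRequest {
  int buffer_index;     // which delay buffer to (re)allocate
  std::size_t samples;  // required capacity
};

// Minimal lock-free SPSC ring: one producer (audio), one consumer (allocator).
template <std::size_t Capacity>
class SpscQueue {
 public:
  bool push(const AllocRequest& r) {  // audio thread only
    auto head = head_.load(std::memory_order_relaxed);
    auto next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire)) return false;  // full
    slots_[head] = r;
    head_.store(next, std::memory_order_release);
    return true;
  }
  bool pop(AllocRequest& out) {  // allocator thread only
    auto tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
    out = slots_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return true;
  }
 private:
  AllocRequest slots_[Capacity];
  std::atomic<std::size_t> head_{0}, tail_{0};
};

// One atomic pointer per delay buffer. The audio thread loads it each block
// and treats nullptr as "not allocated yet" (silence from that delay line).
std::atomic<float*> g_delay_buffers[50];

// Allocator thread: drain requests, allocate off the audio thread, publish.
void ServiceRequests(SpscQueue<64>& q) {
  AllocRequest r;
  while (q.pop(r)) {
    float* fresh = new float[r.samples]();  // zero-initialized history
    g_delay_buffers[r.buffer_index].store(fresh, std::memory_order_release);
    // Any old pointer would be retired via the epoch scheme described below
    // in the text, never freed here while the audio thread might hold it.
  }
}
```

The audio thread only ever calls `push` and loads the atomic pointers; the allocator thread only ever calls `pop` and stores them, so neither side can block the other.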
There's an incredibly important wrinkle, though: when a buffer is no longer needed, we have to free it, or we'll leak memory. This is subtle, because we must guarantee that the audio thread will never again access a given buffer before we free it; otherwise we'd have invalid memory accesses and probably crashes.
The trick I used to handle deallocation is as follows:
- When the allocator thread reallocates or deallocates a buffer, it updates the atomic pointer, and then pushes the old pointer to a local "to-free" queue, in a struct that also contains an epoch integer.
- The audio thread and allocator thread share an atomic epoch counter, which the audio thread increments each time it finishes processing an audio block.
- The audio thread is allowed to safely access stale delay buffer pointers until it increments the epoch value.
- The allocator thread periodically checks the next item in the to-free queue. If the item's epoch is less than the atomic epoch counter, it is safe to free it.
- Thus, as long as the audio thread never caches a buffer pointer past the point when it increments the epoch counter, it will never access freed memory.
This may sound a bit complicated, but it's all less than a hundred lines of code. I am very thankful that I have a bunch of thorough fuzz tests to give me confidence that it works. :)
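A minimal sketch of the epoch scheme, assuming a single allocator thread that owns the to-free queue. Again, all names are illustrative:

```cpp
#include <atomic>
#include <cstdint>
#include <deque>

// Shared epoch counter; the audio thread increments it after each block.
std::atomic<uint64_t> g_epoch{0};

struct Retired {
  float* ptr;
  uint64_t epoch;  // g_epoch value at the moment the pointer was retired
};

// Owned exclusively by the allocator thread, so no locking is needed.
std::deque<Retired> g_to_free;

// Audio thread, at the end of each block. After this store, it must not
// touch any delay buffer pointer it cached during the block.
void EndAudioBlock() {
  g_epoch.fetch_add(1, std::memory_order_release);
}

// Allocator thread: retire an old pointer after publishing its replacement.
void Retire(float* old_ptr) {
  g_to_free.push_back({old_ptr, g_epoch.load(std::memory_order_acquire)});
}

// Allocator thread, periodically: anything retired in an earlier epoch can
// no longer be referenced by the audio thread, so it is safe to free.
void DrainToFree() {
  uint64_t now = g_epoch.load(std::memory_order_acquire);
  while (!g_to_free.empty() && g_to_free.front().epoch < now) {
    delete[] g_to_free.front().ptr;
    g_to_free.pop_front();
  }
}
```

The key invariant is exactly the one described above: a pointer is only freed once the epoch counter has advanced past the epoch in which it was retired.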
Note that this solution can't immediately be generalized to all memory allocations. Because it is asynchronous, and the allocator thread is not realtime, there will be a period where the audio thread has requested memory but does not yet have it. So this technique only applies to cases where the audio thread can operate without the memory for that duration.
This is fine for delay lines. It just means that after the user adds a new delay line, it may take a moment before the delay line's audio signal begins arriving at its destination. In practice, this duration is imperceptibly small, and doing things this way guarantees realtime safety so there will never be an audio glitch when adding a delay line.
Other RAM Savings
The delay line buffer allocation was the most interesting/difficult RAM optimization, but I did many other optimizations that were a lot simpler.
First, I tightened up the static size for several buffers and queues. There were several instances where I had sized a buffer way too conservatively, and was able to save a few tens of megabytes of RAM this way. Again, here I was relying on my comprehensive fuzz tests (and unit tests) to make sure that I didn't break anything by resizing these buffers.
The largest and most embarrassing savings of all, though, came from evicting assets from Anukari's 3D renderer cache. This was essentially trivial to do and obviously safe, so it's a bit of a facepalm that I hadn't done it before.
A bit of background: I have spent many hours obsessing over how quickly Anukari loads, both from a cold start as well as when the GUI is closed and re-opened when running as a DAW plugin.
One bottleneck in the cold start was the parsing and initialization of the 3D assets. This includes decoding the files, decompressing them when necessary, building collision data structures, stuff like that. I did some testing and found that parallelizing this work massively reduced the GUI's cold start latency.
The way I parallelized this is as follows:
- The GUI enumerates all the assets it needs.
- The asset cache then spawns a bunch of threads and loads the assets in parallel, writing the final results to a dictionary cache.
- The GUI reads the assets from the cache as they become available.
The memory-wasting problem was that after the GUI finished reading all of the asset data, it never told the asset cache it was done. So the cache just stuck around for the lifetime of the 3D renderer, wasting memory for no reason.
This was an easy fix! The GUI already knew when it was done with the cache, so I simply had to add a line of code to clear it. This saved 100+ MB of memory for simple skyboxes and 3D assets. But for ultra high-resolution skyboxes and more detailed 3D models, it saves way more memory, potentially a few hundred MB. Pretty nice!
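A rough sketch of what such a parallel-loading asset cache might look like, with a `Clear()` standing in for that one-line fix. The mutex-guarded map and every name here are my assumptions for illustration, not Anukari's implementation:

```cpp
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Asset { std::string decoded; };  // stand-in for parsed 3D data

Asset LoadAsset(const std::string& path) {
  // Decoding, decompression, and collision-structure building would go here.
  return Asset{"decoded:" + path};
}

class AssetCache {
 public:
  // Spawn one thread per asset; the expensive load runs outside the lock.
  void LoadAll(const std::vector<std::string>& paths) {
    std::vector<std::thread> workers;
    for (const auto& path : paths) {
      workers.emplace_back([this, path] {
        Asset a = LoadAsset(path);
        std::lock_guard<std::mutex> lock(mu_);
        cache_[path] = std::move(a);
      });
    }
    for (auto& w : workers) w.join();
  }

  // Returns a copy; assumes the asset has been loaded.
  Asset Get(const std::string& path) const {
    std::lock_guard<std::mutex> lock(mu_);
    return cache_.at(path);
  }

  // The fix: the GUI calls this once it has read everything it needs,
  // releasing the cached asset memory instead of holding it forever.
  void Clear() {
    std::lock_guard<std::mutex> lock(mu_);
    cache_.clear();
  }

 private:
  mutable std::mutex mu_;
  std::map<std::string, Asset> cache_;
};
```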
The Audio Units logo and the Audio Units symbol are trademarks of Apple Computer, Inc.
VST is a trademark of Steinberg Media Technologies GmbH, registered in Europe and other countries.