

Working better on some Radeon chips

Captain's Log: Stardate 79013.9

The issue with Radeon

As discussed in a previous post, I've been fighting with Radeon mobile chips, specifically the gfx90c. The problem originally presented with a user who had both an NVIDIA and a Radeon chip, and even though they were using the NVIDIA chip for Anukari, something changed in the 0.9.6 release that caused the Radeon drivers to crash internally (i.e., the drivers did not return an error code; they simply aborted the process).

I'd like to eventually offer official support for Radeon chips. That's still likely a ways off, but at the very least I don't want things crashing. Anukari is extremely careful about how it interacts with the GPU, and when a particular GPU is not usable, it should preferably just pick a different GPU, or at the very least show a helpful message in the GUI explaining the situation.

Unfortunately it was difficult to debug this issue remotely. The user was kind enough to run an instrumented binary that confirmed that Anukari was calling clBuildProgram() with perfectly valid arguments, and it was simply aborting. I really needed to run Anukari under a debugger to learn more.

So I found out what laptop my bug-reporting user had, and ordered an inexpensive used Lenovo Ideapad 5 on eBay. I've had to buy a lot of testing hardware, and I've saved thousands of dollars by buying it all second-hand or refurbished. In this case it did take two attempts, as the first Ideapad 5 I received was super broken. But the second one works just fine.

Investigation

After getting the laptop set up and running Anukari under the MSVC debugger, I immediately saw debug output like this just prior to the driver crash:

LLVM ERROR: Cannot select: 0x1ce8fdea678:
ch = store 0x1ce8fe462a8, 0x1ce8fde8fb8, 0x1ce8fe470b8,
  undef:i32
  0x1ce8fde8fb8: f32,ch = load 0x1ce8fdd7638, 0x1ce8fde8b80,
  undef:i64
    0x1ce8fde8b80: i64 = add 0x1ce8fcbc600, Constant:i64<294>
      0x1ce8fcbc600: i64,ch,glue = LD_64 
        TargetExternalSymbol:i64'arguments', Register:i64 %noreg, 
        TargetConstant:i32<0>, TargetConstant:i32<4>,
        TargetConstant:i32<4>, TargetConstant:i32<8>,
        TargetConstant:i32<34>, TargetConstant:i32<1>, 0x1ce8a26ec90
        0x1ce8fce97f8: i64 = TargetExternalSymbol'arguments'
        0x1ce8fcc3148: i64 = Register %noreg
        0x1ce8fcbc330: i32 = TargetConstant<0>
        0x1ce8fcbbfe8: i32 = TargetConstant<4>
        0x1ce8fcbbfe8: i32 = TargetConstant<4>
        0x1ce8fcbbe08: i32 = TargetConstant<8>
        0x1ce8fcbc768: i32 = TargetConstant<34>
        0x1ce8fcc1f58: i32 = TargetConstant<1>
      0x1ce8fde8b08: i64 = Constant<294>
    0x1ce8fcc2c20: i64 = undef
  0x1ce8fe470b8: i32 = add FrameIndex:i32<30>, Constant:i32<294>
    0x1ce8fdea420: i32 = FrameIndex<30>
    0x1ce8fe47040: i32 = Constant<294>
  0x1ce8fdea498: i32 = undef

First of all, I want to call out AMD on their exceptionally shoddy driver implementation. It's just absurd that they'd allow a compilation error internal to the driver to abort the whole process. Clearly in this case clBuildProgram() should return CL_BUILD_PROGRAM_FAILURE, and the program log (the compiler error text) should be filled with something helpful: at a minimum the raw LLVM output, but preferably something more readable. This is intern-level code, in a Windows kernel driver. Wow.
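
For contrast, here's roughly what the conforming failure path looks like from the host side, using the standard OpenCL API. This is just a sketch; program and device are assumed to come from the usual setup:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static void build_or_report(cl_program program, cl_device_id device) {
    cl_int err = clBuildProgram(program, 1, &device, "", NULL, NULL);
    if (err == CL_BUILD_PROGRAM_FAILURE) {
        /* A conforming driver returns this error and fills in the build
           log, rather than aborting the host process. */
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char* log = malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        fprintf(stderr, "OpenCL build failed:\n%s\n", log);
        free(log);
    }
}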

After reading through this carefully, all I could really make of it was that LLVM was unable to find a machine instruction to read data from this UpdateEntitiesArguments struct in addrspace=7 and write it to memory in addrspace=1. From context, I could guess that addrspace=1 is private (thread) memory, and addrspace=7 is whatever memory the kernel arguments are stored in. I had a harder time understanding why it couldn't find such an instruction. I thought maybe it had to do with an alignment problem, but I wasn't sure.

This struct contains a number of fields, and I couldn't tell from the error which field was the problem. So I used a brute-force approach: I commented out most of the kernel code and added it back in slowly. It compiled fine until I uncommented a line of code like float x = arguments.field[i]. I did some checking to ensure that field was aligned in a sane way, and after confirming that, I came to the conclusion that the gfx90c chip simply does not have an instruction for loading memory from addrspace=7 with a dynamic offset. In other words, the gfx90c appears to lack the ability to address arrays in argument memory with a non-constant offset.
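
To give a feel for the shape of the problem, here's a minimal sketch of the kind of kernel that triggered it. The struct and names are hypothetical stand-ins, not Anukari's actual UpdateEntitiesArguments:

typedef struct {
    float field[8];  /* small fixed-size array passed by value */
    int   count;
} Arguments;

__kernel void update_entities(Arguments arguments, __global float* out) {
    size_t gid = get_global_id(0);
    int i = (int)(gid % (size_t)arguments.count);
    /* A compile-time-constant index like arguments.field[3] compiled
       fine; this dynamic index is what made the gfx90c compiler abort
       instead of emitting a load. */
    float x = arguments.field[i];
    out[gid] = x;
}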

Which, as far as I can tell, means that the gfx90c really doesn't support OpenCL properly. Every other OpenCL implementation I've used can do this, including NVIDIA, Intel Iris, Apple, and even newer Radeon chips like the gfx1036. I don't see anything in the OpenCL specification that would indicate that this is a limitation.

But even assuming that it's somehow within spec for an OpenCL implementation not to support this feature, obviously aborting in the driver is completely unreasonable behavior. Again, this is a really shoddy implementation, and when people ask why Anukari doesn't yet officially support Radeon chips, this is the kind of reason that I point to. The drivers are buggy, and worse, they are inconsistent across the hardware.

The (very simple) workaround

Anyway, I have very good (performance) reasons for storing some small constant-size arrays (with dynamic indexes) in kernel arguments, but those reasons really apply more to the CUDA backend. So I made some simple changes to Anukari to store these small arrays in constant device memory, and the gfx90c implementation now works just fine.
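
Sketched out, the change amounts to passing the array through a __constant pointer instead of inside the by-value argument struct (hypothetical names again):

__kernel void update_entities(__constant float* field, /* was a by-value
                                                          struct member */
                              int count,
                              __global float* out) {
    size_t gid = get_global_id(0);
    int i = (int)(gid % (size_t)count);
    /* Dynamic indexing into __constant memory works on every chip I've
       tested, gfx90c included. */
    out[gid] = field[i];
}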

Given that I recently upgraded my primary workstation to a very new AMD Ryzen CPU, I now have two Radeon test chips: the gfx90c in the Ideapad 5, and the gfx1036 that's built into my Ryzen. The Anukari GPU code appears to work flawlessly on both, though it doesn't perform all that well on either. Next up will be more testing of the Vulkan graphics, which have also been a pain point on Radeon chips in the past.

Multichannel, ASIO, Radeon, and randomization

Captain's Log: Stardate 79000.1

Whoa, it's been way too long since I updated the devlog. Here goes!

2025 MIDI Innovation Awards

Really quickly: Anukari is an entry in the 2025 MIDI Innovation Awards, and I would really appreciate your vote. You can vote on this page by entering your email and then navigating to the Software Prototypes/Non-Commercial Products category and scrolling way down to find Anukari. You have to pick 3 products to vote in that category. (I wish I could link to the vote page directly, but alas, it's not built that way.)

The prize for winning would be a shared booth for Anukari at the NAMM trade show, which would be a big deal for getting the word out.

Multichannel I/O Support

A while back, Joe Williams from CoSTAR LiveLab reached out to me asking if Anukari had multichannel output support. Evidently the UK government is investing in the arts, which, as an American, I find a pretty (literally) foreign concept. One of the labs working on promoting live performance is LiveLab, and they have a big 28-channel Ambisonic dome. Joe saw Anukari and thought it would be cool to create an instrument with 28 mics outputting to those 28 speaker channels.

I'd received several requests for multichannel I/O, but hadn't yet prioritized the work. The LiveLab use case is really cool, though, and Anukari will be featured in a public exhibit later this month, so I decided to prioritize the multichannel work.

Anukari now supports up to 50 input and 50 output channels. In the standalone app, this is really simple: you just enable however many channels your interface supports, and then inside Anukari you assign each audio input exciter or mic to the channels you want.

It also works for the plugin, but how you utilize multichannel I/O is very DAW-dependent. Testing the new feature was kind of a pain in the butt, because I have about 15 DAWs for testing, and multichannel is a bit of an advanced feature, so I ended up watching a zillion tutorial videos. Every DAW approaches it a bit differently, and the UX is generally somewhat buried since it's a niche feature. But it works everywhere, and it is extremely cool to be able to map a bunch of mics to their own DAW tracks and give them independent effects chains and so on.

Behind the scenes, it was really important to me that the multichannel support not impact performance, especially when it's not in use. I'm very happy to say I achieved this goal. When you're not using multichannel I/O, there is zero performance impact. And even in 50x50 mode the impact is very low. Anukari is well-suited for multichannel I/O since each mic is tapping into the same physics simulation at different points/angles, so none of the physics computations have to be repeated. Really the only overhead is copying additional buffers into and out of the GPU. On the Windows CUDA backend, that's a single DMA memcpy, which is very fast. And on the macOS Metal backend, it's unified memory, so there's no overhead at all. All that remains is the CPU-to-CPU copy into the DAW audio buffers, which is very, very fast.
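
That final CPU-side hop is about as cheap as it sounds. Here's a rough sketch with illustrative names (not Anukari's actual code): one memcpy per active mic into its user-assigned DAW channel buffer.

#include <string.h>

void copy_mic_blocks(const float* gpu_mic_block, /* all mics, contiguous */
                     float** daw_output,         /* per-channel buffers */
                     const int* mic_channel,     /* user's channel mapping */
                     int num_active_mics, int block_size) {
    for (int m = 0; m < num_active_mics; ++m)
        memcpy(daw_output[mic_channel[m]],
               gpu_mic_block + (size_t)m * block_size,
               (size_t)block_size * sizeof(float));
}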

I look forward to posting about the LiveLab exhibit once it happens.

ASIO Support

It's a pretty big oversight that the Windows version of the Anukari Beta launched without ASIO support. I'm not quite sure how I missed this important feature, but I've added it now.

I think I always assumed it was there, but when using JUCE, ASIO support is not enabled by default, because you need a countersigned agreement from Steinberg to use their headers to integrate with ASIO. I already had a signed agreement with them for the VST3 support, but ASIO is a completely separate legal agreement, so I went through the steps to get that as well.

ASIO support makes the standalone app perform much better (in terms of latency) for people with ASIO-compatible audio interfaces.

AMD Radeon Crashes

Officially speaking, Anukari explicitly does not support AMD Radeon hardware. This is a bit of a long story, which at some point I will write about in more detail. But the short version is that the Radeon drivers are incredibly inconsistent across the Radeon hardware lineup, which makes it extremely difficult to offer full support. For some Radeon users, Anukari works perfectly; for others it is unstable, glitchy, or crashes, each in its own unique way.

The story I'll write about for this devlog entry, though, is the extremely frustrating case that I solved for users that have both an AMD Radeon and an NVIDIA graphics card in the same machine. This is actually a common situation, because many (most? all?) AMD Ryzen CPUs include integrated Radeon graphics on the CPU die. So for example there are a lot of laptops that come with an NVIDIA graphics card, but also have a sort of "vestigial" Radeon in the CPU that is normally not used for anything.

In the past, Anukari just worked for users with this configuration: when it detected multiple possible GPUs to use for the simulation, it would automatically select the CUDA one as the default. However, in the 0.9.6 release, Anukari began crashing instantly at startup for these users.

This was pretty confusing, because I have comprehensive fuzz and golden tests that exercise all the physics backends (CUDA, OpenCL, Metal). These tests abuse the simulation to an extreme extent, and I run them under various debugging/lint tools to make sure that there are no GPU memory errors, etc. And across my NVIDIA, macOS, and Intel Iris chips, they all work perfectly.

Luckily I had a user who was extremely generous with their time in helping me debug the issue. I sent them instrumented Anukari binaries, and eventually I was able to pinpoint that it was crashing inside the clBuildProgram() call.

Now, you might think that what I mean is that clBuildProgram() was returning an error code and I was somehow not handling it. No, Anukari is extremely robust about error checking. I mean that it was crashing inside the driver, and clBuildProgram() was not returning at all because the process was aborting. This is with perfectly valid arguments to the function. So, obviously, this is a horrible bug in the AMD drivers. Even if the textual content of the kernel has, say, a syntax error, clearly clBuildProgram() should return an error code rather than crash.

The really fun part is that I've only seen this crash on the hardware identifying as gfx90c. On other Radeons, this does not happen (though some of them fail in other ways). This is what I mean about the AMD drivers being extremely inconsistent.

Now, as to why this crash happened at startup: during device discovery, Anukari was compiling the physics kernel on each device, and any device where compilation failed would be assumed incompatible and omitted from the list of possible backends. I added this feature after encountering other broken OpenCL implementations, like the Microsoft OpenCL™, OpenGL®, and Vulkan® Compatibility Pack, which is an absolute disaster.
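
The discovery filter was conceptually simple. Roughly, with hypothetical helper names, it looked like the sketch below. Note that the guard only works if clBuildProgram returns on failure, which is exactly what gfx90c doesn't do, so the check never got a chance to run:

#include <CL/cl.h>

void add_usable_backend(cl_device_id d);  /* hypothetical helper */

static void probe_devices(cl_context* contexts, cl_device_id* devices,
                          cl_uint num_devices, const char* kernel_src) {
    for (cl_uint d = 0; d < num_devices; ++d) {
        cl_program p = clCreateProgramWithSource(
            contexts[d], 1, &kernel_src, NULL, NULL);
        cl_int err = clBuildProgram(p, 1, &devices[d], "", NULL, NULL);
        clReleaseProgram(p);
        if (err != CL_SUCCESS)
            continue;  /* assumed incompatible, like the Microsoft
                          compatibility pack */
        add_usable_backend(devices[d]);
    }
}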

So the workaround for now is that Anukari no longer does a test compilation to detect bad backends. This resolves the issue, although if the user manually chooses the Radeon backend on gfx90c it will unrecoverably crash Anukari.

Longer-term, given the Radeon driver bugs, I doubt I'll ever be able to fully support gfx90c, but I ordered a cheap used laptop off eBay with that chip in it so that I can at least narrow down what OpenCL code is causing the driver to crash. I know it's something the driver doesn't like about the OpenCL code, because it did not always crash, and the only thing that has changed in the meantime is some improvements to that code. Hopefully I can find a workaround to avoid the driver bug, but if not, I might add a rule in Anukari to ignore all gfx90c chips.

(Side-note: actually the first used laptop with a gfx90c chip that I bought off eBay was bluescreening at boot, so I had to buy a second one. These inexpensive Radeon laptops are really bad.)

Not all hope is lost for Radeon support. I recently upgraded my main development machine, and the Ryzen CPU I bought has an on-die Radeon, and it works flawlessly with Anukari. So maybe what I will be able to do one day is create an allow-list for Radeon devices that work correctly without driver issues. Sigh. It is so much easier with NVIDIA and Apple.

Parameter Randomization

Unlike the features above, this one hasn't been released yet, but I recently completed work to allow parameters to be randomized.

For Anukari this turned out to be a bit of a design challenge, since the sliders that are used to edit parameters are a bit complex already. The tricky bit is that if the user has a bunch of entities selected, the slider edits them all. And if the parameter values for each entity vary, the slider turns into a "range editor" which can stretch/squeeze/slide the range of values.

So the randomize button needs to handle both the "every selected object has the same parameter value" and "the parameter varies" scenarios. For the first scenario with a singleton value, it's simple: pressing the button just picks a random value across the full range of the parameter and assigns it to all the objects.

But for the "range editor" scenario, what you really want is for the randomize button to pick different random values for each entity, within the range that you have chosen. There's one tricky issue here, which is that it is very normal for the user to want to mash the randomize button repeatedly until they get a result they like. This will result in the range of values shrinking each time (since it's very unlikely that the new random values will have the same range as before, and the range can only be smaller)!

So the slider needs to remember the original range when the user started mashing the randomize button, and to reuse that original range for each randomization. This allows button mashing without having the range shrink to nothing. It's important, though, that this remembered range is forgotten when the user adjusts the slider manually, so that they can choose a new range to randomize within.
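
Here's a sketch of that behavior in C, with hypothetical types and names (Anukari's actual slider code is JUCE-based and more involved): the range is latched on the first press, reused while the button is mashed, and forgotten on a manual edit.

#include <stdlib.h>

typedef struct {
    float lo, hi;                  /* current range across the selection */
    float latched_lo, latched_hi;  /* range captured when mashing began */
    int   has_latch;
} RangeSlider;

static float rand_uniform(float lo, float hi) {
    return lo + (hi - lo) * ((float)rand() / (float)RAND_MAX);
}

static void randomize(RangeSlider* s, float* values, int n) {
    if (!s->has_latch) {  /* first press: remember the user's range */
        s->latched_lo = s->lo;
        s->latched_hi = s->hi;
        s->has_latch = 1;
    }
    /* Re-rolling within the original range means mashing the button
       can't shrink the range toward nothing. */
    for (int i = 0; i < n; ++i)
        values[i] = rand_uniform(s->latched_lo, s->latched_hi);
}

static void on_manual_edit(RangeSlider* s) {
    s->has_latch = 0;  /* user chose a new range to randomize within */
}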

Another kind of weird case is when the slider is currently in singleton mode, meaning that all the entities have the same parameter value, and the user wants to spread them out randomly over a range. This could be done by deselecting the group of entities, selecting just one of them, changing its value, then reselecting the whole group, which would put the slider into range mode. But that's awfully annoying.

I ended up adding a feature where you can now right-click on a singleton slider, and it will automatically be split into a range slider. The lower/upper values for the range will be just slightly below/above the singleton value, and the values will be randomly distributed inside that range. So now you can just right-click to split, adjust the range, and mash the randomize button.

Getting more and more stable

Captain's Log: Stardate 78592.1

The buffer-clearing saga

Adding the new AnukariEffect plugin has ended up precipitating a lot of improvements to Anukari, because it pushed me into testing what happens when multiple instances of the plugin are running at the same time. Most of my testing is done in the standalone Anukari application, which loads extremely quickly, so it's nice for rapidly iterating on a new UX change, etc. But in reality, it's likely that users will mostly use Anukari as a plugin, so obviously I need to give that configuration ample attention.

The last big issue I ran into with the plugin was that in GarageBand, when loading a song that had something like 6 instances of Anukari and AnukariEffect, sometimes one of the instances would mysteriously fail. The GPU code would initialize just fine, but the GPU call to process the first audio block would fail with the very helpful Metal API error: Internal Error (0000000e:Internal Error), unknown reason.

After some research, it turned out that to get a more detailed error from the Metal API, you have to explicitly enable it with MTLCommandBufferDescriptor::errorOptions, and then dig it out of the NSError.userInfo map in an obscure and esoteric manner. So I had my intern (ChatGPT) figure out how to do that and finally I got a "more detailed" error message from the Metal API: IOGPUCommandQueueErrorDomain error 14.

If you've followed my devlog for a while, it should come as no surprise that I am a bit cynical about Apple's developer documentation. So I was completely unsurprised to find that this error is not documented anywhere in Apple's official documents. Apple just doesn't do that sort of thing.

Anyway, I found various mentions of similar errors, with speculation that they were caused by invalid memory accesses, or by kernels that ran too long. I used the Metal API validation tools to check for any weird memory accesses, and they didn't find anything. I figured they wouldn't, since I have some pretty abusive fuzz tests that I've run with Metal API validation enabled, and an invalid memory access almost certainly would have shown up before.

So I went with the working hypothesis that the kernel was running too long and hitting some kind of GPU watchdog timer. But this was a bit confusing, since the Anukari physics simulation kernel is, for obvious reasons, designed to be extremely fast. With some careful observation and manual bisection of various code features, I realized that it was definitely not the physics kernel, but rather it was the kernel that is used to clear the GPU-internal audio sample buffer.

Some background: Anukari supports audio delay lines, and so it needs to be able to store 1 second of audio history for each Microphone that might be tapped by a delay line. To avoid allocations during real-time audio synthesis, memory is allocated up-front for the maximum number of Microphones, which is 50. But also note that there can be 50 microphones per voice instance, and there can be 16 voice instances. Long story short, the per-microphone, per-instance, per-channel buffer for 1 second of audio is about 300 MB, which is kind of huge.
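
As a quick sanity check on that number (assuming 48 kHz stereo float32 history; the exact sample rate isn't stated here, so this is back-of-envelope):

#include <stdio.h>

int main(void) {
    /* 50 mics x 16 voices x 2 channels x 48000 samples/s x 4 bytes */
    long long bytes = 50LL * 16 * 2 * 48000 * 4;
    printf("%.1f MB\n", bytes / 1e6);  /* prints 307.2 MB, i.e. ~300 MB */
    return 0;
}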

It's obvious that clearing such a buffer needs to be done locally on the GPU, since transferring a bunch of zeros from the CPU to the GPU would be stupid and slow. So Anukari had a kernel that would clear the buffer at startup, or at other times when it was considered "dirty" due to various possible events (or if the user requested a physics reset).
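
Conceptually, that clearing kernel is about as simple as GPU code gets; a sketch (not Anukari's actual code):

__kernel void clear_audio_history(__global float* buf) {
    /* One work-item per sample: a parallel memset to silence. */
    buf[get_global_id(0)] = 0.0f;
}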

Now imagine 6 instances of Anukari all being initialized in parallel, each trying to clear 300 MB of RAM -- that's nearly 2 GB of memory writes. And sometimes one of those kernels would get delayed or slowed enough to time out. The problem only gets worse with more instances.

Initially I considered a bunch of ideas for how to clear this memory in a more targeted way. We could clear only the memory for microphones that are actually in use, but then we'd have to track which microphones are live. And given the way the memory is strided, it's not at all clear that this would help, because we'd still be touching a huge swath of memory.

I came up with a number of other schemes of increasing complexity, which was unsatisfying because complexity is basically my #1 enemy at the moment. Almost all the bugs I'm wrangling at this point have to do with things being so complex that there were corner-cases that I didn't handle.

At this point you might be asking yourself: why does all this memory need to be cleared, anyway? That's a good question, which I should have asked earlier. The simple answer is that if a new delay line is created, we want to make sure that the audio samples it reads are silent in the case that they haven't been written yet by their associated microphone. For example, at startup.

But then that raises the question: couldn't we just avoid reading those audio samples somehow? For example, by storing information about the oldest sample number for which the data in a given sample stream is valid, and consulting that low-watermark before reading the samples.

The answer is yes, we could do that instead. And in a massive face-palm moment, I realized that I had already implemented this timestamp for microphones. So in other words, the memory clearing was completely unnecessary, because the GPU code was already keeping track of the oldest valid audio sample for each stream. I think what happened is that I wrote the buffer-clearing code before the low-watermark code, and forgot to remove the buffer-clearing code. And then forgot that I wrote the low-watermark code.

Well, that's not quite the whole story. In addition to the 50 microphone streams, there are 2 streams to represent the stereo external audio input, which can also be tapped by delay lines (to inject audio into the system as an effect processor). This data did not have a low-watermark, and thus the clearing was important.

However for external audio, a low-watermark is much simpler: it's just sample number 0. This is because external audio is copied into the GPU buffer on every block, and so it never has gaps. The Microphone streams can have gaps, because a Microphone can be deleted and re-added, etc. But for external audio, the GPU code just needs to check that it's not reading anything prior to sample 0, and after that it can always assume the data is valid.
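
In kernel terms, the guard is just a bounds check against the oldest valid sample. A sketch with hypothetical names (for the external audio streams, oldest_valid is simply 0):

float read_delayed_sample(__global const float* ring,
                          long sample_index,
                          long oldest_valid,   /* 0 for external audio */
                          long ring_samples) {
    if (sample_index < oldest_valid)
        return 0.0f;  /* never written yet: return silence */
    return ring[sample_index % ring_samples];
}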

Thus ultimately the fix here was to just add 2 lines of GPU code to check the buffer access for external audio streams, and then to delete a couple hundred lines of CPU/GPU code responsible for clearing the internal buffer, marking it as dirty, etc. This resulted in a noticeable speedup for loading Anukari and completely solved the issue of unreliable initialization in the presence of multiple instances.

Pre-alpha release 0.0.13

With the last reliability bug (that I know of) solved, I was finally able to cut a new pre-alpha release this Friday. I'm super stoked about this release. It has a huge number of crash fixes, bug fixes, and usability enhancements. It also turned out to be the right time to add a few physics features that I felt were necessary before the full release. The details of what's in 0.0.13 are in the release notes and in older devlog entries, so I won't go into them here, but this release is looking pretty dang good.

The next two big things on my radar are AAX support and more factory presets. On the side I've been working to get the AAX certificates, etc., needed to release an AAX plugin, and I think that it should be pretty straightforward to get this working (famous last words). And for factory presets, I have about 50 right now but would like to release with a couple hundred. This is especially important now that I've added AnukariEffect, since only a couple of the current presets are audio effects -- most of them are instruments. So I'm kind of starting from scratch there. I think it's pretty vital to have a really great library of factory presets for both instruments and effects, and also, working on them is a great way to find issues with the plugin.
