
Getting more and more stable

Captain's Log: Stardate 78592.1

The buffer-clearing saga

Adding the new AnukariEffect plugin has ended up precipitating a lot of improvements to Anukari, because it pushed me into testing what happens when multiple instances of the plugin are running at the same time. Most of my testing is done in the standalone Anukari application. It loads extremely quickly, so it's nice for quickly iterating on a new UX change, etc. But in reality, it's likely that users will mostly use Anukari as a plugin, so obviously I need to give that configuration ample attention.

The last big issue I ran into with the plugin was that in GarageBand, when loading a song that had something like 6 instances of Anukari and AnukariEffect, sometimes one of the instances would mysteriously fail. The GPU code would initialize just fine, but the GPU call to process the first audio block would fail with the very helpful Metal API error: Internal Error (0000000e:Internal Error), unknown reason.

After some research, it turned out that to get a more detailed error from the Metal API, you have to explicitly enable it with MTLCommandBufferDescriptor::errorOptions, and then dig it out of the NSError.userInfo map in an obscure and esoteric manner. So I had my intern (ChatGPT) figure out how to do that, and finally I got a "more detailed" error message from the Metal API: IOGPUCommandQueueErrorDomain error 14.
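
For reference, the incantation looks roughly like this (an Objective-C++ sketch; `queue` stands in for your MTLCommandQueue, and the surrounding encode/commit code is elided):

    // Opt in to detailed command buffer errors via the descriptor.
    MTLCommandBufferDescriptor *desc = [[MTLCommandBufferDescriptor alloc] init];
    desc.errorOptions = MTLCommandBufferErrorOptionEncoderExecutionStatus;
    id<MTLCommandBuffer> buffer = [queue commandBufferWithDescriptor:desc];
    // ... encode kernels, commit ...
    [buffer waitUntilCompleted];
    if (buffer.error != nil) {
      // The per-encoder details are buried in userInfo under this key.
      NSArray<id<MTLCommandBufferEncoderInfo>> *infos =
          buffer.error.userInfo[MTLCommandBufferEncoderInfoErrorKey];
      for (id<MTLCommandBufferEncoderInfo> info in infos) {
        NSLog(@"encoder '%@' error state: %ld", info.label, (long)info.errorState);
      }
    }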

If you've followed my devlog for a while, it should come as no surprise that I am a bit cynical about Apple's developer documentation. So I was completely unsurprised to find that this error is not documented anywhere in Apple's official documents. Apple just doesn't do that sort of thing.

Anyway, I found various mentions of similar errors, with speculation that they were caused by invalid memory accesses, or by kernels that ran too long. I used the Metal API validation tools to check for bad memory accesses, and they didn't find anything. I figured they wouldn't, since I have some pretty abusive fuzz tests that I've run with Metal API validation enabled, and an invalid memory access almost certainly would have shown up before.

So I went with the working hypothesis that the kernel was running too long and hitting some kind of GPU watchdog timer. But this was a bit confusing, since the Anukari physics simulation kernel is, for obvious reasons, designed to be extremely fast. With some careful observation and manual bisection of various code features, I realized that it was definitely not the physics kernel, but rather it was the kernel that is used to clear the GPU-internal audio sample buffer.

Some background: Anukari supports audio delay lines, and so it needs to be able to store 1 second of audio history for each Microphone that might be tapped by a delay line. To avoid allocations during real-time audio synthesis, memory is allocated up-front for the maximum number of Microphones, which is 50. But also note that there can be 50 Microphones per voice instance, and there can be 16 voice instances. Long story short, across all microphones, voice instances, and channels, the buffer for 1 second of audio is about 300 MB, which is kind of huge.
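
To put numbers on it (my arithmetic, assuming stereo 32-bit float samples at a 48 kHz sample rate): 50 microphones × 16 voice instances × 2 channels × 48,000 samples × 4 bytes ≈ 307 MB, which lines up with that figure.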

It's obvious that clearing such a buffer needs to be done locally on the GPU, since transferring a bunch of zeros from the CPU to the GPU would be stupid and slow. So Anukari had a kernel that would clear the buffer at startup, or at other times when it was considered "dirty" due to various possible events (or if the user requested a physics reset).

Now imagine 6 instances of Anukari all being initialized in parallel, and each instance is trying to clear 300 MB of RAM -- that's multiple gigabytes of memory write bandwidth. And sometimes one of those kernels would get delayed or slowed enough to time out. The problem only gets worse with more instances.

Initially I considered a bunch of ideas for how to clear this memory in a more targeted way. We might clear only the memory for microphones that are actually in use. But then we have to track which microphones are live. And also, the way the memory is strided, it's not all that clear that this would help, because we'd still be touching a huge swath of memory.

I came up with a number of other schemes of increasing complexity, which was unsatisfying because complexity is basically my #1 enemy at the moment. Almost all the bugs I'm wrangling at this point have to do with things being so complex that there were corner-cases that I didn't handle.

At this point you might be asking yourself: why does all this memory need to be cleared, anyway? That's a good question, which I should have asked earlier. The simple answer is that if a new delay line is created, we want to make sure that the audio samples it reads are silent in the case that they haven't been written yet by their associated microphone. For example, at startup.

But then that raises the question: couldn't we just avoid reading those audio samples somehow? For example, by storing information about the oldest sample number for which the data in a given sample stream is valid, and consulting that low-watermark before reading the samples.

The answer is yes, we could do that instead. And in a massive face-palm moment, I realized that I had already implemented this timestamp for microphones. So in other words, the memory clearing was completely unnecessary, because the GPU code was already keeping track of the oldest valid audio sample for each stream. I think what happened is that I wrote the buffer-clearing code before the low-watermark code, and forgot to remove the buffer-clearing code. And then forgot that I wrote the low-watermark code.

Well, that's not quite the whole story. In addition to the 50 microphone streams, there are 2 streams to represent the stereo external audio input, which can also be tapped by delay lines (to inject audio into the system as an effect processor). This data did not have a low-watermark, and thus the clearing was important.

However, for external audio, the low-watermark is much simpler: it's just sample number 0. This is because external audio is copied into the GPU buffer on every block, and so it never has gaps. The Microphone streams can have gaps, because a Microphone can be deleted and re-added, etc. But for external audio, the GPU code just needs to check that it's not reading anything prior to sample 0, and after that it can always assume the data is valid.
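
As a rough sketch of what that read guard looks like (illustrative C++ names, with the history length assumed to be 1 second at 48 kHz -- this is not Anukari's actual GPU code):

    constexpr long kHistoryLength = 48000;  // 1 second of history at 48 kHz (assumed)

    // Returns the delay-line sample, or silence if it predates valid data.
    float ReadDelayTap(const float *stream, long sample, long oldestValid) {
      // oldestValid is the per-stream low-watermark for Microphone streams,
      // and simply 0 for the external audio input streams.
      if (sample < oldestValid) return 0.0f;   // never written: read as silence
      return stream[sample % kHistoryLength];  // ring-buffer history indexing
    }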

Thus ultimately the fix here was to just add 2 lines of GPU code to check the buffer access for external audio streams, and then to delete a couple hundred lines of CPU/GPU code responsible for clearing the internal buffer, marking it as dirty, etc. This resulted in a noticeable speedup for loading Anukari and completely solved the issue of unreliable initialization in the presence of multiple instances.

Pre-alpha release 0.0.13

With the last reliability bug (that I know of) solved, I was finally able to cut a new pre-alpha release this Friday. I'm super stoked about this release. It has a huge number of crash fixes, bug fixes, and usability enhancements. It also turned out to be the right time to add a few physics features that I felt were necessary before the full release. The details of what's in 0.0.13 are in the release notes and in older devlog entries, so I won't go into them here, but this release is looking pretty dang good.

The next two big things on my radar are AAX support and more factory presets. On the side I've been working to get the AAX certificates, etc., needed to release an AAX plugin, and I think that it should be pretty straightforward to get this working (famous last words). And for factory presets, I have about 50 right now but would like to release with a couple hundred. This is especially important now that I've added AnukariEffect, since only a couple of the current presets are audio effects -- most of them are instruments. So I'm kind of starting from scratch there. I think it's pretty vital to have a really great library of factory presets for both instruments and effects, and also, working on them is a great way to find issues with the plugin.

More workarounds for Apple

Captain's Log: Stardate 78573.2

Automatic Bypassing Workaround

While testing the new AnukariEffect plugin in various DAWs for compatibility, I found that it was doing some very strange stuff in GarageBand (and Logic Pro, which seems to share the same internals). I had noticed weird stuff in GarageBand before even with the instrument plugin, and had a TODO to do a deep dive, so I figured that now was as good a time as any to finally get the plugin working well with Apple's DAWs.

What I had seen in the past with the Anukari (instrument) plugin was that sometimes the physics simulation would inexplicably stop working. I had seen this at GarageBand startup, but also after it had been open for a while. I couldn't see any reason in Anukari's logs for the problem, and occasionally it would just start working again. But this was fairly rare and I hadn't had time to find a way to reproduce it.

But with the AnukariEffect plugin, this was happening constantly. Since it was easy to reproduce, I pretty quickly found out that GarageBand will simply stop calling into the plugin's ProcessBlock function, which is where audio processing happens, and in Anukari, is where the physics simulation occurs.

It turns out that GarageBand is extremely aggressive about this. It has some heuristics about when a plugin is no longer producing audio, and at that time it will stop calling into the plugin, to save CPU/power. For example, for an instrument, if it hasn't received MIDI input in a while it might be automatically bypassed. And for an effect, if the track is not playing or the effect is not receiving audio input, it will be automatically bypassed.

This is reasonable behavior, and other DAWs do it too. But in a VST3 plugin (as opposed to an AudioUnit), for example, the plugin can specify its number of "tail samples" as kInfiniteTail -- in other words, it can state "I might keep generating audio samples forever, even without input." VST3 plugins can also set their sub-type to Generator, which likewise communicates to the DAW that they might continue to generate audio without input. (Note that an AudioUnit can be a generator at the top level, but instruments/effects can't also be generators. Which is a pretty big oversight.)
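
For what it's worth, in a JUCE-based plugin (a sketch under that assumption, not necessarily how Anukari does it), signaling an infinite tail is a one-liner, and JUCE's VST3 wrapper translates it to kInfiniteTail for the host:

    // In your juce::AudioProcessor subclass (requires #include <limits>):
    double getTailLengthSeconds() const override {
      return std::numeric_limits<double>::infinity();
    }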

And other DAWs that do aggressive automatic bypassing, like Cubase or Nuendo, provide an option to disable the feature. But of course having a knob like this is anathema to Apple, especially in GarageBand, and thus it cannot be disabled.

Anyway, for an instrument or effect plugin running in an Apple DAW, aggressive automatic bypassing is just a fact of life. And if that plugin is a continuous physics simulation like Anukari, this is a huge problem, because the simulation part of the plugin will become unresponsive, and furthermore, weird discontinuous things may happen if it is bypassed and un-bypassed at inopportune moments.

So as usual for working with Apple, the solution to Apple's oversimplification of the problem is to push more complexity into the non-Apple software: Anukari can now detect that it has been automatically bypassed, and will seamlessly transfer ownership of the physics simulation to a background thread. When DAW processing resumes, it seamlessly transfers ownership back. This is optional (but highly recommended), so users who really need to save power can disable it.

This really is much more complicated than I'd like, partly because Apple doesn't provide any indication that the plugin is bypassed. From what I can tell there's no notification whatsoever, except that ProcessBlock stops getting called. So detecting this condition requires a keepalive timer and a background thread that monitors it. Once it detects that ProcessBlock hasn't run for too long, it begins running the simulation directly. Then when ProcessBlock resumes being called, it detects that the keepalive is fresh again and stops.

There are some very tricky details here to do all this reliably in the real-time audio thread without priority inversion issues with a mutex. The keepalive timer is an atomic, and the monitoring thread never acquires the mutex unless the keepalive is stale. This does mean that the audio thread has to acquire the mutex for each audio block, but because we use the atomic keepalive timer to guarantee that the mutex will never be contended, this is OK, because on all the platforms where Anukari will run, an uncontended mutex acquisition is simply an atomic CAS operation. (This is a great tip I learned from Fabian Renn-Giles in his excellent ADC23 talk.)

There is one moment where the audio thread's mutex acquisition could be contended, which is when the automatic bypass is being lifted. The monitoring thread may be holding it while running the simulation itself. This is not a big deal though, because the monitoring thread releases the mutex after simulating each small sample block. The audio thread will try to acquire the mutex, fail, and return a silent buffer. But in doing so it will update the keepalive timer, and next time it runs it will acquire the mutex without contention. The reason this dropped block is not a big deal is that we're coming back from being bypassed anyway -- this just adds a few samples of latency before audio starts. No problem.
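
Here's a compressed sketch of the scheme (all names and the staleness threshold are mine, for illustration):

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <mutex>

    std::atomic<int64_t> lastBlockNanos{0};  // keepalive, written every block
    std::mutex simMutex;                     // owns the simulation state

    static int64_t NowNanos() {
      return std::chrono::duration_cast<std::chrono::nanoseconds>(
          std::chrono::steady_clock::now().time_since_epoch()).count();
    }

    // Audio thread: called by the DAW for every audio block.
    void ProcessBlock() {
      lastBlockNanos.store(NowNanos(), std::memory_order_release);
      if (simMutex.try_lock()) {  // uncontended except while bypass is lifting
        // ... run the physics simulation, produce this block's audio ...
        simMutex.unlock();
      }
      // else: the monitor thread still owns the simulation; emit silence, and
      // the fresh keepalive above tells it to back off before our next block.
    }

    // Monitor thread: keeps the simulation alive while the DAW bypasses us.
    void MonitorLoop() {
      constexpr int64_t kStaleNanos = 100'000'000;  // 100 ms (assumed)
      for (;;) {
        if (NowNanos() - lastBlockNanos.load(std::memory_order_acquire) >
            kStaleNanos) {
          std::lock_guard<std::mutex> lock(simMutex);
          // ... simulate one small sample block, then release the mutex ...
        }
        // ... sleep briefly before checking the keepalive again ...
      }
    }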

There's one last detail: while all the above complexity keeps the physics simulation running, so the user can continue to interact with the plugin, there will not be any audio output while it's bypassed. This cannot be fixed, and could be a little confusing. So Anukari now displays a pulsating "BYPASSED" message on the master output level meter when it is in this state. And that message has a tooltip explaining how the DAW is doing potentially annoying things.

Less Waste Still Makes Haste

In my previous post Waste Makes Haste I wrote about how Anukari has to run a spin loop on a single GPU core to convince MacOS to actually clock up the GPU so that it performs well enough for Anukari to function.

That workaround continues to be extremely effective. However, while testing multiple Anukari instances, I realized that each instance was running a spin loop on the GPU, so e.g. 4 instances would run 4 spin loops. Running the one spin loop is pretty stupid, but it gets the job done and is well worth it. But running 4 spin loops is purely wasteful, since only one is required to get the GPU clocked up.

Fixing this requires coordination among all Anukari audio threads within the same process. Somehow a single audio thread needs to run the spin loop, and the others need to just do regular audio processing. But if the first thread running the loop is e.g. bypassed, another thread needs to pick up the work, and so on.

I ended up devising another overly-complicated solution here, which is to use another shared atomic keepalive timer. Each audio thread checks it periodically to see if it has expired, and if so, attempts a CAS to update it. If that CAS fails, it means some other thread got to it first. If the CAS succeeds, it means that this thread now owns the spin loop and needs to keep updating the keepalive. There are some other details, but this algorithm turned out to be mercifully easy to get right with just a couple of CAS operations and a nano timer. (And it doesn't even require that each thread sees only unique nano timestamps!)
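
A sketch of the election logic (illustrative names, with an assumed staleness window):

    #include <atomic>
    #include <cstdint>

    std::atomic<int64_t> spinKeepaliveNanos{0};      // shared within the process
    constexpr int64_t kSpinStaleNanos = 50'000'000;  // 50 ms (assumed)

    // Called periodically from each audio thread in the process. Returns true
    // if this thread just won ownership of the GPU spin loop.
    bool MaybeClaimSpinLoop(int64_t now) {
      int64_t last = spinKeepaliveNanos.load(std::memory_order_relaxed);
      if (now - last <= kSpinStaleNanos) return false;  // someone is tending it
      // Expired: race to claim it. Exactly one CAS can succeed, even if two
      // threads read the same 'last' -- no unique timestamps required.
      return spinKeepaliveNanos.compare_exchange_strong(
          last, now, std::memory_order_acq_rel);
    }

    // The current owner refreshes the keepalive while it runs the spin loop.
    void RefreshSpinKeepalive(int64_t now) {
      spinKeepaliveNanos.store(now, std::memory_order_release);
    }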

An alternative solution would be to have an entirely separate thread run the GPU spin loop, instead of having an audio thread be responsible for tending to it. This could also be a good solution. However it would require its own tricky details so that the spin loop would pause when all audio threads were bypassed. And also it would require initializing some Metal state that each audio thread already initializes anyway. I will probably keep the current solution unless it proves unreliable, in which case I'll move to this alternative.

Mouse Cursor Hell

The last workaround I spent time on this past week was making custom mouse cursors work well in GarageBand and Logic.

Since Anukari has a somewhat sophisticated 3D editor, custom mouse cursors are very useful for helping make it clear what is happening. So for example, when the user drags the right mouse button to rotate the camera, the mouse cursor changes to a little rotation icon for the duration of the drag, and then goes back to being a pointer when the button is released.

Or, well, it goes back to being a pointer in every DAW except GarageBand and Logic, because of course Apple is doing something fucking weird with the mouse cursor in their DAWs. Humorously, as I investigated this issue, I discovered that the mouse cursor often gets stuck in GarageBand/Logic even without plugins, and that users have been complaining about this for at least 10 years. One user in a forum post basically said, "don't worry about the busted mouse cursors so much, you just get used to it." So Apple has been ignoring an obvious mouse cursor bug for a decade. Sounds about right.

Anyway, I narrowed down the problem to the fact that changing the mouse cursor using [NSCursor set] inside of a mouseDown or mouseUp event sometimes doesn't work. Err, it does work, in the sense that the call succeeds, and if you call [NSCursor currentCursor] it will return the one you just set. But visually the cursor will not change.

I tried about a billion things, and ultimately ended up with a workaround that force-sets the mouse cursor inside the next mouseMove or mouseDrag event (described in more detail here on the JUCE forums). This is not perfect, but it's pretty good, and much better than no workaround at all.
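
The shape of the workaround looks something like this (an Objective-C++ sketch with invented names; the real logic lives in JUCE-adjacent event-handling code):

    static NSCursor *pendingCursor = nil;

    // Called when the app wants to change the cursor (e.g. on mouseUp).
    void RequestCursor(NSCursor *cursor) {
      pendingCursor = cursor;
      [cursor set];  // may be visually ignored inside mouseDown/mouseUp
    }

    // Called from the next mouseMove/mouseDrag event, where set takes effect.
    void OnMouseMoveOrDrag() {
      if (pendingCursor != nil) {
        [pendingCursor set];  // force-set; this one actually sticks
        pendingCursor = nil;
      }
    }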

Yikes... as you can tell I'm pretty sick of dealing with compatibility with Apple's DAWs. But I'm not done yet. There are two more Apple-specific issues that I'm aware of, which hopefully I can address over the next week.

The new warp alignment optimizer

As I mentioned in the previous post, I've been working on writing a better algorithm for optimizing the way that entities are aligned to GPU warps (or as Apple calls them, SIMD-groups). For the sake of conversation, let's assume that each GPU threadgroup is 1024 threads, and those are broken up into 32 warps, each of which has 32 threads. (These happen to be actual numbers from both my NVIDIA chip and my Apple M1.)

Each warp shares an instruction pointer. This is why Apple's name for them makes sense: each SIMD-group is kind of like a 32-data-wide SIMD unit. In practice these are some pretty sophisticated SIMD processors, because they can do instruction masking, allowing each thread to take different branches. But the way this works for something like "if (x) y else z" is that the SIMD unit executes the instructions for BOTH y and z on every thread: the y instructions are masked out (have no effect) for threads that take the z branch, and the z instructions are masked out for threads that take the y branch. This is not a huge deal if y and z are simple computations, but if each branch has dozens of instructions, you have to wait for each branch to be executed serially, which is slow.

Note that this penalty is only paid if there are actually threads that take both branches. If all the threads take the same branch, no masking is needed and things are fast. This is the key thing: at runtime, putting computations with similar branch flows in the same warp is much faster than mixing computations with divergent branch flows.
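
To make that concrete, here's an illustrative CUDA-style kernel (hypothetical names and stand-in work, not Anukari's actual code):

    constexpr int kSensor = 0;  // illustrative entity-type tag

    __device__ float SimulateSensor(int i) { return 0.0f; }  // stand-in work
    __device__ float SimulateBody(int i) { return 1.0f; }    // stand-in work

    __global__ void StepEntities(const int *entityType, float *out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (entityType[i] == kSensor) {
        out[i] = SimulateSensor(i);  // masked no-op for non-sensor threads
      } else {
        out[i] = SimulateBody(i);    // masked no-op for sensor threads
      }
      // A warp that is all sensors (or all bodies) executes only one branch;
      // a mixed warp pays for both branches serially.
    }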

For Anukari, the most obvious way to do this is to group entities by type. Put sensors in a warp with other sensors, put bodies in a warp with other bodies, etc. In total, including sub-types, Anukari has 11 entity types. This means that for small instruments, we can easily sort each entity type into its own warp, and get huge speedups. This is really the main advantage of porting from OpenCL to CUDA: just like Apple, NVIDIA artificially limits OpenCL to just 8 32-thread warps (256 threads). If you want to use the full 32 32-thread warps (1024 threads), you have to port to the hardware's native language. Which we really do want, because with OpenCL's 8 warps, we're much more likely to have to double up two entity types in one warp. Having 32 warps gives us a ton of flexibility.

The Algorithm

So we have 11 entity types going into 32 buckets. This is easy until we consider that an instrument may have hundreds of one entity type and zero of another, or that it might have 33 of one entity type, which doesn't fit into a single bucket, etc. What we have here is an optimization problem. It's related to bin-packing, but more complicated than the vanilla bin-packing problem, because there are additional constraints: groups of entities can be broken into sub-groups if needed, and the whole thing needs to run REALLY fast, because it happens on the audio thread whenever we need to write entities to buffers.

I'm extremely happy with the solution I ended up with. First, we simplify the problem:

  • Each entity type is grouped together into a contiguous unit, which might internally have padding but will never be separated by another entity type.
  • We do not consider reordering entity types: there is a fixed order that we put them in and that's it. This order is hand-chosen so that the most expensive entities are not adjacent to one another, and thus are unlikely to end up merged into the same warp.

A quick definition: a warp's "occupancy" will be the number of distinct types of entities that have been laid out within that warp. So if a warp contains some sensors, and some LFOs, its occupancy would be 2.

The algorithm then is as follows:

  1. Pretend we have infinite warps, and generate a layout that would be optimal. Basically, assign each entity type enough warps such that the maximum occupancy is 1. (This means that some warps might be right-padded with no-op entities.)
  2. If the current layout fits into the actual number of warps the hardware has, we are done.
  3. If not, look for any cases where dead space can be removed without increasing any warp's occupancy. (On the first iteration, there won't be any.)
  4. Increment the maximum allowable occupancy by 1, and merge together any adjacent warps that, after being merged, will not exceed this occupancy level.
  5. Go back to step 2.

That's it! This is a minimax optimizer: it tries to minimize the maximum occupancy of any warp. It does this via the maximum allowable occupancy watermark. It tries all possible merges that would stay within a given occupancy before trying the next higher occupancy.
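
Here's a simplified sketch of the merge loop (illustrative C++; it ignores the dead-space-removal pass and the initial occupancy guess described below, and assumes the entities fit within the hardware's warps at some occupancy):

    #include <algorithm>
    #include <set>
    #include <utility>
    #include <vector>

    constexpr int kWarpSize = 32;

    struct Warp {
      std::set<int> types;  // distinct entity types laid out in this warp
      int used = 0;         // threads occupied (the rest are no-op padding)
    };

    // Step 1: occupancy-1 layout over "infinite" warps, in the fixed order.
    std::vector<Warp> InitialLayout(
        const std::vector<std::pair<int, int>> &typeCounts) {
      std::vector<Warp> warps;
      for (auto [type, count] : typeCounts) {
        while (count > 0) {
          int n = std::min(count, kWarpSize);
          warps.push_back({{type}, n});  // right-padded with no-ops if n < 32
          count -= n;
        }
      }
      return warps;
    }

    // Steps 2-5: raise the occupancy watermark and merge adjacent warps
    // until the layout fits the hardware.
    std::vector<Warp> Optimize(std::vector<Warp> warps, int maxWarps) {
      int allowedOccupancy = 1;
      while ((int)warps.size() > maxWarps) {
        ++allowedOccupancy;
        std::vector<Warp> merged;
        for (const Warp &w : warps) {
          if (!merged.empty()) {
            Warp &prev = merged.back();
            std::set<int> combined = prev.types;
            combined.insert(w.types.begin(), w.types.end());
            if ((int)combined.size() <= allowedOccupancy &&
                prev.used + w.used <= kWarpSize) {
              prev.types = std::move(combined);
              prev.used += w.used;
              continue;  // merged into the previous warp
            }
          }
          merged.push_back(w);
        }
        warps = std::move(merged);
      }
      return warps;
    }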

There are a couple of tricks to make this efficient, but the main one is that at the start of the algorithm, we make a conservative guess as to what the maximum occupancy will have to be. This way, if the solution requires occupancy 11 (say there are 1000 bodies and 1 of each remaining entity type, so the last warp has to contain all 11 types), we don't have to waste time merging things for occupancy 2, 3, 4, ... 11. It turns out that it's quite easy to guess within 1-2 of the true occupancy most of the time. I wrote a fuzz test for the algorithm, and across 5,000 random entity distributions the worst case was 5 optimizer iterations, and that's rare. Anyway, it's plenty fast enough.

The solutions the optimizer produces are excellent. In cases where there's a perfect solution available, it always gets it, because that's what it tries first. And in typical cases where compromise is needed, it usually finds solutions that are as good as what I could come up with manually.

Results

That's all fine and good, but does it work? Yes. Sadly I don't have a graph to share, because the new optimizer doesn't help with my microbenchmarks -- those are all tiny instruments for which the old optimizer worked fine.

But for running huge complex instruments, it is an ENORMOUS speedup, often up to 2x faster with the new optimizer. For example, for the large instrument in this demo video, it previously was averaging about 90% of the latency budget, with very frequent buffer overruns (the red clip lines in the GPU meter). That instrument is now completely usable with no overruns at all, averaging maybe 40% of the latency budget. Other benchmark instruments show even better gains, with one that never went below 100% of the latency budget before now at about 40%.

This opens up a TON more possibilities in terms of complex instruments. I think at this point, at least on Windows with NVIDIA hardware, I am completely satisfied with the performance. Apple with Metal is almost there but still needs just a tiny bit more work for me to be satisfied.
