Waste makes haste...?

Captain's Log: Stardate 78324

I've been really digging into MacOS optimizations over the last few days. Being ALU-bound is quite a pain in the butt, because unlike being memory-bound, there are far fewer big changes I can make to speed things up. Mostly I've been working on instruction-level optimizations, none of which have had a big impact. I've gotten to where I don't see anything else that is really worth optimizing at this level. I've confirmed that by simply commenting out pieces of code and measuring the speedup from not running them, which gives me an upper bound on how much optimizing each piece could help. Any further reduction in computation is not going to come from the instruction level.
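The upper-bound argument can be written down as a small calculation. This is only a sketch: the function name and the timings are hypothetical, not measurements from Anukari.

```python
def max_speedup_from_removal(baseline_ms: float, ablated_ms: float) -> float:
    """Upper bound on the speedup available from optimizing one piece of code.

    baseline_ms: time per block with all code running.
    ablated_ms:  time per block with the candidate piece commented out.
    Even an infinitely fast version of the piece cannot beat the ablated time.
    """
    if ablated_ms <= 0 or baseline_ms < ablated_ms:
        raise ValueError("expected 0 < ablated_ms <= baseline_ms")
    return baseline_ms / ablated_ms


# Hypothetical numbers, for illustration only.
bound = max_speedup_from_removal(baseline_ms=10.0, ablated_ms=9.5)
print(f"best-case speedup: {bound:.3f}x")
```

If the bound comes out close to 1.0, no amount of instruction-level tuning of that piece will matter much.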

I have a couple of ideas for larger structural/algorithmic changes, but before I move on to those, I want to eliminate a few other issues that I've noticed with MacOS.

The biggest pain I've run into is the fact that MacOS gives the user very little (almost no) control over the power/performance state of the machine. I guess this is user-friendly, as Apple magically figures out how best to clock the CPU and GPU up and down to save power, but it turns out that Apple is doing this really badly for Anukari's use case.

From what I can tell, Apple's GPU clock throttling is based on something akin to a load average. This is a concept I originally ran into from Unix, where the load average is, roughly speaking, the average number of threads that are awaiting execution at any moment. If this number is higher than the number of CPU cores, it means that there's a backlog: there are regularly threads that want to run, but can't.
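For reference, here is a minimal sketch of the Unix-style check described above, comparing the load average to the core count. `has_backlog` is a hypothetical helper name; `os.getloadavg()` is the standard-library call and is only available on Unix-like systems.

```python
import os


def has_backlog(load_average: float, cores: int) -> bool:
    """True when, on average, more threads want to run than there are cores."""
    return load_average > cores


cores = os.cpu_count() or 1
try:
    # Returns the 1, 5 and 15 minute load averages on Unix-like systems.
    one_minute, _, _ = os.getloadavg()
    print(f"1-min load {one_minute:.2f} on {cores} cores, "
          f"backlog: {has_backlog(one_minute, cores)}")
except (AttributeError, OSError):
    # Not available on this platform (for example Windows).
    print("load average not available here")
```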

On the Apple GPU, it seems that the OS won't clock it up until there's a command queue with a high load average. That makes sense for throughput-sensitive use cases, but doesn't work at all for a latency-sensitive use case. For example, for graphics, as long as the command queue isn't backing up (that is, the load average stays below 1), there's no need to up-clock the GPU: it is keeping up. But for Anukari, this heuristic doesn't work, because if it ever hits a load average even close to 1, there's no wall-clock time left over for e.g. the DAW to process the audio block. By then it's already too late, and audio glitches occur regularly even at something like 0.8.
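A rough way to see why 0.8 is already too much: treat the load as the fraction of each audio block's time budget that the GPU spends busy (an assumption made for this sketch), and look at what is left for everything else. The block size and sample rate below are hypothetical settings, not Anukari's.

```python
def block_budget_ms(block_size: int, sample_rate_hz: int) -> float:
    """Wall-clock time available to produce one audio block."""
    return 1000.0 * block_size / sample_rate_hz


def headroom_ms(block_size: int, sample_rate_hz: int, gpu_load: float) -> float:
    """Time left for everything else (DAW, other plugins) after the GPU's share."""
    return block_budget_ms(block_size, sample_rate_hz) * (1.0 - gpu_load)


# Hypothetical settings: 512-sample blocks at 48 kHz.
budget = block_budget_ms(512, 48_000)
for load in (0.5, 0.8, 1.0):
    left = headroom_ms(512, 48_000, load)
    print(f"load {load:.1f}: {left:.2f} ms left of {budget:.2f} ms")
```

A throughput-oriented heuristic sees nothing wrong until the load reaches 1.0, at which point the headroom for the rest of the audio chain is already zero.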

This is a serious problem. I asked on the Apple Developer Forums (optimistic, I know), and though I did get a thoughtful response, it wasn't from an Apple developer, and it didn't help anyway. I ran numerous experiments with queuing more simulation kernels in advance, but none of these ideas helped, because I can't really generate audio in advance: I need up-to-date MIDI input, etc.

Ultimately, the solution that I am going with is probably the stupidest code that I will have ever shipped to production in my career: Anukari is dedicating one GPU threadgroup warp to useless computation, simply to keep it fully saturated to communicate to the OS that the GPU needs to be up-clocked. So in this case, waste makes haste.

But while this is incredibly stupid, it is also incredibly effective. I got it working with the minimal amount of power usage by dedicating the smallest amount of compute possible to spinning (not even a whole threadgroup, just one warp). And it immediately gets the OS to clock up the GPU, which decreases the audio computation latency by about 40%.
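The effect can be illustrated with a toy model. To be clear, this is not Apple's actual heuristic and not Anukari's code: it assumes a hypothetical governor that picks the high clock whenever the GPU had work in flight for at least 90% of a recent window of ticks, and it assumes the high clock cuts the audio kernel's cost from 5 ticks to 3 (the roughly 40% figure mentioned above).

```python
def simulate(blocks: int, spin: bool, period: int = 10, low_cost: int = 5,
             high_cost: int = 3, window: int = 20,
             threshold: float = 0.9) -> list[int]:
    """Return the audio kernel latency (in ticks) for each audio block.

    Toy governor: looks at the fraction of the last `window` ticks in which
    the GPU had any work, and picks the high clock when it is >= threshold.
    """
    history: list[bool] = []   # one entry per tick: was the GPU busy?
    latencies: list[int] = []
    for _ in range(blocks):
        recent = history[-window:]
        utilization = sum(recent) / len(recent) if recent else 0.0
        cost = high_cost if utilization >= threshold else low_cost
        latencies.append(cost)
        for tick in range(period):
            audio_busy = tick < cost
            # The spin kernel keeps the GPU busy between audio kernels.
            history.append(audio_busy or spin)
    return latencies


print("no spin:", simulate(5, spin=False))
print("spin:   ", simulate(5, spin=True))
```

Without the spinner, the audio kernel alone never pushes utilization past the threshold, so the toy governor stays at the low clock forever. With the spinner, utilization is pinned at 100% and every block after the first runs at the high clock.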

My new golden test latency stats framework doesn't show as much of a gain as I see when using the plugin, because it was already running the GPU with smaller gaps, as it doesn't need to wait for real time to render the audio. But even in the tests, the performance improvement is dramatic. In this graph you can see the overall real time ratio for the tests. The huge drop at iteration 15 is where I implemented the GPU spin:

Here is a graph with all the tests broken out individually:

At this point, on my M1 MacBook, the standalone app performance is very usable, even for quite complicated presets like the slinky-resocube.

But... of course nothing is ever simple, and somehow while these fantastic performance gains are very observable when running as a VST3 plugin in Ableton, they are not nearly as visible when running as an AU plugin in GarageBand. I don't know what is up here, but it needs to be the next performance issue that I address, because if I can get past this, I might have performance at a "good enough" level on MacOS to start spending my time elsewhere for a while.

Rendering on MacOS in Metal

Captain's Log: Stardate 78203.4

Today I got the new rendering approach working on MacOS, using an NSView that hovers over the main editor window. It works just like on Windows, with the right-click pop-up menu correctly displaying on top (via another NSView). I have fully weeded out OpenGL from the app, using Vulkan on Windows and Metal on MacOS. That means that I'm no longer using any APIs that Apple has deprecated, which was a big blocker for releasing the production version of the app.

The renderer currently does all the camera operations correctly, so you can zoom, rotate, use orthographic views, etc., just like with the old renderer, and it all works. However, none of the entities are displayed -- it just loads the "broken helmet" glTF demo model and displays it. The fact that I can now load and render arbitrary glTF models is wonderful, because it means that I can now hire an artist for the 3D assets and get them exactly how I want them. With my custom renderer this would have been a lot trickier, since the artist would have to understand my formats.

Next I need to convert all my existing .obj models to glTF, load them in Filament, instance them, and translate/rotate them into their correct positions for display. The other thing I need to do is rework the parts of the GUI that hover over the 3D window, since that's no longer possible (except via native windows, which have to be rectangular). Both of these things are fairly straightforward, but may take a bit of time to get right.

Now the bad news: running the renderer in Metal did not fix the MacOS audio performance issues. This means that there's something really funny happening, because when I run the app in headless mode for golden tests, it performs much better. And it still performs poorly in GUI mode even if I disable the 3D renderer entirely, so it's not the 3D graphics interfering with the audio. I'm thinking the OS may have some weird heuristics about what kinds of processes to prioritize for GPU compute. So this is still an open area of investigation.

3D graphics / GPU audio interference?

Captain's Log: Stardate 78195.4

Today I finished tidying up a few loose ends from the work I did to allow multiple simulation backends (OpenCL, Metal, eventually CUDA). The main thing here was to parameterize some of the unit tests, such as the fuzz test, so that they would run against all available backends on each OS. I haven't parameterized the golden tests yet, but that's something I'll definitely do at some point.
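As an illustration of the idea (not Anukari's actual test code), here is a minimal sketch of running one fuzz-style test against every available backend. The two backends are hypothetical stand-ins that just halve their input, and the bounded-output check is an invented property for the example.

```python
import random
import unittest


# Hypothetical stand-ins for simulation backends: input samples -> output samples.
def reference_backend(samples):
    return [0.5 * s for s in samples]


def opencl_like_backend(samples):
    return [s / 2 for s in samples]


BACKENDS = {"reference": reference_backend, "opencl_like": opencl_like_backend}


class BackendFuzzTest(unittest.TestCase):
    def test_output_is_bounded_on_all_backends(self):
        rng = random.Random(1234)  # fixed seed keeps failures reproducible
        inputs = [rng.uniform(-1.0, 1.0) for _ in range(256)]
        for name, backend in BACKENDS.items():
            # subTest reports a failure per backend instead of stopping at the first.
            with self.subTest(backend=name):
                out = backend(inputs)
                self.assertEqual(len(out), len(inputs))
                self.assertTrue(all(abs(x) <= 1.0 for x in out))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(BackendFuzzTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Adding a backend then only means adding an entry to `BACKENDS`; the same test body runs against it with no other changes.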

After that, I continued work on optimizing the Metal backend. I have some changes that look fairly promising when I run isolated benchmarks, but then when running the full app the performance gains don't appear. This is interesting.

Right now my best guess for what's going on is that the MacOS OpenGL implementation is doing weird/bad stuff behind the scenes. On Windows I've established that the 3D graphics don't interfere in any measurable way with the audio thread's use of the GPU. On MacOS, though, there does seem to be interference, and it's not related to how much computation is happening -- the interference appears to be there even if Anukari doesn't actually draw any pixels. This is what makes me think that Apple's OpenGL implementation is bad.

So I'd like to rule out weird OpenGL issues as the cause for MacOS slowness. Since I eventually need to port the graphics to Metal, I am going to begin work on that now. There's no guarantee it helps with audio performance, but it might, and anyway I have to do it. Thus today I began integrating with the Google Filament library that I'm planning to use for cross-platform graphics.

© 2024 Anukari LLC, All Rights Reserved