
Captain's Log: Stardate 78195.4

Today I finished tidying up a few loose ends from the work I did to allow multiple simulation backends (OpenCL, Metal, eventually CUDA). The main thing here was to parameterize some of the unit tests, such as the fuzz test, so that they would run against all available backends on each OS. I haven't parameterized the golden tests yet, but that's something I'll definitely do at some point.
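
As a rough illustration of running one test body against every backend (a minimal sketch, not Anukari's actual test code): the `Backend` enum, `availableBackends`, and `runOnAllBackends` are hypothetical names, and the per-OS backend list is an assumption. A test framework's built-in value-parameterized tests would serve the same purpose.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical backend identifiers; the real project's names may differ.
enum class Backend { OpenCL, Metal, CUDA };

// Assumption: the set of usable backends is decided per OS at compile time.
std::vector<Backend> availableBackends() {
#if defined(__APPLE__)
    return {Backend::OpenCL, Backend::Metal};
#else
    return {Backend::OpenCL};
#endif
}

// Runs the same test body once per available backend and
// returns how many of those runs reported failure.
int runOnAllBackends(const std::function<bool(Backend)>& testBody) {
    int failures = 0;
    for (Backend backend : availableBackends()) {
        if (!testBody(backend)) ++failures;
    }
    return failures;
}
```

A fuzz test would then be written once as a function of `Backend` and passed to `runOnAllBackends`, so adding a backend later only changes `availableBackends`.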

After that, I continued work on optimizing the Metal backend. I have some changes that look fairly promising when I run isolated benchmarks, but then when running the full app the performance gains don't appear. This is interesting.

Right now my best guess for what's going on is that the macOS OpenGL implementation is doing weird/bad stuff behind the scenes. On Windows I've established that the 3D graphics don't interfere in any measurable way with the audio thread's use of the GPU. On macOS, however, there does seem to be interference, and it's not related to how much computation is happening: the interference appears to be there even if Anukari doesn't actually draw any pixels. This is what makes me think that Apple's OpenGL implementation is bad.

So I'd like to rule out weird OpenGL issues as the cause of the macOS slowness. Since I eventually need to port the graphics to Metal anyway, I am going to begin work on that now. There's no guarantee it will help with audio performance, but it might, and I have to do it regardless. Thus today I began integrating with the Google Filament library that I'm planning to use for cross-platform graphics.

Captain's Log: Stardate 78146.3

Surprisingly, today I got Anukari running on Metal. It turned out that modifying the OpenCL code so that it could be run via OpenCL or Metal was a lot simpler than I expected. The macros are not all that complicated, and the code is certainly uglier in some places, but for the most part it's not too bad. It took me a while to figure out how Metal does indexing for kernel arguments (device memory and threadgroup memory have different index spaces, for example), but that was the worst of it.
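
The indexing difference can be sketched on the host side. OpenCL's `clSetKernelArg` numbers every kernel argument in a single sequence, whereas Metal's `[[buffer(n)]]` and `[[threadgroup(n)]]` attributes each count from zero in their own table. The code below is only an illustration of that mapping; `ArgKind`, `MetalSlot`, and `assignMetalIndices` are hypothetical names, not part of either API or of Anukari.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kind of each kernel argument, listed in kernel-signature order.
enum class ArgKind { DeviceBuffer, ThreadgroupMemory };

struct MetalSlot {
    ArgKind kind;
    std::size_t index;  // index within that kind's own argument table
};

// Walks the arguments in OpenCL order (one shared sequence) and gives
// each one the next free index in its kind's separate Metal index space.
std::vector<MetalSlot> assignMetalIndices(const std::vector<ArgKind>& args) {
    std::size_t nextBuffer = 0;
    std::size_t nextThreadgroup = 0;
    std::vector<MetalSlot> slots;
    slots.reserve(args.size());
    for (ArgKind kind : args) {
        std::size_t& counter =
            (kind == ArgKind::DeviceBuffer) ? nextBuffer : nextThreadgroup;
        slots.push_back({kind, counter++});
    }
    return slots;
}
```

For a kernel whose arguments are buffer, threadgroup, buffer, the OpenCL positions 0, 1, 2 become buffer 0, threadgroup 0, buffer 1.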

It works well enough to pass basically all of the golden tests, which is very surprising. Actually it fails a few, but they're the same few that the OpenCL implementation on macOS fails. For whatever reason, those tests are extra sensitive to whatever differences there are between the M1 chip and my NVIDIA chip on Windows. So from an audio correctness standpoint, things seem to be working.
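
To make the sensitivity concrete, here is a minimal sketch of a golden comparison, assuming the tests compare rendered samples against a stored reference within a tolerance. The log does not describe how the golden tests are actually implemented, so `maxAbsError`, `matchesGolden`, and the tolerance are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest absolute per-sample difference between two buffers.
// A length mismatch or a NaN sample counts as an infinite error.
float maxAbsError(const std::vector<float>& golden,
                  const std::vector<float>& actual) {
    const float inf = std::numeric_limits<float>::infinity();
    if (golden.size() != actual.size()) return inf;
    float worst = 0.0f;
    for (std::size_t i = 0; i < golden.size(); ++i) {
        const float diff = std::fabs(golden[i] - actual[i]);
        if (std::isnan(diff)) return inf;
        if (diff > worst) worst = diff;
    }
    return worst;
}

// True when every sample is within `tolerance` of the golden reference.
bool matchesGolden(const std::vector<float>& golden,
                   const std::vector<float>& actual,
                   float tolerance) {
    return maxAbsError(golden, actual) <= tolerance;
}
```

With a comparison like this, small floating-point differences between GPUs pass or fail depending on how tight the tolerance is, which is one plausible reason a handful of tests would fail on one chip and pass on another.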

I don't yet have a good read on the performance. I slapped the rough draft implementation together very quickly and haven't yet taken the time to read through the memory allocation/ownership rules that Cocoa / Cocoa Touch use, so my implementation leaks memory like a sieve, which is causing lots of issues. I suspect that there are a bunch of other small things I've done wrong that affect performance as well.

But from what I've seen so far, I don't think a straight port to Metal will automatically answer my performance prayers. I'll have to get it working properly, and then start experimenting with how I might be able to take better advantage of what Metal has to offer. And hopefully the instrumentation/profiling tools will work a lot better to help me with that.

Captain's Log: Stardate 78129.1

I've been continuing to receive a ton of valuable feedback from the pre-alpha. The latest lesson I learned is that OpenCL features vary between macOS versions. I cut a release the other day that worked great for me (I'm on the latest macOS) and immediately broke for everyone using older versions. Thankfully the issue was clear from the logs of one of the users who had the problem.
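
One defensive pattern for this kind of breakage is to check for a capability at runtime before relying on it. The log doesn't say which feature differed, so the following is only a sketch under the assumption that it is advertised as an extension: OpenCL reports extensions as a space-separated string (queried with `clGetDeviceInfo` and `CL_DEVICE_EXTENSIONS`), and the hypothetical helper `hasExtension` tests that string for one name.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if `name` appears as a whole token in a space-separated
// extension list. Matching whole tokens avoids false positives where one
// extension name is a prefix of another.
bool hasExtension(const std::string& extensionList, const std::string& name) {
    std::istringstream tokens(extensionList);
    std::string token;
    while (tokens >> token) {
        if (token == name) return true;
    }
    return false;
}
```

The application could then choose a fallback code path, or log a clear error, when a required name is missing.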

Anyway, the performance optimization I wrote about in the last post went out and has been very successful. My benchmarking shows that it reduced the time spent copying data between the CPU and GPU to a level that's insignificant compared to the time it takes to run the simulation kernel. This is great because it means there's no further optimization to do on the data-copying front. All the remaining work has to be in the kernel itself.
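
The comparison described here comes down to timing the copy and the kernel separately and looking at the ratio. Below is a minimal sketch with hypothetical helper names (`secondsFor`, `copyFraction`). It assumes the timed callables block until the GPU has actually finished, for example by calling `clFinish` on the queue in OpenCL, since otherwise only the enqueue cost is measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Wall-clock seconds taken by one call to `work`.
double secondsFor(const std::function<void()>& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Fraction of the combined time that was spent copying, in [0, 1].
double copyFraction(double copySeconds, double kernelSeconds) {
    assert(copySeconds >= 0.0 && kernelSeconds >= 0.0);
    const double total = copySeconds + kernelSeconds;
    return total > 0.0 ? copySeconds / total : 0.0;
}
```

A copy fraction close to zero is what "insignificant compared to the kernel" looks like in numbers.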

The state of affairs with the kernel is that on Windows, particularly with NVIDIA hardware, it runs great. You can run some very large presets and the latency is pretty predictable. I want to improve things further here, and I believe the eventual CUDA port will help, but performance-wise it's pretty fun. (Stability is still an issue.)

However, on macOS, things are in poorer shape. Despite cutting out the CPU/GPU copy overhead, the kernel is just a lot slower than on Windows. I've done some basic profiling using Instruments, and as best I can tell it is memory-bound, which offhand makes sense, but I think something fishy is happening. As best I could figure, the memory bottleneck was reading links (springs, etc). But in the preset I was benchmarking, the links only took up 11 KB of memory. Originally I imagined that this would fit in L1 cache, but while Apple's documentation is crap, I have found other sources that claim the M1 L1 cache is 8 KB. So that could really be a problem. Still, even smaller presets, with links that would fit in 8 KB, are slower than I'd expect.
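
The cache arithmetic can be written down directly. The sketch below is illustrative only: the `Link` layout and the count of 704 links are assumptions chosen so that the total matches the 11 KB mentioned above, and the 8 KB cache size is the figure quoted from other sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical link record; the real layout is not given in this log.
struct Link {
    std::uint32_t a;     // index of the first connected object
    std::uint32_t b;     // index of the second connected object
    float stiffness;     // spring constant
    float restLength;    // length at which the spring exerts no force
};
static_assert(sizeof(Link) == 16, "sketch assumes a 16-byte link record");

// Bytes touched when reading `count` elements of `bytesPerElement` each.
constexpr std::size_t workingSetBytes(std::size_t count,
                                      std::size_t bytesPerElement) {
    return count * bytesPerElement;
}

// True when the whole working set fits in a cache of `cacheBytes`.
constexpr bool fitsInCache(std::size_t workingSet, std::size_t cacheBytes) {
    return workingSet <= cacheBytes;
}
```

Under these assumptions, 704 links occupy 11,264 bytes (11 KB), which does not fit in an 8 KB cache but would fit in a 16 KB one.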

I also ran some experiments with ripping enough data out of some of the other structures in global memory so that I could reduce alignment from 256 bytes to 128 bytes, which you'd imagine might speed things up a lot. It didn't really do anything. This kind of confirmed that it wasn't these data structures that were problematic, but it left me wondering why not. Of course the performance is ultimately based on the holistic situation with memory reads, but it really seems like links are the biggest issue at the moment.
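
The reason data has to be removed before the alignment can drop is that each record is padded up to a multiple of its alignment. A minimal sketch of that arithmetic follows, with hypothetical helper names and example payload sizes that are not taken from Anukari.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `size` up to the next multiple of `alignment`.
// `alignment` must be a nonzero power of two.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Total bytes for `count` records whose payload is padded to `alignment`.
inline std::size_t paddedFootprint(std::size_t count,
                                   std::size_t payloadBytes,
                                   std::size_t alignment) {
    return count * alignUp(payloadBytes, alignment);
}
```

A 130-byte payload still occupies 256 bytes at either alignment, so nothing is gained until the payload shrinks to 128 bytes or less, at which point the footprint halves.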

So I'm quite confused, and at the moment I am leaning towards starting the work to port the macOS version from OpenCL to Metal, so that I will have total control over what's happening. It seems quite likely to me that Apple's OpenCL implementation is crap, given that they've deprecated it and really don't want folks using it -- why would they continue optimizing it? I don't see Metal as a silver bullet, but at least the profiling tools will work better, and there will no longer be any mysteries about what's happening.

One final note is that newer macOS versions appear to perform a lot better for OpenCL, which is weird and contradicts what I just wrote about Apple not maintaining it. I am curious if this means that the Metal implementation has improved, or what. I guess I'll find out.


© 2024 Anukari LLC, All Rights Reserved