devlog

Way more detail than you ever wanted to know about the development of the Anukari 3D Physics Synthesizer

Captain's Log: Stardate 78146.3

Surprisingly, today I got Anukari running on Metal. It turned out that modifying the OpenCL code so that it could be run via OpenCL or Metal was a lot simpler than I expected. The macros are not all that complicated, and the code is certainly uglier in some places, but for the most part it's not too bad. It took me a while to figure out how Metal does indexing for kernel arguments (device memory and threadgroup memory have different index spaces, for example), but that was the worst of it.
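To make the two index spaces concrete, here's a minimal MSL kernel signature (a sketch only; the argument names are hypothetical, not Anukari's): device buffers bind in the `[[buffer(n)]]` index space via `setBuffer:offset:atIndex:`, while threadgroup allocations live in their own `[[threadgroup(n)]]` index space, sized from the host with `setThreadgroupMemoryLength:atIndex:`.

```metal
#include <metal_stdlib>
using namespace metal;

kernel void simulate(
    device const float3* positions  [[buffer(0)]],      // buffer index space
    device float3*       velocities [[buffer(1)]],
    threadgroup float*   scratch    [[threadgroup(0)]], // separate index space
    uint                 gid        [[thread_position_in_grid]])
{
    // Both index spaces start at 0 and don't collide: buffers 0 and 1 are
    // bound with setBuffer:offset:atIndex:, while the threadgroup region at
    // index 0 is sized with setThreadgroupMemoryLength:atIndex:.
    scratch[0] = 0.0f;
    velocities[gid] = positions[gid];
}
```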

It works well enough to pass basically all of the golden tests, which is very surprising. Actually, it fails a few, but they're the same few that the OpenCL implementation on macOS fails -- for whatever reason they're extra sensitive to whatever differences exist between the M1 chip and my NVIDIA chip on Windows. So from an audio-correctness standpoint, things seem to be working.

I don't yet have a good read on the performance. I slapped the rough-draft implementation together very quickly and haven't yet taken the time to read through the memory allocation/ownership rules that Cocoa / Cocoa Touch use, which means my implementation leaks memory like a sieve, and that is causing lots of issues. I suspect there are a bunch of other small things I've done wrong that affect performance as well.

But from what I've seen so far, I don't think a straight port to Metal will automatically answer my performance prayers. I'll have to get it working properly, and then start experimenting with how I might be able to take better advantage of what Metal has to offer. And hopefully the instrumentation/profiling tools will work a lot better to help me with that.

Captain's Log: Stardate 78143.7

A couple of days ago I started work on the Metal port in earnest. Knock on wood, but I think it might go faster than I originally anticipated.

The first thing I had to do was to split the OpenCL simulator apart into the pieces that could be generalized to any GPU-based backend, and the parts that were OpenCL-specific. Fortunately, I've long known that I'd be doing this, so I had designed things with it in mind. A large chunk of code was already quite general, and the remaining code was fairly easy to cut apart. This part is completely finished -- there is now a GpuSimulatorBase class with two children, OpenCLSimulator and MetalSimulator. The OpenCLSimulator passes all the golden/fuzz tests, so I'm pretty confident that it works.
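The class names above come straight from the split; the methods below are my illustrative guesses at its shape, with a stand-in backend where a real MetalSimulator would talk to MTLDevice / MTLCommandQueue:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the GpuSimulatorBase split; method names are
// illustrative, not the actual Anukari API.
class GpuSimulatorBase {
 public:
  virtual ~GpuSimulatorBase() = default;

  // Backend-agnostic driver: one audio block = upload, launch, download.
  void ProcessBlock() {
    UploadState();
    LaunchKernel();
    DownloadAudio();
  }

 protected:
  // Backend-specific pieces live behind these hooks
  // (OpenCLSimulator / MetalSimulator each override them).
  virtual void UploadState() = 0;
  virtual void LaunchKernel() = 0;
  virtual void DownloadAudio() = 0;
};

// Stand-in backend used only to show the shape of an override.
class FakeSimulator : public GpuSimulatorBase {
 public:
  std::vector<std::string> calls;

 protected:
  void UploadState() override { calls.push_back("upload"); }
  void LaunchKernel() override { calls.push_back("launch"); }
  void DownloadAudio() override { calls.push_back("download"); }
};
```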

For the MetalSimulator, my goal right now is to hack together a minimal working implementation before going on to professionalize it. So far, it appears that I might end up with only about 1k lines of C++ code specific to the Metal simulator, which is way better than I expected. A super-hacky implementation is about half-done. It loads and attempts to compile the OpenCL code, but the kernel arguments aren't all wired up yet, etc.


Now, the OpenCL code is the most interesting bit. That's about 2.5k lines of very carefully written C code, which I really don't want to duplicate. Now that I can see the Metal compiler errors from pretending that it's Metal Shading Language, I'm pretty sure I won't have to duplicate it. I think I'll be able to get away with using some abominable, horrific, dirty macros to make the GPU code compile for OpenCL, Metal, and (later) CUDA.

The differences really aren't that large. The way pointers are declared for private/global/local memory is different. Some of the built-in functions are a little different. And some of the custom syntax for non-C things, like initializing a float3, is different. But so far these all look like things I can macro my way around.
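As a sketch of what those macros might look like (these names are illustrative, not Anukari's actual macros), the address-space qualifiers and float3 construction can be papered over per-backend, with a plain-C fallback so the shared code can also be compiled and tested on the host:

```c
#if defined(__METAL_VERSION__)
  #define GPU_GLOBAL  device
  #define GPU_LOCAL   threadgroup
  #define GPU_PRIVATE thread
  #define MAKE_FLOAT3(x, y, z) float3((x), (y), (z))
#elif defined(__OPENCL_VERSION__)
  #define GPU_GLOBAL  __global
  #define GPU_LOCAL   __local
  #define GPU_PRIVATE __private
  #define MAKE_FLOAT3(x, y, z) (float3)((x), (y), (z))
#else
  /* Host fallback: qualifiers vanish, float3 becomes a plain struct. */
  #define GPU_GLOBAL
  #define GPU_LOCAL
  #define GPU_PRIVATE
  typedef struct { float x, y, z; } float3;
  static inline float3 MAKE_FLOAT3(float x, float y, float z) {
    float3 v = {x, y, z};
    return v;
  }
#endif

/* A shared function written once and compiled for all backends.
   (Hypothetical example, not Anukari's actual spring code.) */
static inline float3 spring_force(GPU_GLOBAL const float3 *pos,
                                  int a, int b, float k) {
  float3 d = MAKE_FLOAT3(pos[b].x - pos[a].x,
                         pos[b].y - pos[a].y,
                         pos[b].z - pos[a].z);
  return MAKE_FLOAT3(k * d.x, k * d.y, k * d.z);
}
```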

I'm really hoping this is possible, because if so, the Metal (and CUDA) port will go much faster than I thought, and also will impose far less ongoing friction from having to maintain all the different GPU platforms, because most of the GPU code will be unified.

Anyway, the question still remains about whether Metal will help with the performance. I'm a bit skeptical, but today I encountered some reason for optimism. This Apple doc about memory types talks about the different kinds of memory mapping. After reading it in detail, I have some speculative ideas on what Apple might be doing in their OpenCL implementation that would not be optimal for Anukari. In particular, I suspect that they're using Shared memory mode for some of the buffers where Managed mode will be much better (or even blitting to Private memory).

But of course I don't know that for certain. At any rate, I'll be very happy to have complete control over the memory mapping, if only to rule it out as a problem. But I'm a bit hopeful that there will be some huge speedups by setting things up in a better way for Anukari's workload.
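If my reading of that doc is right, the host-side control looks roughly like this (an Objective-C sketch; which buffers get which mode is my speculation, not Anukari's actual layout):

```objc
// Small, CPU-written-per-block data: Shared mode keeps it simple.
id<MTLBuffer> params = [device newBufferWithLength:paramsLen
                                           options:MTLResourceStorageModeShared];

// Large, mostly-static data: upload once through a staging buffer and
// blit into Private memory, so the GPU reads it at full speed.
id<MTLBuffer> staging = [device newBufferWithLength:linksLen
                                            options:MTLResourceStorageModeShared];
id<MTLBuffer> links   = [device newBufferWithLength:linksLen
                                            options:MTLResourceStorageModePrivate];

id<MTLBlitCommandEncoder> blit = [commandBuffer blitCommandEncoder];
[blit copyFromBuffer:staging sourceOffset:0
            toBuffer:links destinationOffset:0
                size:linksLen];
[blit endEncoding];
```

OpenCL never exposes this choice at all, which is exactly why having it will help rule the memory mapping in or out as the problem.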

Captain's Log: Stardate 78129.1

I've been continuing to receive a ton of valuable feedback from the pre-alpha. The latest lesson I learned is that OpenCL features vary between macOS versions. I cut a release the other day that worked great for me and immediately broke for everyone on older macOS versions (I'm on the latest). Thankfully the issue was clear from the logs of one of the users who hit it.

Anyway, the performance optimization I wrote about in the last post went out and has been very successful. My benchmarking shows that it reduced the time spent copying data between the CPU and GPU to a level that's insignificant compared to the time it takes to run the simulation kernel. This is great because it means there's no further optimization to do on the data-copying front. All the remaining work has to be in the kernel itself.

The state of affairs with the kernel is that on Windows, particularly with NVIDIA hardware, it runs great. You can run some very large presets (example) and the latency is pretty predictable. I want to improve things further here, and I believe the eventual CUDA port will help, but performance-wise it's pretty fun. (Stability still is an issue.)

However, on macOS, things are in poorer shape. Despite cutting out the data-copying latency, the kernel is just a lot slower than on Windows. I've done some basic profiling using Instruments, and the best I can tell is that the kernel is memory-bound, which offhand makes sense, but I think something fishy is going on. As best I could figure, the memory bottleneck was reading links (springs, etc.). But in the preset I was benchmarking, the links only took up 11 KB of memory. Originally I imagined that this would fit in L1 cache, but while Apple's documentation is crap, I have found other sources that claim the M1's L1 cache is 8 KB. So that could really be a problem. Still, even smaller presets, with links that would fit in 8 KB, are slower than I'd expect.

I also ran some experiments with ripping enough data out of some of the other structures in global memory so that I could reduce alignment from 256 bytes to 128 bytes, which you'd imagine might speed things up a lot. It didn't really do anything. This kind of confirmed that it wasn't these data structures that were problematic, but it left me wondering why not. Of course the performance is ultimately based on the holistic situation with memory reads, but it really seems like links are the biggest issue at the moment.
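For a sense of what that alignment change buys on paper (a hypothetical layout; Anukari's actual structures are different), halving the alignment halves the padded footprint of each element:

```c
#include <stdalign.h>

/* 192 bytes of payload, padded out to the 256-byte alignment boundary. */
typedef struct {
  alignas(256) float state[48];
} EntityFat;

/* After ripping fields out: 96 bytes of payload, padded to 128 bytes. */
typedef struct {
  alignas(128) float state[24];
} EntitySlim;
```

Halving per-element traffic like this "should" matter for a memory-bound kernel, which is why the null result pointed away from these structures.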

So I'm quite confused, and at the moment I am leaning towards starting the work to port the macOS version to Metal instead of OpenCL, so that I will have total control over what's happening. It seems quite likely to me that Apple's OpenCL implementation is crap, given that they've deprecated it and really don't want folks using it -- why would they continue optimizing it? I don't see Metal as a silver bullet, but at least the profiling tools will work better, and there will no longer be any mysteries about what's happening.

One final note: it appears that newer macOS versions perform a lot better with OpenCL, which is weird and contradicts what I just wrote about Apple not maintaining it. I'm curious whether this means the underlying Metal implementation has improved, or what. I guess I'll find out.


© 2024 Anukari LLC, All Rights Reserved