devlog > metal
Captain's Log: Stardate 78195.4
Today I finished tidying up a few loose ends from the work I did to allow multiple simulation backends (OpenCL, Metal, eventually CUDA). The main thing here was to parameterize some of the unit tests, such as the fuzz test, so that they would run against all available backends on each OS. I haven't parameterized the golden tests yet, but that's something I'll definitely do at some point.
After that, I continued work on optimizing the Metal backend. I have some changes that look fairly promising when I run isolated benchmarks, but then when running the full app the performance gains don't appear. This is interesting.
Right now my best guess for what's going on is that the macOS OpenGL implementation is doing weird/bad stuff behind the scenes. On Windows I've established that the 3D graphics don't interfere in any measurable way with the audio thread's use of the GPU. But on macOS there does seem to be interference, and it's not related to how much computation is happening -- the interference appears to be there even if Anukari doesn't actually draw any pixels. This is what makes me think that Apple's OpenGL implementation is bad.
So I'd like to rule out weird OpenGL issues as the cause of the macOS slowness. Since I eventually need to port the graphics to Metal anyway, I am going to begin that work now. There's no guarantee it helps with audio performance, but it might, and I have to do it regardless. Thus today I began integrating the Google Filament library that I'm planning to use for cross-platform graphics.
Captain's Log: Stardate 78146.3
Surprisingly, today I got Anukari running on Metal. It turned out that modifying the OpenCL code so that it could be run via OpenCL or Metal was a lot simpler than I expected. The macros are not all that complicated, and the code is certainly uglier in some places, but for the most part it's not too bad. It took me a while to figure out how Metal does indexing for kernel arguments (device memory and threadgroup memory have different index spaces, for example), but that was the worst of it.
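The separate index spaces work roughly like this (a sketch, not Anukari's actual kernel): device buffers are numbered with `[[buffer(n)]]` and threadgroup allocations with `[[threadgroup(n)]]`, and each table is set from the host through its own encoder call.

```metal
// Sketch of Metal's per-type argument index spaces. buffer(0) and
// threadgroup(0) refer to entirely different tables, so the indices
// do not collide even though both are zero.
kernel void step(device const float* positions [[buffer(0)]],
                 device float* velocities      [[buffer(1)]],
                 threadgroup float* scratch    [[threadgroup(0)]],
                 uint gid [[thread_position_in_grid]]) {
  scratch[0] = positions[gid];  // threadgroup memory, index space B
  velocities[gid] += scratch[0];
}
```

On the host side the two tables are also populated through different calls: `setBuffer:offset:atIndex:` for the device buffers versus `setThreadgroupMemoryLength:atIndex:` for the threadgroup allocations, each with its own independent index sequence.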
It works well enough to pass basically all of the golden tests, which is very surprising. Actually it fails a few, but they're the same few that the OpenCL implementation on macOS fails -- for whatever reason they are extra sensitive to whatever differences exist between the M1 chip and my NVIDIA chip on Windows. So from an audio correctness standpoint, things seem to be working.
I don't yet have a good read on the performance. I slapped the rough draft implementation together very quickly, and haven't yet taken the time to read through the memory allocation/ownership rules that Cocoa / Cocoa Touch use, which means that my implementation leaks memory like a sieve, causing lots of issues. I suspect there are a bunch of other small things I've done wrong that affect performance as well.
But from what I've seen so far, I don't think a straight port to Metal will automatically answer my performance prayers. I'll have to get it working properly, and then start experimenting with how I might be able to take better advantage of what Metal has to offer. And hopefully the instrumentation/profiling tools will work a lot better to help me with that.
Captain's Log: Stardate 78143.7
A couple of days ago I started work on the Metal port in earnest. Knock on wood, but I think it might go faster than I originally anticipated.
The first thing I had to do was split the OpenCL simulator apart into the pieces that could be generalized to any GPU-based backend and the parts that were OpenCL-specific. Fortunately, I've long known that I'd be doing this, so I had designed things with it in mind. A large chunk of code was already general, and the rest was fairly easy to cut apart. This is completely finished -- there is now a GpuSimulatorBase class with two children, OpenCLSimulator and MetalSimulator. The OpenCLSimulator passes all the golden/fuzz tests, so I'm pretty confident that it works.
For the MetalSimulator, my goal right now is to hack together a minimal working implementation before going on to professionalize it. So far, it appears that I might end up with only 1k lines of C++ code specific to the Metal simulator, which is way better than I expected. A super-hacky implementation is about half-done. It loads and attempts to compile the OpenCL code, but the kernel arguments aren't all wired up yet, etc.
Now the OpenCL code is the most interesting bit. That's about 2.5k lines of very, very carefully written C code, which I really don't want to duplicate. Now that I can see the Metal compiler errors from pretending it's Metal Shading Language, I am pretty sure that I am not going to have to duplicate it. I think that I will be able to get away with using some abominable, horrific, dirty macros to make the GPU code compile for OpenCL, Metal, and (later) CUDA.
The differences really aren't that large. The way that pointers are declared for private/global/local memory is different. Some of the built-in functions are a little different. Some of the custom syntax for non-C things like initializing a float3 are different. But so far these all look like things that I can macro my way around.
I'm really hoping this is possible, because if so, the Metal (and CUDA) port will go much faster than I thought, and also will impose far less ongoing friction from having to maintain all the different GPU platforms, because most of the GPU code will be unified.
Anyway, the question still remains about whether Metal will help with the performance. I'm a bit skeptical, but today I encountered some reason for optimism. This Apple doc about memory types talks about the different kinds of memory mapping. After reading it in detail, I have some speculative ideas on what Apple might be doing in their OpenCL implementation that would not be optimal for Anukari. In particular, I suspect that they're using Shared memory mode for some of the buffers where Managed mode will be much better (or even blitting to Private memory).
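For reference, explicit storage-mode control looks roughly like this. The API calls are real Metal, but the buffer roles and sizes (`Params`, `stateBytes`, `scratchBytes`) are illustrative assumptions about Anukari's workload, not its actual buffers.

```objc
// Sketch of choosing Metal storage modes per buffer role.
// Shared: one CPU+GPU-coherent allocation. Managed: separate CPU/GPU
// copies with explicit sync. Private: GPU-only, best for data the CPU
// never touches.
id<MTLDevice> device = MTLCreateSystemDefaultDevice();

// Small parameters the CPU rewrites every audio block: Shared is simple.
id<MTLBuffer> params =
    [device newBufferWithLength:sizeof(Params)
                        options:MTLResourceStorageModeShared];

// Larger state the CPU only occasionally edits: Managed, with an explicit
// didModifyRange: after each CPU-side write to flush it to the GPU copy.
id<MTLBuffer> state =
    [device newBufferWithLength:stateBytes
                        options:MTLResourceStorageModeManaged];
[state didModifyRange:NSMakeRange(0, stateBytes)];

// GPU-only scratch: Private (populated, if needed, via a blit encoder).
id<MTLBuffer> scratch =
    [device newBufferWithLength:scratchBytes
                        options:MTLResourceStorageModePrivate];
```

If the OpenCL implementation is indeed defaulting everything to something like Shared, moving the GPU-only buffers to Private alone could matter.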
But of course I don't know that for certain. At any rate, I'll be very happy to have complete control over the memory mapping, if only to rule it out as a problem. But I'm a bit hopeful that there will be some huge speedups by setting things up in a better way for Anukari's workload.