Captain's Log: Stardate 78143.7

A couple of days ago I started work on the Metal port in earnest. Knock on wood, but I think it might go faster than I originally anticipated.

The first thing I had to do was to split the OpenCL simulator apart into the pieces that could be generalized to any GPU-based backend and the parts that were OpenCL-specific. Fortunately, I've long known that I'd be doing this, so I had designed things with it in mind. A large chunk of the code was already general, and the rest was fairly easy to cut apart. This is completely finished -- there is now a GpuSimulatorBase class with two children, OpenCLSimulator and MetalSimulator. The OpenCLSimulator passes all the golden/fuzz tests, so I'm pretty confident that it works.
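To make the shape of the split concrete, here's a rough sketch of how I think about it. The class names are the real ones; the methods are made up purely for illustration and aren't Anukari's actual interface:

```cpp
// Sketch only: the base class owns the backend-agnostic flow, and each
// subclass implements just the API-specific pieces.
class GpuSimulatorBase {
 public:
  virtual ~GpuSimulatorBase() = default;

  // Backend-agnostic entry point shared by all GPU simulators.
  void RunBlock(const float* input, float* output, int numFrames) {
    UploadInputs(input, numFrames);   // backend-specific
    DispatchKernels(numFrames);       // backend-specific
    DownloadOutputs(output, numFrames);
  }

 protected:
  virtual void UploadInputs(const float* input, int numFrames) = 0;
  virtual void DispatchKernels(int numFrames) = 0;
  virtual void DownloadOutputs(float* output, int numFrames) = 0;
};

class OpenCLSimulator : public GpuSimulatorBase { /* cl_command_queue, etc. */ };
class MetalSimulator  : public GpuSimulatorBase { /* MTL::CommandQueue, etc. */ };
```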

For the MetalSimulator, my goal right now is to hack together a minimal working implementation before going on to professionalize it. So far it looks like I might end up with only about 1k lines of C++ code specific to the Metal simulator, which is way better than I expected. A super-hacky implementation is about half done: it loads and attempts to compile the OpenCL kernel code, but the kernel arguments aren't all wired up yet, etc.
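For the curious, the "loads and attempts to compile" step looks roughly like the following. This is just a sketch using Apple's metal-cpp wrappers, not Anukari's actual plumbing, and the function name is invented:

```cpp
#include <cstdio>
#include <Metal/Metal.hpp>  // Apple's metal-cpp C++ wrappers (assumed here)

// Feed kernel source to the Metal compiler at runtime and surface its errors.
MTL::Library* CompileKernelSource(MTL::Device* device, const char* source) {
  NS::Error* error = nullptr;
  NS::String* src =
      NS::String::string(source, NS::StringEncoding::UTF8StringEncoding);
  MTL::CompileOptions* options = MTL::CompileOptions::alloc()->init();

  MTL::Library* library = device->newLibrary(src, options, &error);
  if (!library && error) {
    // This is where the compiler errors from the OpenCL-flavored source show up.
    std::fprintf(stderr, "Metal compile failed: %s\n",
                 error->localizedDescription()->utf8String());
  }
  options->release();
  return library;  // nullptr on failure
}
```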

The OpenCL code is the most interesting bit. It's about 2.5k lines of very, very carefully written C code, which I really don't want to duplicate. Now that I can see the Metal compiler errors from pretending it's Metal Shading Language, I'm pretty sure I won't have to. I think I'll be able to get away with using some abominable, horrific, dirty macros to make the GPU code compile for OpenCL, Metal, and (later) CUDA.

The differences really aren't that large. The way that pointers are declared for private/global/local memory is different. Some of the built-in functions are a little different. Some of the custom syntax for non-C things, like initializing a float3, is different. But so far these all look like things that I can macro my way around.
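To give a flavor of what I mean, something like the following shared header. The macro names are invented for illustration; the qualifier keywords on each side are the real ones for each dialect:

```cpp
// Hypothetical sketch of the "abominable macros" idea: map each platform's
// dialect onto one neutral spelling that the shared kernel code uses.
#if defined(ANUKARI_BACKEND_METAL)
  #define GLOBAL_PTR   device       // Metal address-space qualifiers
  #define LOCAL_PTR    threadgroup
  #define PRIVATE_PTR  thread
  #define MAKE_FLOAT3(x, y, z)  float3((x), (y), (z))
#elif defined(ANUKARI_BACKEND_OPENCL)
  #define GLOBAL_PTR   __global     // OpenCL C address-space qualifiers
  #define LOCAL_PTR    __local
  #define PRIVATE_PTR  __private
  #define MAKE_FLOAT3(x, y, z)  (float3)((x), (y), (z))
#endif

// The shared GPU code would then be written once in the neutral dialect:
//   void integrate(GLOBAL_PTR float* positions /* ... */) {
//     float3 gravity = MAKE_FLOAT3(0.0f, -9.8f, 0.0f);
//     // ...
//   }
```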

I'm really hoping this is possible, because if so, the Metal (and CUDA) port will go much faster than I thought, and it will also impose far less ongoing friction from maintaining the different GPU platforms, since most of the GPU code will be unified.

Anyway, the question still remains whether Metal will actually help with performance. I'm a bit skeptical, but today I encountered some reason for optimism. This Apple doc about memory types talks about the different kinds of memory mapping. After reading it in detail, I have some speculative ideas about what Apple might be doing in their OpenCL implementation that would not be optimal for Anukari. In particular, I suspect they're using Shared storage mode for some of the buffers where Managed mode would be much better (or even blitting to Private memory).

Of course, I don't know that for certain. At any rate, I'll be very happy to have complete control over the memory mapping, if only to rule it out as a problem. And I'm a bit hopeful that there will be some huge speedups from setting things up in a way that's better suited to Anukari's workload.
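As a sketch of the kind of control I mean (again using the metal-cpp wrappers, with invented function names), choosing Managed mode for a buffer and explicitly syncing CPU writes looks something like this:

```cpp
#include <cstring>
#include <Metal/Metal.hpp>

// Pick the storage mode per buffer instead of taking whatever the OpenCL
// driver decides. Managed keeps a host-side copy that is explicitly synced
// to the GPU copy, rather than one cache-coherent shared allocation.
MTL::Buffer* MakeParamBuffer(MTL::Device* device, size_t bytes) {
  return device->newBuffer(bytes, MTL::ResourceStorageModeManaged);
}

void UpdateParams(MTL::Buffer* buffer, const void* src, size_t bytes) {
  std::memcpy(buffer->contents(), src, bytes);
  // Tell Metal which range the CPU touched so it can sync it to the GPU.
  buffer->didModifyRange(NS::Range::Make(0, bytes));
}
```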

