devlog

Way more detail than you ever wanted to know about the development of the Anukari 3D Physics Synthesizer [see archive]

Starting to micro-optimize, and 3D artwork

Captain's Log: Stardate 78310.9

First things first

Before doing any kind of subtle optimization, the first thing is of course to make sure that you can actually measure whether your so-called "optimizations" make things faster. Often things that seem like they should be faster either don't do anything, or even make things worse.

So to start, I added a simple timing framework to my goldens tests, which exercise every physics feature that Anukari supports. Now when I run those 70-or-so tests, each one computes a timing benchmark and stores it in a CSV file along with the current time, the GPU backend that was used for the simulation, some machine ID info, etc.

This gives me the ability to graph the overall change in runtime across all the golden tests, as well as look at each individual test to see which ones were sped up or slowed down.

I decided for now to just check this CSV file into git. This is really simple, but kind of cool: each time I make a micro-optimization, I commit the code along with the updated profiling stats in the CSV file. So I have a nice history of how much impact each code change had.

Micro-optimizing

While Anukari's physics simulation runs to my satisfaction on Windows (with a decent graphics card), it is still dismayingly slow on MacOS. It works for small presets, but there's little margin, and for example in Garage Band you get buffer overruns all the time (not sure why yet, but Garage Band seems to have a lot more overhead compared to other DAWs).

One huge advantage of porting the physics to Metal, aside from the speedups I've already seen from optimizing the way the memory buffers are configured, is that I can get meaningful results out of Apple's Metal GPU profiling tools. The most immediate finding is that the simulation is NOT primarily memory-bound, and rather the long pole is ALU saturation. This is pretty surprising, but I did a lot of work early on to make the memory layout efficient and compact, so it is believable. And I've also added a lot of physics features that require computation, so it checks out.

Thus I've begun micro-optimizing the code. This is a little frustrating, as Apple has a tendency to put all their good Metal information in videos instead of documents, and I hate videos for technical information like this. Fortunately they have transcripts, but that's not perfect, as they're obviously machine-generated, and also the speakers are often referencing slides I can't see. But as disorganized and poorly-formatted as Apple's guidance here is, they do provide some juicy nuggets.

One thing I'm doing is looking for places where I can replace floats with half-floats, because the ALU has double the bandwidth for halfs. That's pretty obvious, but it's tricky because for most physics features, halfs have way too little precision. But there are opportunities, for example with the envelope attack duration, I'm pretty sure half precision is still far beyond human perception.

Another thing I'm doing is trying to figure out how to get rid of 64-bit longs from the computations. I don't use them much, but the global sample number is a long (32 bits worth of samples is less than 1 day, so it could easily overflow). I also use longs for my GPU-based PRNG. In both cases I think it will be possible to avoid them, but this will require some thought.

There are a ton of other little things, like reordering struct fields so that Apple's compiler can automatically vectorize the loads, getting rid of a pesky int mod operation which can take hundreds of cycles, and switching array indexes from unsigned to signed ints. The latter is quite interesting, apparently Apple's fastest memory instructions depend on the guarantee that the offset can be represented by a signed int, and if you use an unsigned int the compiler can't make that guarantee and has to fall back to something slower.

3D Artwork

I've started to reach out to a few 3D artists about redoing all the 3D models for the various instrument components. I'm looking forward to getting professional models, as I think it will make the whole plugin much more beautiful and interesting to look at.

This will also be an opportunity to change things so that the animations are part of the 3D models themselves (using bone skinning) instead of hard-coded into the Anukari engine. This will be an advantage when users start creating (or commissioning) their own 3D models for the engine: they won't be limited to the built-in animations. They can do anything they want.

Anyway after talking to some artists, I quickly realized that I have vastly under-specified what I actually want things to look like. So I've spent a fair bit of time working on a visual style document, where I include a bunch of reference images for what I want each thing to look like, along with commentary on what I'm going for, etc. In some cases I'm hand-drawing the weird stuff I want because I can't find anything similar online.

This is a lot of work, but it's really fun because I'm finally having to think about exactly what I want things to look like. Today, the MIDI automations, LFOs, etc, are all super simple and most of them are copied from the same knob asset. But my plan is to make every single kind of automation a unique 3D thing, hopefully with a visual and animation that conveys its purpose, at least to the extent that I can imagine ways to do so.

by Evan at 11/5/2024, 3:10:32 AMoptimization gpu graphics

newer postarchiveolder post