Way more detail than you ever wanted to know about the development of the Anukari 3D Physics Synthesizer

Railway.com knows better than you

Captain's Log: Stardate 79429.5

[EDIT1] Railway responded to this post in an HN comment. Quoting from their reply, "I wished the majority of our userbase knew better than us, but the reality is they don't."

[EDIT2] Railway's founder replied in a separate HN comment.

[EDIT3] Railway added an option to bypass the checks.

My honeymoon period with Railway.com for hosting this website is over.

I recently ran into a bug in my own code that caused some data corruption in my DB. The consequences of the bug were that a user who purchased Anukari could get emailed a license key that had already been sent to another user. This is really bad! Fortunately I noticed it quickly, and the fix was simple. I got the code prepared and went to do an emergency release to push out the fix.

It's worth noting that the bug had been around for a while, but it was latent until a couple days ago. So the fix was not a rollback, which would typically be the ideal way to fix a production issue quickly. The only solution was to roll forward with a patch.

Imagine my surprise when, during my emergency rollout, the release failed with:

==================================================================
SECURITY VULNERABILITIES DETECTED
==================================================================
Railway cannot proceed with deployment due to security
vulnerabilities in your project's dependencies.

Keep in mind that I'm super stressed about the fact that there is a severe and urgent problem with my app in production, which is causing me problems in real time. I've dropped everything else I was doing, and have scrambled to slap together a safe fix. I want to go to bed.

And my hosting provider is saying, "you are not allowed to push out your urgent fix, because we see that your app contains a far less urgent problem." There is no button that says "I understand, proceed anyway." Railway knows best.

I'll get back to the actual security vulnerabilities Railway detected, but let's ignore that for the moment and talk about the fact that Railway has just intentionally blocked me from pushing out a release, based on their assessment of the security risk. But how can that possibly make sense? Railway cannot know how urgent my release is. Railway cannot know whether the security vulnerabilities they detected in my app are even exploitable. They're just looking at node package versions marked as vulnerable.

The most ridiculous part of this is that the current production version of my app that I was trying to replace depended on the same vulnerable package versions as the patched version I was trying to push out to fix my urgent bug. So in fact the release added zero additional security risk to what was already serving.

Okay so what were the security vulnerabilities that were so dangerous that Railway.com's nanny system engaged to block me from pushing out an urgent fix? They cited two HIGH risk CVEs: this one and this one. They also cited a MEDIUM risk CVE but I'm not going to cover that.

The HIGH risk CVEs are both DOS issues. Attackers can craft payloads that send the server into an infinite loop, sucking up CPU and denying service.

Do I want to fix those CVEs for my service? Absolutely! Would I like my hosting provider to shoot me an email if they detect that my app has those vulnerabilities? Yes, please!

Do I want to be blocked from pushing urgent fixes to my app when there's zero evidence that either of those CVEs are being exploited for my app in any way? I think not.

I want to push out my fix, go to bed, and then come back the next day and upgrade package versions and fix the CVEs.

Now, look, was it a huge deal to upgrade those packages? Not really. In this case I was lucky that it was a straightforward upgrade. But anyone who's used a package manager knows about dependency hell, where trying to upgrade one package leads to a cascade of other dependencies needing to be updated, possibly with conflicts in those dependencies. And even if there is no dependency hell, any time package versions are changed, some degree of testing is warranted to make sure nothing was broken.

These are things that I did not want to even remotely think about during an urgent incident late in the evening, especially not for package versions that were already live in production. It took a stressful experience and added a huge amount of additional frustration.

Railway sometimes uses the mottos "Ship without friction" and "Let the engineers ship." This felt like friction. I was literally not allowed to ship.

I complained about this to Railway, and they basically said they need to do this to protect neighboring apps from problems caused by my app. I was under the impression that this was the purpose of containers, but what do I know.

I'm guessing Railway is having issues with apps on their free tier wasting CPU resources, costing them money, and has found that some free-tier apps succumb to these infinite-loop CVEs, wasting free vCPUs. But I'm not on the free tier -- I pay for what I use.

I've been a huge fan of Railway for hosting this site. It really is simple and fast, and that's been perfect for such a simple site. I don't need a lot of fancy features, and Railway has let me deploy my code easily without worrying about the details. But this security overreach gives me pause, so this is a great time to look at other hosting providers.

I talked to the founder of Railway's competitor Northflank about this issue, and his reply was, "We would never block you from deploying if we can build or access a valid container image." To me that's the right answer.

Bird's eye view

When I worked at Google, for a long time I was involved with production reliability/safety. I wasn't an SRE, but I worked on services that were in the critical path for ads revenue. Incidents often cost us untold millions of dollars, so reliability was a huge concern.

I wrote and reviewed many dozens of postmortem analyses, and I learned a lot about the typical ways that systems fail. I was in 24x7 oncall rotations for my entire career at Google, and during the latter years I was oncall as an Incident Commander inside ads, running the responses to (very) large incidents.

Google's ads systems were mostly pretty mature, and contained lots of safety checks, etc. I always found it really fascinating to see the ways that a serious outage would slip through all the defenses.

One pattern I came to recognize was that when deciding on the action items / follow-ups for an incident postmortem, the go-to reaction was always to add another safety check of some kind. This makes sense! Something went wrong, and it slipped past our defenses, so more defenses should help.

However I came to view adding more and more defenses as an anti-pattern. There's a baseline level of safety checks that makes a system safer, but if you are not careful, adding more safety checks actually increases risk, because the safety checks themselves start to be the cause of new problems. There's a great word for this, by the way: iatrogenesis.

I'll give an example. I worked in (anti) ad fraud, and we had systems that would block ad requests that looked fraudulent or scammy. Bugs in these systems were extraordinarily dangerous, because in the worst case we could turn off all advertising revenue globally. (Yes, that is a thing that happened. More than once. I have stories. Buy me a beer sometime...)

These systems accumulated many safety checks over the years. A typical thing would be to monitor the amount of revenue being blocked for fraud reasons, and alert a human if that amount grew too quickly.

But in some cases, these safety checks were allowed to automatically disable the system as a fail-safe, if things looked bad enough. The problem is... there's not really any such thing as "failing safe" here. For a spam filter, failing closed means revenue loss, and failing open means allowing fraud to get through.

So imagine that a script kiddie sits down and writes a bash loop to wget an ad click URL 10,000 times per second. The spam filtering systems trivially detect this as bot traffic, and filter it out so that advertisers aren't paying for these supposed clicks.

But maybe some safety check system sees that this trivial spam filter has suddenly started blocking hundreds of thousands of dollars of revenue. That's crazy! The filter must have gone haywire! We better automatically disable the filter to stop the bleeding!

Uh oh. Now those script kiddie requests are being treated as real ad traffic and billed to advertisers. We have ourselves here a major incident, caused by the safety check that was meant to protect us, created as the result of a previous postmortem analysis.

I'm not saying that all safety checks are bad. You certainly need some. But what I am saying is that if you allow extra safety checks to become the knee-jerk reaction to incidents, you will eventually end up with a complex mess where the safety checks themselves become the source of future incidents. You build a Rube Goldberg Safety Machine, but it doesn't work. The marble flies off the rails.

It's really attractive to just keep adding safety checks, but what is actually required is to take a step back, look at the system holistically, and think about simple ways to make it safer. Generally after a few basic safety checks are in place, further safety comes from redesigning the system itself. Or even better, rethinking the problem that you're trying to solve altogether. Maybe there's a completely different approach that side-steps an entire category of reliability issues.

Anukari on the CPU (part 3: in retrospect)

Captain's Log: Stardate 79350.3

In part 1 of this series of posts, I explained the shortcomings in Anukari’s GPU implementation, and in part 2, I covered how the new CPU implementation manages to outperform the GPU version.

Prior to this change, I had invested a significant amount of time and effort on the GPU implementation (as the back catalogue of this devlog demonstrates). Obviously I would have preferred to find the improved CPU solution earlier, before putting so much effort into the GPU solution. In this 3rd and final installment I will reflect upon what went well, and what I could have done better.

The (unachieved) goal of 50,000 physics objects

As I discussed in part 1, my first prototype used the CPU. It was with this prototype that I “proved” to myself that to simulate the target number of physics objects that I wanted (50,000), the CPU was not an option.

I still think that this is true. Actually I’m even more confident, having now written a heavily-optimized CPU simulator. But where I went wrong was in my assumption about it being important to simulate 50,000 objects.

In my initial testing, I found that larger physics systems produced some really interesting results. I can’t quite recall how big of a system I had tested, but surely it was only a few hundred masses, maybe a thousand objects counting the springs.

From this I got excited and extrapolated out a bit too far. If 1,000 objects sound cool, 50,000 must sound even cooler, right?

Now, given that I’m currently writing about the avoidance of unfounded assumptions, I’d better not rule out the idea that 50,000 objects is really great. Maybe it is! But I never proved that. What I did know was that simulating 1,000 objects had excellent results.

Maybe I could have saved myself a lot of grief if I had questioned this assumption before building out all the GPU code.

On the other hand, I do want to give myself credit for being ambitious by going for the 50,000 goal. So I am not sure that I’ll criticize myself too much for going ahead with the GPU implementation. Overall I’d rather err on the side of being too ambitious than the opposite.

I think my real self-critique is not that I embarked on the GPU road, but that I stayed on it too long. When I first realized that I was not going to achieve my goal of 50,000 objects, that would have been a great time to take a step back and reevaluate whether the GPU was necessary.

At Google we often loosely used the rule of thumb that a 10x increase in system workload was about the time when you had to start thinking about redesigning a system (rather than incrementally optimizing it).

So changing my design goal by a factor of 50x should have been an obvious signal that maybe a different, simpler design was worth evaluating!

That said, at this point in the project I was not yet aware of how much of a headache using the GPU was going to be. So maybe I would have charged forth with the GPU anyway.

With these reflections out of the way, the reality of course is that I did eventually reevaluate my options and found a better solution. I do wish I had done this earlier, but I am proud of how quickly I changed horses once I began to realize that the one I was on was not optimal. I did not allow myself to succumb to the sunk cost fallacy, which is something I find personally challenging. So I’m happy about that.

And overall, of course, I’m pleased to have found a solution that works significantly better for my users, which is what’s really important to me.

On the drawbacks of using the CPU

The new CPU simulation is way faster and simpler than the GPU implementation, but that doesn’t mean there aren’t disadvantages.

One drawback to the change is that it might make me look foolish. I’m not too worried about this, though. I’d rather eat some crow and have a plugin that works really well than the alternative.

Another drawback is that the GPU support was potentially good from a marketing perspective. Using the GPU is unusual, and stands out as something interesting, like “alien technology.” Again, I am not going to lose any sleep over this. I didn’t write GPU code to make the plugin marketable. I used the GPU because at the time I thought it was the most effective way to make the plugin work really well.

The engineering drawbacks are more interesting. One advantage of the GPU is that it is mostly untapped processing power (for audio). So a user with a super CPU-hungry audio production setup might be able to run a GPU-based plugin, taking advantage of that extra processing capacity that would normally go unused.

This is in fact a great reason for plugin manufacturers to exploit the GPU. But of course this only makes sense if a plugin can be made to work really well on the GPU. I do not doubt that this is true for other plugins, but as I have written about at great length, for now Anukari runs much better on the CPU.

The way I see it, a plugin that works great but uses more CPU resources is always better than a plugin that works poorly and glitches but consumes less CPU. For me, Anukari’s usability trumps everything else. If users can’t reliably run interesting presets without glitching, nothing else matters.

That said, I do feel bad for anyone whom I’ve disappointed with this change. It’s completely understandable that someone may have been excited by the GPU support (I sure was!), for any number of reasons. I do not enjoy letting anyone down.

But it’s not like I didn’t try to make the GPU support work! I invested many hundreds, if not thousands of hours into that approach. This devlog attests to that.

I also have to think about those users whom I’d disappoint if I didn’t improve Anukari’s performance. Many people’s machines could not run Anukari at all, and now they can. Many users who had glitching in Logic Pro before now find that the plugin runs flawlessly. VJs that run GPU-intensive visualization software can now run Anukari. The #1 complaint I’ve had since the start of the Beta is performance, and that is way less of an issue now.

Benefits of using the CPU aside from performance

Recall that for GPU support I was maintaining three backends: CUDA, Metal, and OpenCL. (Arguably I should also have supported AMD’s ROCm, but it’s a huge mess. That’s a whole other story.)

Now I’m just maintaining the one CPU backend. Granted, there is separate hand-written AVX and NEON code to support the two platforms, but this is limited to the hottest loops, and is not a big deal. AVX and NEON are far more similar to one another than the various GPU hardware and APIs are.

Overall the new simulator is vastly simpler than the old one. There’s much less indirection, since instead of orchestrating GPU kernels (including setting up buffers, managing data copies, etc) the code simply does the work directly.

This means that adding new features to the physics simulation is going to be substantially easier. This is exciting, because I have a gigantic spreadsheet of new features I’d like to add.

There’s also the fact that I’ve just gotten back all of the future opportunity cost of dealing with GPU issues. In other words, instead of spending hundreds more hours on GPU compatibility problems, those hours will now go into new physics features, UX improvements, etc.

Also from a reliability perspective, the CPU code is far easier to test, profile, and debug. I can throw in log statements wherever I like. I can use the same instrumentation, profiler, and debugger as for the rest of the app. Unit tests are much simpler to write without having to call into the GPU code. And manual testing prior to releases no longer requires going through quite so big of a stack of laptops.

Keep in mind that I was previously spending time on horrific GPU issues like this AMD driver bug which was specific to just one AMD graphics chip. Now I have that time back.

I am really looking forward to making Anukari an even more interesting and useful sound design tool now that my time has been freed up to do so.

How tests helped make the GPU to CPU change possible

One thing that helped a lot in rewriting the physics simulation was the existing body of tests. Since the prior GPU implementation supported multiple backends, all the tests were already backend-agnostic. These tests are extremely thorough, so while writing the CPU backend, I was basically doing test-driven development.

I can’t imagine how much harder this would have been without the tests. They caught way more issues than I can count. And once I had them all passing, it gave me enormous confidence that the new simulation was working correctly.

Far and away the most useful tests were my golden tests. At the moment I have close to 150 tests that each load an Anukari preset, feed in some MIDI/audio data, and generate a short audio clip. This clip is fingerprinted and compared to a “golden” clip. Most of the golden tests isolate individual physics features to prove that they work, but there are a few tests that run large, complex presets to prove that interactions across features work as well.
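
To make that concrete, here is a rough conceptual sketch of what one golden test does. All of the names here are hypothetical stand-ins, not the real Anukari test API:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real preset loader, renderer, and
// fingerprint helpers.
struct Preset;
Preset* LoadPreset(const std::string& path);
std::vector<float> RenderClip(const Preset& preset, const std::string& midi_path,
                              float seconds);
std::uint64_t Fingerprint(const std::vector<float>& audio);
std::uint64_t LoadGoldenFingerprint(const std::string& name);

// One golden test: render a short clip from a preset, fingerprint it, and
// compare against the stored golden fingerprint.
bool GoldenTestPasses(const std::string& preset_path,
                      const std::string& midi_path,
                      const std::string& golden_name) {
  Preset* preset = LoadPreset(preset_path);
  const std::vector<float> clip =
      RenderClip(*preset, midi_path, /*seconds=*/2.0f);
  return Fingerprint(clip) == LoadGoldenFingerprint(golden_name);
}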

Once the golden tests were passing, I was completely sure that I had implemented all the physics features correctly.

Correct audio output is obviously important, but stability is just as important. The CPU code is heavily-optimized and does a lot of low-level memory access stuff where it’s easy to screw up and crash by accessing the wrong memory.

For this, I mostly relied on fuzz testing and the chaos monkey.

For the fuzz tests, I have a testing API that can generate random mutations to an Anukari preset. The API is, in principle, capable of generating any Anukari preset. When I’m adding a new physics parameter, one of the very first things on my checklist is to update the fuzz API to be aware of it.

I use this API in a number of fuzz tests. The basic pattern is a loop that makes a random preset mutation, and then checks an invariant. Doing this tens of thousands of times tends to catch all kinds of bugs, especially because the random presets it generates are just awful, twisted messes. When randomly setting parameter values, the fuzz API has a slight preference for picking values at the extremes of what Anukari supports, so it ends up generating super weird presets that a human would never create.

One of the fuzz tests generates presets in this way and then simulates them, verifying that the simulation’s output is reasonable (i.e. no NaN/inf, etc). And of course these tests verify that no inputs can cause the simulator to crash. Running the fuzz test loop for a long enough time with no crashes gave me huge confidence that I had worked out all of the crash bugs in the new simulator.
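
Conceptually, that fuzz test is just a loop like this sketch (again, the function names are hypothetical stand-ins, not the real testing API):

// Hypothetical stand-ins for the real fuzzing API.
struct Preset;
Preset* MakeEmptyPreset();
void MutateRandomly(Preset* preset, unsigned seed);  // one random preset mutation
bool SimulateAndCheck(const Preset& preset);         // simulate; false on NaN/inf etc.

// The basic fuzz pattern: mutate, simulate, check an invariant, repeat many
// thousands of times. Any crash or invariant violation fails the test.
bool RunSimulationFuzz(int iterations) {
  Preset* preset = MakeEmptyPreset();
  for (int i = 0; i < iterations; ++i) {
    MutateRandomly(preset, /*seed=*/static_cast<unsigned>(i));
    if (!SimulateAndCheck(*preset)) {
      return false;
    }
  }
  return true;
}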

The chaos monkey is sort of the last line of defense. It opens the standalone Anukari app and generates random mouse and keyboard inputs. This has mostly been useful for catching GUI bugs and crashes, but it also verifies that the simulator does not crash in real-world conditions.

It’s hard to say, but I think that without all these tests, my CPU rewrite attempt may have simply failed. At the very least it would have taken 5x as long. The golden tests were especially important. I can’t imagine manually setting up presets and checking that 100+ individual physics features worked correctly, over and over.

The value of generative AI

I wrote above about how I wish I would have tried a SIMD-optimized CPU solution for Anukari’s physics simulation earlier in the project.

One subtle issue with that idea, though, is that 2 years ago I wasn’t really using GenAI for programming. I had played with it a bit, of course, but it wasn’t a core part of my workflow.

I’m not sure how successful my SIMD attempts would have been without GenAI. I’m no stranger to assembly code, but I am far from fluent. We might say that I speak a little “broken assembly.” The main issue is that I write assembly very slowly, as I have to frequently reference the documentation, and I’m not always aware of all the instructions that are available.

So if I had attempted the SIMD approach a couple years ago, it would have gone way more slowly. I could not have experimented with multiple approaches without a lot more time.

Starting about a year ago, GenAI became a core part of my programming workflow. I don’t find that “vibe coding” works for me, but GenAI is an amazing research assistant.

By far the highest leverage I’ve found with GenAI is when I am learning a new API or technology. Having the LLM spit out some example code for the problem I’m solving is incredibly valuable. I always end up rewriting the code the way I want it, but it saves a ridiculous amount of time in terms of figuring out what API functions I need to call, what headers those come from, the data types involved, etc.

For writing the SIMD code, GenAI was a massive superpower. Instead of constantly fumbling around in the documentation to figure out what instructions to use, I asked GenAI and it immediately pointed me in the right direction.

I mostly wrote AVX code first, and then ported it to NEON. This is probably the closest that I approached to vibe coding. In many instances, asking GenAI to translate AVX to NEON, it produced perfect working code on the first try, at least for simple snippets.

Don’t get me wrong, GenAI also produced a ton of absolute garbage code that wasn’t even close to working. But that’s not really a problem for me, since I don’t use it for vibe coding. I just laugh at the silly answer, pick out the parts that are useful to me, and keep going.

Not only did GenAI save me a lot of time in scouring documentation to find what I needed, but also I ended up running experiments I otherwise would not have. This is one of those situations where a quantitative effect (writing code faster) turns into a qualitative effect (feeling empowered to run more interesting experiments).

The “superpower” aspect was really the fact that GenAI emboldened me. Instead of worrying about whether it was worth the time to try an optimization idea, I just went ahead with it, knowing that my research assistant would have my back.

Appendix: Further challenges with using the GPU for audio

One last detail to mention is that running Anukari’s simulation on the GPU also incurs overhead from scheduling GPU kernels, waiting for their results, etc. While it’s nice to avoid this overhead, it’s actually extremely tiny on modern machines. Especially with Apple’s Metal 4 API and unified memory, this overhead is not very consequential, except at the very smallest audio buffer sizes.

However, even if the fast-path overhead of GPU scheduling is small, there are still problems. Notably, GPU scheduling is not preemptive. Once a workload is scheduled, it will run to completion. This means that if another process (say, VJ visualization software) is running a heavy GPU workload, it can interfere with Anukari’s kernel scheduling, leading to audio glitches. To some extent this can be ameliorated using persistent kernels, but all kernels have deadlines and thus need to be rescheduled periodically, opening the door to glitches.

The fact that GPU tasks are not preemptive is a fundamental issue in the world of realtime audio. It’s not unsolvable (e.g. permanent task persistence could be a solution), but it is tricky. It’s also worth noting that the macOS Core Audio approach of using workgroups for audio threads does not apply to the GPU, so the OS has no way of knowing that a GPU task is realtime. This is an OS-level feature that Apple could add to make GPUs more audio-friendly. But given how little the GPU is used for audio, it seems very unlikely that Apple will invest in this.

I don’t think that true preemption on the GPU is something that would ever be realistic. Between the huge amount of register memory and threadgroup memory that GPUs have, switching a threadgroup execution unit between workloads would be way too expensive. We’re talking about copying half a megabyte of state to main memory for a context switch.

This means that even if GPU APIs supported kernel priority (which they mostly don’t), it still could not solve the audio glitch issue, because if the GPU was already fully utilized with running tasks, even a realtime-priority task could not preempt them. The priority would only mean that the task would be the first to start once the existing workload was finished.

Probably the only true solution for flawless audio on the GPU would be for the OS to provide support for reserved GPU cores, alongside realtime priority for low-latency signaling between the CPU and GPU. This would allow realtime applications to run persistent kernels on the reserved cores without any issues with other workloads “sneaking in.”

I believe that Apple might be able to figure out a solution to this and pull it off. NVIDIA also. Definitely not AMD, though, their drivers are hopeless to begin with, even for the simplest use cases. Also I would not expect Intel’s integrated graphics to do this. So even in a perfect world where GPU vendors decide to make audio support a first-class priority, in my opinion its usefulness will be limited to the two top vendors, and users with Radeon or Intel graphics will not benefit.

(Maybe at some point I will write more about these challenges, and the solutions I came up with, but for now I am mostly happy to just move to the CPU and forget all these complex headaches.)

Anukari on the CPU (part 2: CPU optimization)

Captain's Log: Stardate 79317.7

In part 1 of this series of posts, I mentioned that I was surprised to find that naively running my GPU code on the CPU was only 5x slower, when I thought it would be 100x slower. In this post I will explain how I ended up making the CPU implementation much faster than on the GPU.

First approach: spot-vectorization

As mentioned in part 1, I got the original GPU code compiled for the CPU, and then wrote a simple driver to call into this code and run the simulation (in lieu of the code that set up and invoked the GPU kernel).

As you might imagine, Anukari, being a 3D physics simulation, does a lot of arithmetic on float3 vectors of the form {x, y, z}. In other words, vectors of three 32-bit floats. So the first optimization I did was the simplest and most naive thing I could think of, which was to implement all of the float3 operations using SIMD intrinsics. I knew this wouldn’t be optimal, but figured it would give me a sense for whether it was worth investing more work to design a CPU-specific solution.

Note that most of the time when one is dealing with float3 vectors, they are aligned in memory as if they were float4, in other words to 16-byte boundaries. So really you’re working with vectors like {x, y, z, w}, even if the w component is not actually used.
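
As a concrete sketch (the real type in the codebase may be spelled differently), the layout is essentially this:

// Padded and 16-byte aligned so that one element fills a 128-bit SIMD register.
struct alignas(16) Float3 {
  float x;
  float y;
  float z;
  float w_unused;  // padding only; never read
};
static_assert(sizeof(Float3) == 16, "one Float3 per 128-bit register");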

For this experiment I used the 128-bit SIMD instructions offered by SSE on x86_64 processors and NEON on arm64 processors. Because Anukari’s float3 vectors are really float4 vectors with an ignored w component, it’s extremely simple to implement basic arithmetic operations using SSE/NEON. In both cases, there’s an instruction to load the float4 into a SIMD register, an instruction to do the arithmetic operation (such as add), and then an instruction to store the float4 register back into memory.

Thus, the Float3Add() function might look like this using SSE:

  __m128 p1 = _mm_load_ps(&position1);
  __m128 p2 = _mm_load_ps(&position2);
  __m128 d = _mm_add_ps(p2, p1);
  _mm_store_ps(&delta, d);

If you squint your eyes, the NEON code is identical:

  float32x4_t p1 = vld1q_f32(&position1);
  float32x4_t p2 = vld1q_f32(&position2);
  float32x4_t d  = vaddq_f32(p2, p1);
  vst1q_f32(&delta, d);

Along these lines I implemented all the basic float3 operations. Some operations were slightly more complex, such as the dot product, normalization, etc, but this code was all very simple and easy to write, because it dropped straight in to replace the GPU code that operated on float3 vectors in the same way.
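
For example, a dot product might look something like this sketch using SSE4.1 (the actual code may differ). The 0x71 mask tells _mm_dp_ps to multiply only the x/y/z lanes and place the sum in the lowest lane, ignoring the padding lane:

#include <smmintrin.h>  // SSE4.1

float Float3Dot(const float* a, const float* b) {
  __m128 va = _mm_load_ps(a);  // {x, y, z, w}
  __m128 vb = _mm_load_ps(b);
  // Multiply lanes 0-2, ignore lane 3 (w), sum into lane 0, then extract it.
  return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0x71));
}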

This was only an afternoon of work, and already I was seeing big speed improvements in the CPU implementation. It was still not as good as the GPU version, but at this point I was convinced that it would be worth redesigning the physics simulation from scratch to take full advantage of CPU SIMD instructions.

Second approach: compiler auto-vectorization

Now that I’ve explained why I decided it was worth redesigning the physics simulation in a CPU-specific way, I should say a few words about why the way the code was written for the GPU is not optimal for the CPU. There are a multitude of reasons, which I’ll lump broadly into two categories.

1. Memory access patterns

The GPU code is heavily-optimized to take advantage of the unique memory situation on the GPU.

Anukari’s GPU design relies on the fact that each threadgroup has a huge number of registers. On a typical modern NVIDIA card, a threadgroup has 65,536 32-bit float registers. Compared to a CPU, this is an unfathomably large number. There are some limitations, such as the fact that a single thread can access a maximum of 256 registers, but since Anukari runs up to 1,024 threads, this limit is not relevant (since 65,536 / 1024 = 64 registers per thread).

Note the very surprising fact that each NVIDIA thread group has more register memory than a typical modern CPU core has L1 cache! This calls for a very different design. Registers are the fastest memory possible, and so Anukari crams as much thread-private information into them as it can.

Another thing Anukari’s GPU design relies on is that all the threads in a thread group have extremely fast access to shared thread group memory, which is essentially manually-mapped L1 cache. Anukari uses this for fast inter-thread communication of physics parameters. For example, when calculating spring forces for a given mass, the position of each neighboring mass is looked up from this shared memory. These lookups are random-access so it’s very important that they are served straight from L1 without any possibility of going to main memory.

On the CPU, Anukari can’t rely on a gigantic register space or ultra-fast shared memory. The design will have to be centered around a new memory layout that is friendly for the CPU cache design.

Note that especially when using SIMD instructions, caching becomes incredibly important, because the SIMD calculation throughput is generally higher than main memory throughput. So the only way to saturate the FPU is to ensure the memory layout allows efficient caching (and prefetching).

2. SIMD-unfriendly control flow

When I was first optimizing Anukari’s GPU code, it was a bit of a surprise to me to find out that for the most part, each GPU thread is a scalar processor. Unlike a CPU, there are no special instructions for operating on vectors, because the parallelism comes from zillions of threads instead. There are some exceptions to this, like Apple Silicon hardware being able to do vector memory loads, but if you look for example at CUDA you’ll find that there’s no Float3Add() method in the APIs. You just have to write it yourself with scalar code.

In many ways this makes the bottom-level GPU code easier to write for highly-parallel problems, because you can just write simple scalar code. You don’t have to worry about the parallelism because that happens at a higher level (threads). Other aspects are more difficult on the GPU, but this part is nice.

But this kind of code layout does not map well onto the CPU, where parallelism comes from SIMD instructions that operate on multiple floats at one time (vectors). On the CPU, you have to think much more explicitly about how to lay out the data so that it can be operated on in SIMD fashion.

For example, take this simple operation:

  for (int i = 0; i < m; ++i) {
    float x = ComputeX(i);
    float y = ComputeY(i);
    result[i] = x * y;
  }

On the GPU, this code might well be fine. Each thread is basically a scalar machine, so the scalar multiply of x and y is efficient.

But on the CPU, this code is awful. Each multiply will map to a single scalar instruction, which fails to take advantage of the fact that the CPU is capable of 4, 8, or even 16 multiplications via a single instruction (depending on the CPU). Storing the result of the multiplication similarly could be made more efficient with SIMD instructions, but is not.

Auto-vectorization

One way to restructure the above GPU code to take better advantage of the CPU’s resources is as follows:

  float xs[m];
  float ys[m];
  for (int i = 0; i < m; ++i) {
    xs[i] = ComputeX(i);
    ys[i] = ComputeY(i);
  }

  for (int i = 0; i < m; ++i) {
    result[i] = xs[i] * ys[i];
  }

This looks a bit silly at first. We’re doing two passes over the loop instead of one, and we have to store data in two intermediate arrays.

However, there is magic that can happen in the second of these two loops: the compiler can automatically output SIMD instructions instead of scalar instructions. The compiler sees that each of the three arrays involved is a contiguous block of floats, and thus they can be loaded as vectors (of whatever width the targeted instruction set handles), operated on as vectors, and stored as vectors. For a machine with a SIMD width of 8, this loop will process 8 floats per iteration, which is potentially a gigantic speedup.

There are many caveats here. For a full 8x speedup, the data being operated on definitely needs to fit into L1 cache, or else the loop will be memory-bound and while it still may be faster than scalar instructions, it won’t be 8x faster. As a consolation prize, though, if the memory is not already cached, at least it is contiguous, so the processor’s automatic prefetching will be effective.

In many cases, especially with small datasets, writing the code this way can allow the compiler to automatically output efficient platform-specific instructions while keeping the code portable and readable.

At this stage I did a complete rewrite of Anukari’s physics engine, restructuring the data structures so that they were laid out in memory in a way that made auto-vectorized loops easy to write. I encapsulated all the basic math operation loops in functions so that the code was pretty readable:

  float deltas[m];
  Subtract(m, positions1, positions2, deltas);

  float lengths[m];
  Length(m, deltas, lengths);

And so on. This was a pretty huge overhaul, since the code had to be structured very differently than on the GPU.

But it was worth it. At this point, the speed of the physics simulation on the CPU was very similar to the GPU. For some Anukari presets it was even faster.

Third approach: manual vectorization via intrinsics

Compiler auto-vectorization is really nice. The fact that it lets you write portable code without worrying about the specifics of the platform you’re targeting feels pretty magical. The compiler automatically uses the largest SIMD width available for the platform, and also deals with things like outputting a bit of scalar code at the end of each vector loop to handle “remainder” elements, to handle arrays where the length is not an exact multiple of the SIMD width.

However, there are limitations to what compilers can do. They may feel magical, but they are not.

I found that the compilers I’m using (MSVC, Apple Clang) both had extremely strict limitations on what kinds of loops they can vectorize. The limitations differ between compilers, but in general I’d describe them as extraordinarily conservative.

The problem compiler authors have is that the visible behavior of the instructions they generate must adhere to the language spec at all times. So they have to detect cases where there’s any possibility that vectorization might produce observably different results than scalar code, in which case they cannot vectorize.

This means that to write code that the compiler can auto-vectorize, you have to be extremely deliberate. You have to learn about the compiler’s requirements in detail, and then have to structure the code in a very particular way to ensure that the compiler has nothing to be paranoid about.

The MSVC compiler is really nice here, because you can ask it to report a reason code for every loop that it does not vectorize. But there are more than 35 reason codes it might generate to explain why a loop was not vectorized. It’s extremely cautious! Here’s one example of a loop that MSVC cannot vectorize, taken from their documentation:

   // Code 501 is emitted if the compiler cannot discern the
   // induction variable of this loop. In this case, when it checks
   // the upper bound of 'i', the compiler cannot prove that the
   // function call "bound()" returns the same value each time.
   // Also, the compiler cannot prove that the call to "bound()"
   // does not modify the values of array A.

   for (int i=0; i<bound(); ++i) {
       A[i] = A[i] + 1;
   }

This leads to situations where as the programmer, you can see that a loop is trivially safe to vectorize, but the compiler cannot prove to itself that it can do so safely and thus emits scalar instructions.

At first I jumped through all the hoops to structure my code in a way where the compiler could auto-vectorize it. All pointers used inside loops were stored as locals to prove that they were not modified concurrently by another thread. All pointers were marked as __restrict to guarantee that they were not aliased. Loops were marked with compiler-specific pragmas to inform the compiler that there were no data dependencies to worry about. And so on.
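
To give a flavor of what those hoops look like, here is a simplified sketch (not the actual Anukari code) of a loop written to appease the vectorizer:

// __restrict promises the compiler that the pointers do not alias, and the
// compiler-specific pragma promises there are no loop-carried dependencies.
void Scale(int n, const float* __restrict in, float* __restrict out, float k) {
#if defined(_MSC_VER)
#pragma loop(ivdep)
#elif defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#endif
  for (int i = 0; i < n; ++i) {
    out[i] = in[i] * k;
  }
}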

But in the end, I found that despite jumping through all the hoops, I still had to look at the assembly output every time I compiled my code. Both compilers have options to produce warnings about non-vectorizable loops, but I did not find these completely reliable. Even when they worked, sometimes I found that the way the compiler vectorized a loop was weirdly inefficient.

Ultimately I came to the conclusion that if I was going to have to look at the compiler’s assembly output after every compile, on both of my target platforms (x86_64 and arm64), then I may as well just give up on compiler auto-vectorization and write platform-specific instructions directly using intrinsics.

Initially I was resistant to this idea, but very quickly I found that it was much simpler to explicitly tell the compiler what instructions to use than to hope that it implicitly figured out what I wanted. Knowing what I know now, I will never attempt to rely on auto-vectorization again for a problem that requires SIMD instructions. It is easier to just do it yourself.

So I went through my SIMD code, and for example rewrote functions that were originally like this (leaving out a lot of the details):

void Subtract(int m, float* xs, float* ys, float* results) {
  for (int i = 0; i < m; ++i) {
    results[i] = xs[i] - ys[i];
  }
}

To look like this instead:

void Subtract(int m, float* xs, float* ys, float* results) {
#if USE_AVX2
  for (int i = 0; i < m * 4; i += 8) {
    __m256 x = _mm256_load_ps(&xs[i]);
    __m256 y = _mm256_load_ps(&ys[i]);
    __m256 r = _mm256_sub_ps(x, y);
    _mm256_store_ps(&results[i], r);
  }
#elif USE_NEON
  for (int i = 0; i < m * 4; i += 4) {
    float32x4_t x = vld1q_f32(&xs[i]);
    float32x4_t y = vld1q_f32(&ys[i]);
    float32x4_t r  = vsubq_f32(x, y);
    vst1q_f32(&results[i], r);
  }
#endif
}

This may look complicated at first glance, but really it’s dirt simple; it’s just so explicit that it ends up a bit verbose. I have found that it is much easier to get the results I want this way than with auto-vectorization.

By the way, you may have noticed that my hand-written code does not include a scalar remainder loop to handle the case where the length of the input arrays is not an exact multiple of the SIMD width. How? To keep the code simple, I obviate the need for remainder loops by guaranteeing that all array lengths are rounded up to a multiple of the SIMD width.

This does mean that some extra unused floats may be processed at the end of the loop. But when the arrays are large, this becomes a rounding error.
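
The rounding itself is just the usual round-up-to-a-multiple trick, something along these lines (my own names, not the codebase’s):

// Round an element count up to a multiple of the SIMD width so that every
// loop can step a full register at a time with no scalar remainder.
constexpr int kSimdWidth = 8;  // e.g. AVX2: eight 32-bit floats per register

constexpr int RoundUpToSimdWidth(int n) {
  return (n + kSimdWidth - 1) / kSimdWidth * kSimdWidth;
}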

(Note to readers who may try this: if you write a SIMD loop that does a reduction, for example accumulating a sum, be very careful about remainder elements so that you don’t accumulate garbage data! You can guess how I know this.)

At this stage of the physics code rewrite, the CPU simulation’s speed is now uniformly a little faster than on the GPU.

Current approach: bespoke single-pass intrinsics

Despite now having a CPU implementation that worked better than the old GPU version, I was still not satisfied. It was now becoming clear to me that actually I could make the CPU code not just a little faster, but substantially faster than the GPU code.

While the SIMD-vectorized loops above make efficient use of SIMD instructions, and access memory in a linear way that is cache-friendly, there are still some obvious drawbacks.

One problem is that the code ends up iterating over the same data many times, in many loops. If the dataset fits into L1 cache, this is quite fast, but reading L1 is still slower than reading registers, and also we have to consider overhead of the loop instructions themselves. If the dataset is bigger than L1 cache, it’s a performance disaster as things spill to L2 which is much slower.

A deeper problem, though, is that Anukari’s physics simulation has a few places where the data has to come from random access memory fetches rather than beautifully-efficient linear accesses.

For example, while iterating over each spring, we have to fetch the position of the spring’s endpoints from the masses it’s connected to. And later we have to accumulate the spring’s force into memory for each mass, which is also random access. Unfortunately we cannot assume this data is in the L1 cache.

There are tricks here, like sorting the springs by the memory address of their attached masses. But because each spring is attached to two masses, this can only completely linearize the lookups for one end of the springs. It helps, but there is still quite a bit of random access for the other end.

After a few rounds of profiling and optimization, it was clear that these random accesses were the biggest bottleneck, so I brainstormed ways to improve the situation.

The original code was structured in three parts: 1. a gather pass where random access fetches were stored into a linear array, 2. the processing passes, and 3. a scatter pass where spring forces were written out to random access locations. Both the scatter and gather passes were quite slow.

The trick that ended up helping was to rewrite this code as a single pass that performed the processing in-line with the gather and scatter, so that the processor could pipeline all these instructions and ultimately do all the computation while waiting on the memory fetches and stores.

The original multi-pass algorithm allowed the processor to sit idle with an empty pipeline while waiting for fetches/stores, whereas this new algorithm kept it busy. The pipelining not only parallelized computation with memory access, but also allowed the fetch and store instructions to overlap.
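
To make the shape of this concrete, here is a heavily simplified scalar sketch of the fused pass. The real loop is unrolled and written with SIMD intrinsics, and all of the type and field names here are my own, not Anukari’s:

#include <cmath>

struct float3 { float x, y, z; };
static float3 operator-(float3 a, float3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float3 operator*(float3 a, float k) { return {a.x * k, a.y * k, a.z * k}; }
static void operator+=(float3& a, float3 b) { a.x += b.x; a.y += b.y; a.z += b.z; }
static void operator-=(float3& a, float3 b) { a.x -= b.x; a.y -= b.y; a.z -= b.z; }
static float Length(float3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

struct Spring { int mass_a; int mass_b; float stiffness; float rest_length; };

// Single pass: per spring, gather both endpoint positions (random access),
// compute the spring force inline, and immediately scatter-accumulate it, so
// the arithmetic overlaps with the outstanding loads and stores instead of
// running as separate gather/process/scatter passes.
void AccumulateSpringForces(int num_springs, const Spring* springs,
                            const float3* positions, float3* forces) {
  for (int i = 0; i < num_springs; ++i) {
    const Spring& s = springs[i];
    const float3 delta = positions[s.mass_b] - positions[s.mass_a];  // gather
    const float length = Length(delta);
    const float3 force = delta * (s.stiffness * (length - s.rest_length) / length);
    forces[s.mass_a] += force;  // scatter-accumulate
    forces[s.mass_b] -= force;
  }
}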

Furthermore, I paid attention to how many CPU registers the loop required, looked at the assembly output to confirm, and managed to write it in such a way that the loop does not spill registers onto the stack. So in some ways Anukari’s spring simulation is theoretically optimal: it reads each required piece of data exactly once from memory, does all the intermediate computations using only registers and only full-width SIMD instructions (zero scalar operations), and writes the result exactly once.

This technique provided a 3-4x speedup for the spring simulation, which tends to be Anukari’s single hottest loop. Seeing this advantage, I then went on to rewrite the next few hottest loops using the same technique. Even loops that do not perform random memory accesses benefit substantially from less loop overhead and fewer round-trips to L1 cache.

After this round of optimization, the CPU code is now substantially faster than the GPU code in virtually all cases.

Drawbacks

The drawback to this latest way of structuring the SIMD code is that it does make the code significantly more complicated and more difficult to read. Before, the simulation was a series of tidy API calls like Add(), Subtract(), Length(), and so on, and the platform-specific SIMD intrinsics were tucked away in the implementation of that API in compact single-purpose functions.

In the new world, though, the simulation for each type of physics object is completely written inline using platform-specific SIMD intrinsics with no tidy abstraction. In itself this is not so bad, but to squeeze out maximum performance, some of these loops have to be structured in pretty weird ways.

I mentioned above that the hottest loops do all operations at full SIMD width with zero scalar operations. This is the design factor that introduces the most complexity.

The problem is that Anukari often does operations that alternate between 3-dimensional float3 vectors, and scalar values. For example, it might take the length of a float3, which goes from 3D to 1D, and then multiply another float3 by that length, going back from 1D to 3D.

The length operation involves taking the square root of a scalar value, which is very slow and which we definitely want to vectorize. But to vectorize the square root, we need to have enough scalar values to fill a SIMD register.

Using NEON for this example, the SIMD width is 4, meaning a register contains 4 32-bit floats. So if we want to vectorize the square root, we have to unroll the outermost loop so that on each iteration it processes 4 float3s. It takes the dot product of each one with itself to get 4 squared lengths, which are packed into a single SIMD register and then a single square root instruction can be used to compute all 4 square roots at once.

This gets quite messy, and some cleverness is involved to structure things such that the shuffling of 32-bit floats around between SIMD registers can be done using only SIMD instructions (rather than pulling individual SIMD lanes out into float registers and re-packing, using many instructions). But this is all doable, and the results are incredibly worthwhile.
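
As one example of the kind of shuffling involved, here is a sketch (under my own assumptions about the layout, not the real code) of computing four lengths at once with NEON. vld4q_f32 de-interleaves four consecutive padded {x, y, z, w} elements into x/y/z/w registers, so a single vsqrtq_f32 handles all four square roots:

#include <arm_neon.h>

// Compute the lengths of four consecutive padded float3s in one shot.
void Lengths4(const float* xyzw, float* lengths) {
  // v.val[0] = {x0, x1, x2, x3}, v.val[1] = {y0, ...}, v.val[2] = {z0, ...}
  float32x4x4_t v = vld4q_f32(xyzw);
  float32x4_t sq = vmulq_f32(v.val[0], v.val[0]);
  sq = vfmaq_f32(sq, v.val[1], v.val[1]);  // += y*y
  sq = vfmaq_f32(sq, v.val[2], v.val[2]);  // += z*z
  vst1q_f32(lengths, vsqrtq_f32(sq));      // four square roots at once
}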

Sadly, because there is manual loop unrolling and weird clever shuffling going on, the structure of the code at this point diverges pretty dramatically between AVX2 (with a SIMD width of 8) and NEON (SIMD width of 4). So there are two very independent implementations for the hottest loops.

An aside about float3 handling with SIMD

Earlier in this post, I mentioned that often when handling float3 vectors, they are processed as float4 vectors with an ignored “w” component. This is a bit sad, because it means that 25% of the processor’s SIMD lanes are being wasted.

There is a way to eliminate this inefficiency and to utilize 100% of the SIMD lanes.

The reader may be familiar with the terms Structure of Arrays (SoA) and Array of Structs (AoS). The version of the SIMD code that operates on float3s with an ignored “w” component would be considered AoS, because each float3 is a structure of 4 floats which are contiguous in memory: {x, y, z, w}:

void Subtract_AoS(int m, float* xs, float* ys, float* results) {
  for (int i = 0; i < m * 4; i += 8) {
    _mm256_store_ps(&results[i],
        _mm256_sub_ps(
          _mm256_load_ps(&xs[i]),
          _mm256_load_ps(&ys[i])));
  }
}

But it’s possible to reorganize the data structures in a SoA fashion, in which case we can rewrite the code in a way that does not waste any SIMD lanes:

struct Float3SoA {
  // Structure of Arrays: one array per component instead of {x, y, z, w} structs.
  float* x;
  float* y;
  float* z;
};

void Subtract_SoA(int m, const Float3SoA* a, const Float3SoA* b, Float3SoA* results) {
  for (int i = 0; i < m; i += 8) {
    _mm256_store_ps(&results->x[i],
        _mm256_sub_ps(
          _mm256_load_ps(&a->x[i]),
          _mm256_load_ps(&b->x[i])));
    _mm256_store_ps(&results->y[i],
        _mm256_sub_ps(
          _mm256_load_ps(&a->y[i]),
          _mm256_load_ps(&b->y[i])));
    _mm256_store_ps(&results->z[i],
        _mm256_sub_ps(
          _mm256_load_ps(&a->z[i]),
          _mm256_load_ps(&b->z[i])));
  }
}

This is pretty verbose and repetitive, but notice that the main change is that instead of one array of {x, y, z, w} structs, there are now three arrays, one for each of the x, y, and z values.

There are in fact two big advantages to this approach. The first is the one we’ve already mentioned: there is no “w” component being computed, and thus there are no wasted SIMD lanes. The second advantage is that the amount of arithmetic per loop iteration is tripled, so the relative overhead of the loop instructions themselves (compare and increment) is decreased. (It’s a subtle detail, but notice that the second version’s for loop has i < m instead of i < m * 4.)

Due to these advantages, I originally used the SoA version of this code for Anukari. However, it eventually turned out that in all the cases that mattered, the AoS version was much faster, so I restructured to use AoS instead.

I found this a little surprising, because whatever advantage the AoS version has, it has to be enough to make up for the 25% SIMD lane wastage.

The reason AoS performs better for Anukari comes back to the fact that the slowest parts of the simulation code are where random memory accesses take place. In the SoA memory layout, randomly accessing a float3 actually requires three random scalar memory accesses. Whereas a random access of a float3 in the AoS layout is a single vector access.
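
A tiny sketch of the difference, using hypothetical array names, for a random lookup of mass j’s position:

#include <immintrin.h>

// AoS: the padded {x, y, z, w} of mass j is one contiguous 16-byte load,
// touching a single cache line.
__m128 LookupAoS(const float* positions_aos, int j) {
  return _mm_load_ps(&positions_aos[4 * j]);
}

// SoA: the same lookup takes three scalar loads from three separate arrays,
// potentially touching three different cache lines.
__m128 LookupSoA(const float* xs, const float* ys, const float* zs, int j) {
  return _mm_set_ps(0.0f, zs[j], ys[j], xs[j]);
}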

Therefore while the computation passes over the AoS data are somewhat slower, the random accesses in other parts of the code are dramatically faster, more than making up for it.

A slightly more subtle consideration has to do with the size of a cache line, which for the processors I care about is always 64 bytes. Both the SoA and AoS approaches do random lookups of data that is smaller than a single cache line. The CPU has to always read a full cache line from memory, which means that it’s reading extra bytes that we do not use, wasting memory bandwidth.

For SoA, we look up three 4-byte values from separate memory locations. This means that the CPU will fetch three 64-byte memory regions, wasting 180 bytes of memory bandwidth.

For AoS, though, we only look up one 16-byte value (the float3 padded out to a float4). The CPU will fetch one 64-byte memory region, wasting only 48 bytes of memory bandwidth.

In each case it’s possible that there are cache hits, depending on the random access pattern, in which case not all of the fetches are wasted. But if the dataset is large enough, or the lookups sparse enough, SoA wastes nearly four times as much memory bandwidth as AoS does.

One last little side-side note: for random memory accesses, CPUs offer prefetch instructions, which tell the CPU about a future memory access that will be made. The idea is that in a random access loop, as you access each item you also tell the CPU about an item you’ll need a few iterations later. In principle this can allow the CPU to start fetching that memory before it is needed, concurrently with execution of the current instructions.

I experimented with prefetching quite a bit, and never found success with it. I’m not completely sure why, but my best guess is that it adds additional instructions, and my loops are already pipelined heavily enough that the cost of executing a few more instructions was not paid back by the prefetching.

Stay tuned

Despite the fact that I’ve probably already written way more than anybody ever wanted to read on this topic, there will be one last installment in this series of posts about Anukari on the CPU, where I will reflect on some of the lessons I learned as part of this process, as well as some other general thoughts on what went well and not so well.

