Way more detail than you ever wanted to know about the development of the Anukari 3D Physics Synthesizer

Audio quality improvements

Captain's Log: Stardate 79463

My last couple of posts were about annoying website engineering stuff that I would have preferred to not spend time on. Fortunately while annoying, that wasn't a lot of work, and most of my time has still gone to working on the Anukari software itself.

A couple of weeks ago I released version 0.9.23, which was focused on audio quality improvements (full release notes here). There are also some pretty significant performance improvements, for example instruments with lots of microphones now perform much better, as I rewrote the mic simulation code in pure SIMD using all the tricks I learned with other entity types.

Now that performance is in good shape, I'm really happy to have the opportunity to work on audio quality again. There's more performance work I can do, and I will get to it at some point, but for now I am going to prioritize making the plugin sound better, both by improving the existing physics simulation and by adding more audio features.

Master Limiter

One big thing in this release is that I replaced the master limiter, which could cause slight crackling on some presets and, in general, was flattening out the sound in an unpleasant way.

The limiter has a bit of history. Originally there was no limiter, no circuit breaker, and no automatic physics explosion detection. So when the physics system exploded due to crazy parameters, Anukari could make incredibly loud chaotic sounds. My wife and I referred to this as "Evan opened another gate to Hell in his office."

My first solution was the circuit breaker, which monitors the master RMS level and automatically pauses the simulation if a configurable limit is exceeded. This is really helpful when building presets, as it freezes the simulation before things get too chaotic, which allows you to undo whatever change you made that caused things to go haywire, and then go about your work.
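
For illustration, here is a minimal sketch of how such an RMS-based circuit breaker can work. This is not Anukari's actual code; the class name, the one-pole averaging, and the window handling are my assumptions.

// Hypothetical sketch (not Anukari's code): an RMS-based circuit breaker.
// It keeps a running mean of squared master-output samples and reports when
// the RMS level exceeds a configurable limit so the caller can pause the
// physics simulation.
class RmsCircuitBreaker {
 public:
  RmsCircuitBreaker(float limitRms, float windowSeconds, float sampleRate)
      : limitSquared_(limitRms * limitRms),
        alpha_(1.0f / (windowSeconds * sampleRate)) {}  // one-pole smoothing

  // Feed one master-output sample; returns true if the breaker trips.
  bool process(float sample) {
    meanSquare_ += alpha_ * (sample * sample - meanSquare_);
    return meanSquare_ > limitSquared_;
  }

 private:
  float limitSquared_;
  float alpha_;
  float meanSquare_ = 0.0f;
};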

Despite the circuit breaker, it was still possible to make really loud noises by accident. For example, you can create an instrument that generates a loud sound sitting just below the circuit breaker's trip threshold. And sometimes you don't want the circuit breaker on at all; while performing, you probably don't want it to pause the simulation automatically.

So I added the master limiter, using the basic JUCE class since I expected it to be temporary. This seemed to work fine, guaranteeing that nobody's ears were melted by gateways to Hell.

Later when I added voice instancing, the physics explosion problem became more of an issue. Due to the way that Anukari uses time dilation to create higher pitches, every instrument will ultimately have a highest note that it can play without exploding, because the physics time step gets too large. So if you play a scale up the keyboard, you'll eventually hit a note that can't be simulated. The circuit breaker could catch this, but that's an awful user experience, since the whole simulation is paused.
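
I won't reproduce Anukari's real pitch math here, but the core idea is roughly the following: the effective time step grows with the note's frequency ratio, so sufficiently high notes eventually push the integrator past its stability limit. This is a hedged sketch, not the actual implementation.

#include <cmath>

// Hedged sketch, not Anukari's actual pitch math: scale the physics time
// step by the equal-tempered frequency ratio of the note relative to a
// reference note. Higher notes get a larger effective time step, which is
// why the integration eventually becomes unstable at the top of the range.
double effectiveTimeStep(double baseTimeStep, int midiNote, int referenceNote) {
  const double ratio = std::pow(2.0, (midiNote - referenceNote) / 12.0);
  return baseTimeStep * ratio;
}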

To handle this, I added automatic per-voice physics explosion detection. The most reliable signal I found was to monitor the maximum squared velocity of any object; if it exceeds a given threshold, that voice instance is automatically returned to its resting state. So if you play a note that's too high, it just doesn't do anything, or at worst you might get a light click and then silence. Notes outside the supported range simply stay quiet, and everything else keeps working.
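
A minimal sketch of that detection logic might look like this; the types and the caller-side reset are assumptions for illustration, not Anukari's code.

#include <vector>

struct Vec3 { double x, y, z; };

// Hypothetical sketch of the per-voice check: if any mass in a voice exceeds
// the squared-velocity threshold, the voice is considered to have exploded
// and the caller resets it to its resting state.
bool voiceExploded(const std::vector<Vec3>& velocities, double maxSpeedSquared) {
  for (const Vec3& v : velocities) {
    const double speedSquared = v.x * v.x + v.y * v.y + v.z * v.z;
    if (speedSquared > maxSpeedSquared) return true;
  }
  return false;
}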

I should also mention that at some point after I added the master limiter, I added compression for the microphones. This also massively reduced the possibility of producing gates to Hell, as even if they happen, the compressors will likely reduce the gain substantially and it won't be so bad.

Getting back to the master limiter, for a while I had noticed some very light crackling that I couldn't explain on some presets, such as SFX/Broken Reactor. It only happened with several voices playing loudly, but it was audible. Originally I assumed it was a problem with my compressor implementation, but I disabled the compressors and it still crackled. Ultimately I just kept disabling features until the crackle went away, and lo and behold, it was the JUCE Limiter class that was causing crackles.

Of course when I looked at the limiter code, I found a comment I wrote a year or two ago saying that the limiter crackled when the limit was set above 0 dBFS. I guess I thought I had fixed this by clamping the limit to a maximum of 0 dBFS, but I hadn't listened hard enough to realize that artifacts were possible below that as well.

The funny thing was that with the limiter disabled, some presets sounded way better. Not because of the absence of artifacts, since those were limited to a few unusual presets, but because the dynamic range was much higher, which is one of the things I've always enjoyed about the sounds Anukari can make. Especially with percussive or metallic sounds, it's so important to have a lot of dynamic range.

JUCE's Limiter class runs two compressors with fixed parameters in series, followed by a hard limiter with adjustable threshold (in dBFS) and release parameters. It turns out that it shapes the sound pretty significantly even when the signal is well below the hard limit.

Given that JUCE's Limiter sounded really bad for my use case, in addition to the crackling, I decided not to spend any time trying to fix it. I got rid of any kind of shaping limiter entirely, and instead went with a simple hard limit at +6 dBFS. Okay, not entirely hard; there's a polynomial taper, but it's pretty hard. I chose this threshold because it's easy to avoid clipping, and if the system goes haywire your eardrums will still be protected.
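
I won't claim this is the exact curve Anukari uses, but the general idea (unity gain below an assumed knee, a polynomial taper up to the +6 dBFS ceiling, and a hard clamp beyond it) looks something like this:

#include <algorithm>
#include <cmath>

// Sketch of a hard limit with a polynomial taper (not the exact curve
// Anukari uses). Samples below an assumed knee pass through untouched; from
// the knee up, a quadratic taper bends the curve until it flattens out at
// the +6 dBFS ceiling, and everything beyond that is clamped.
float hardLimitWithTaper(float x) {
  const float ceiling = std::pow(10.0f, 6.0f / 20.0f);  // +6 dBFS ~= 1.995
  const float knee = 0.9f * ceiling;                    // knee is an assumption
  const float a = std::fabs(x);
  if (a <= knee) return x;  // linear region, untouched
  const float span = ceiling - knee;
  const float t = std::min((a - knee) / (2.0f * span), 1.0f);
  const float y = knee + span * (2.0f * t - t * t);  // hits the ceiling at t=1
  return std::copysign(y, x);
}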

Voila, no more crackling, and way more dynamic range. This was a huge improvement.

Preset LUFS

After getting rid of the master limiter, I ran into a big issue: many of the presets were now much louder. In other words, they had been relying on the master limiter to control their loudness. No wonder the dynamic range was squashed!

This meant that I had to go through and re-level all of the 200+ factory presets. This is something I had wanted to do for a long time; the presets I made and the ones Jason made had pretty different loudness levels, and the ones I made in particular were kind of all over the place.

To get this right, I installed the Youlean Loudness Meter 2 plugin to measure Anukari's integrated LUFS. This gave me an objective loudness metric. I targeted -15.0 LUFS for each preset under "typical" playing circumstances. The "typical" playing is a bit arbitrary, but I wrote some MIDI clips that I felt were reasonable for various kinds of presets. Big 4-note chords for pads and mallet instruments, fast lines for melodic instruments, single repeated notes for percussion, stuff like that.
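
The arithmetic behind the re-leveling is simple enough to show as a tiny hypothetical helper: the gain to apply is just the difference between the target loudness and the measured integrated loudness.

// Hypothetical helper showing the re-leveling arithmetic: the gain to apply
// (in dB) is the difference between the target and the measured integrated
// loudness. E.g. a preset measuring -11.3 LUFS needs -15.0 - (-11.3) = -3.7
// dB, i.e. 3.7 dB of attenuation.
double levelingGainDb(double targetLufs, double measuredLufs) {
  return targetLufs - measuredLufs;
}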

While the LUFS metric was incredibly helpful, especially given how much ear fatigue I built up after many hours of leveling presets, I still relied on my ear to make the final judgement. Especially for instruments with very short note duration, integrated LUFS was not a great metric, and I was looking more at instantaneous LUFS and also simply listening.

It ended up taking two full passes over the presets to get the levels to a point where I was happy with them. But it was really worthwhile! Now you can cycle through the presets quickly, playing a couple of notes on each one, and the volume level is far more consistent than before. You never have a preset jump out at twice the loudness of the previous one. It feels much more professional.

The presets in general ended up being a bit quieter than before, so I also added a master output level knob. This should help especially in the standalone app when you want all presets to be a bit louder, and don't want to have to fiddle with the per-preset gain.

In addition, because I spent a lot of time cycling through presets, I made it so that when changing presets there's a very brief fade out/in. It wasn't a big deal, but if a preset was making noise when you cycled to the next one, there was a definite click. Now there's some softening to avoid any click. And I added this click-suppression in a couple other places, such as when the simulation is paused. It's a small thing but really feels good.
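
The fade itself is nothing fancy. Here's a hedged sketch of the idea: a short linear ramp applied to the output around the preset switch, with the ramp length being my assumption.

#include <cstddef>
#include <vector>

// Hedged sketch of the click suppression: apply a short linear fade-out to
// the old preset's output and a fade-in to the new one around the switch.
// The ramp length is my assumption; a few milliseconds is plenty.
void applyLinearFade(std::vector<float>& samples, bool fadeIn) {
  const std::size_t n = samples.size();
  if (n < 2) return;
  for (std::size_t i = 0; i < n; ++i) {
    const float t = static_cast<float>(i) / static_cast<float>(n - 1);
    samples[i] *= fadeIn ? t : (1.0f - t);
  }
}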

No More Ringing at Rest

Another issue that had long plagued Anukari was that some instruments would make a weird ringing sound when they were at rest. Basically, there was a digital noise floor. For most instruments, this was only audible if you cranked up the gain. But for instruments with extremely stiff springs, or lots of microphones, it was very audible. The worst offender was Mallet-Metallic/4 Ding Chromatic. It is one of my favorite presets, but it was really noisy.

Over the years I made several attempts to fix this, each time failing. I ran quite a few experiments on different formulations for the damping equations, since the ringing indicated that the system was somehow retaining energy. I did reduce the noise floor a bit with some very subtle changes to the damping integration, but never could get it to go away entirely.

For performance reasons Anukari uses single-precision (32-bit) floating point arithmetic for all the physics calculations. I always wondered whether using double-precision (64-bit) would help, but back in the GPU days this was not really an option, because many GPU implementations do not support doubles, and the ones that do are not necessarily very fast. In OpenCL, double support is optional and mostly not offered.

But a deeper problem with doubles on the GPU was that the physics state had to be stored in threadgroup memory, which is extremely limited. Doubling the size of the shared physics state structure would cut the number of entities that could be simulated in half, making many presets unusable.

Anyway, the new CPU physics implementation does not have the limitation of storing everything in the tiny GPU threadgroup memory. It's true that doubles will still use twice as much memory as floats, which may have performance effects from reading more memory, and of course the SIMD operations have half the width of the float versions. But I figured... why not give it a shot?

I hacked together the worst AI slop prototype of double support, being careful to only use double precision for the absolute minimal set of physics operations that might affect the ringing issue, and voila, the ringing was completely gone. It was always simply due to the lack of precision in 32-bit floats. This makes a lot of sense; basically with stiff enough springs and high enough gain, the closest position that a 32-bit float could represent to the true lowest-energy state might contain enough error to matter. At each step, a small force would be calculated to push things towards equilibrium, but the system would only orbit around equilibrium in accordance with the available floating point precision. (Of course 64-bit doubles still behave this way, but the error is way, way too small to be audible even with extremely high gain.)
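
Here's a standalone toy (not Anukari's code) that shows the mechanism: a mass held between two stiff springs has a true equilibrium position that isn't exactly representable as a 32-bit float, so the net force evaluated at the nearest float position doesn't cancel to zero, and that leftover force is what keeps nudging the mass.

#include <cstdio>

// Standalone toy, not Anukari's code: a unit mass held between two stiff
// springs. The true equilibrium position is (k1*a1 + k2*a2) / (k1 + k2),
// which is generally not exactly representable as a 32-bit float, so the
// net force evaluated at the nearest float position does not cancel to
// zero. With stiff springs that leftover force is large enough to keep
// nudging the mass. In double precision the leftover is many orders of
// magnitude smaller.
int main() {
  const double k1 = 2.0e6, k2 = 1.0e6;  // stiff spring constants
  const double a1 = 0.3, a2 = 0.7;      // anchor positions
  const double xEq = (k1 * a1 + k2 * a2) / (k1 + k2);  // true equilibrium

  const float xF = static_cast<float>(xEq);  // nearest representable float
  const float forceF =
      -static_cast<float>(k1) * (xF - static_cast<float>(a1)) -
      static_cast<float>(k2) * (xF - static_cast<float>(a2));
  const double forceD = -k1 * (xEq - a1) - k2 * (xEq - a2);

  std::printf("leftover force at equilibrium, float : %g\n", forceF);
  std::printf("leftover force at equilibrium, double: %g\n", forceD);
}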

Using doubles is slower than floats, for sure. But there are a couple things that made this change possible.

First, the slowest part of the simulation is the random access lookups to read the positions of the masses that springs are connected to, to calculate the spring forces. These lookups (and force writes) did not get appreciably slower! This may be surprising, but the reason why is pretty simple. All the processors that Anukari runs on use 64 bytes as the size of a cache line. The position of a mass is a three dimensional vector, which is really four dimensions for alignment reasons. So for 32-bit floats that's 16 bytes, and for 64-bit doubles it's 32 bytes. Notice that for either floating point size, the vector fits into one cache line. Because the lookups and writes are random access, and the memory being accessed is often larger than L1 cache, in both cases full cache lines are being read and written, and the size of the float makes no difference.
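
A sketch of what I mean (the struct layouts here are illustrative, not Anukari's actual types):

// Illustrative layouts only, not Anukari's actual types: a 3D position
// padded to 4 components for alignment. Single precision is 16 bytes and
// double precision is 32 bytes, so either way one vector fits inside a
// single 64-byte cache line, and a random-access lookup touches one line
// regardless of which precision is used.
struct alignas(16) Float3Padded  { float  x, y, z, pad; };
struct alignas(32) Double3Padded { double x, y, z, pad; };

static_assert(sizeof(Float3Padded)  == 16, "4 floats  = 16 bytes");
static_assert(sizeof(Double3Padded) == 32, "4 doubles = 32 bytes");
static_assert(64 % sizeof(Float3Padded)  == 0, "no cache line straddling");
static_assert(64 % sizeof(Double3Padded) == 0, "no cache line straddling");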

Second, while the SIMD computation bandwidth is cut in half for the 64-bit operations, in many cases the latency of the computations is eclipsed by the memory latency. The code is written carefully to ensure that computation and memory access are pipelined to the maximum extent. So in the situations where the memory access was the dominating factor, adding extra computational instructions didn't actually increase the runtime.

That said, even with a lot of optimization and luck, the 64-bit operations are slower, so the third factor is that I did a bunch of small optimizations to other parts of the code to speed things up enough to pay back the runtime penalty of the doubles. In the end I was able to make the change net neutral in terms of speed, with the huge audio quality improvement from doubles.

I am extremely pleased that this is no longer an issue!

This website is now on Google Cloud

Captain's Log: Stardate 79462.5

Following up on my last post about my issues with Railway hosting this website: as the title of this post says, it's now running on Google Cloud.

Railway's simplicity was wonderful. Getting a basic web service up and running was incredibly easy, and the UX was clear and straightforward. Railway doesn't try to do everything, which means it isn't massively complicated. And there's really just one way to do each of the things it does, so you don't waste time comparing a bunch of options.

In the end I am simply never going to tolerate my infrastructure provider telling me what software I'm allowed to run. Railway has walked this back a bit, and now allows their software version audit checks to be bypassed, but I've learned something about their philosophy as an infrastructure provider, and want nothing to do with it.

Trying Northflank Hosting

At first I thought maybe I'd just move to one of Railway's direct competitors such as Northflank or Fly.io. I tried Northflank first, and it simply did not work. I got the basic web service configured, which was very easy, though a little harder than Railway because Northflank has far more features to navigate. But in the absolute most basic configuration, external requests to my service were returning HTTP 503 about 30% of the time.

I emailed Northflank's support about the 503 issue, and the CEO replied (I am not sure if this is encouraging or worrying). He suggested that I should bind my service to 0.0.0.0 instead of the default Node setup which was binding to a specific interface. This kind of made sense to me; if there are multiple interfaces and requests may come in on any of them, obviously I need to listen on all of them. I made this change, but it didn't help.

I decided to do some debugging on my own, and from what I can tell, Northflank's proxy infrastructure was not working correctly. I did an HTTP load test by running autocannon locally in my container, and even under extremely heavy load my Node server was returning 100% HTTP 200s. I then used Northflank's command line tool to forward my HTTP port from the container to my workstation, and ran autocannon on my workstation. Again, this produced 100% HTTP 200s. Finally I ran autocannon against the public Northflank address for the server, and it was about 80% HTTP 503s. My guess is that some aspect of Northflank's internal proxy configuration just got set up wrong. The 503 responses included a textual error message that appears to come from Envoy, which I believe is the proxy that Northflank uses internally.

Anyway, I wrote up a detailed explanation of the testing I did as well as the results and sent an email to Northflank's support. They never replied.

After a few days of no help whatsoever from Northflank, I shut down my services and terminated my account. Even if they had replied at that point with a fix, I probably would have abandoned them anyway -- I'd be very scared that even if we got it working, something like this would happen again and I'd be left hanging out to dry.

Google Cloud Run

At this point I had become fed up with the "easy" hosting solutions and decided to just cave in and use a real hosting provider.

I had tried Google Cloud Run maybe a year or two ago, when it was pretty new, and it was just too rough to use. But it has gotten a lot better since then, and is fairly easy to set up. Certainly it is not Railway-level easy, but getting a basic Node web app serving on Cloud Run is pretty dang straightforward.

The flow to get an automatic Github push-triggered release system running is super easy. Like all the other services that do this, you basically just authenticate with Github, tell it which repo/branch to deploy from, and let buildpacks do the rest.

However, literally every other aspect of using Google Cloud is way, way more complicated than with something like Railway. Cloud Run is only easy if you are OK with running in a single region. The moment you want to run in two regions, all the easy point-and-click setup falls apart and you have to write a bunch of super complex YAML configuration instead. And setting up global load balancing is an extraordinarily complex process.

I had flashbacks to back when I worked at Google. We used to joke endlessly about how every single infrastructure project had a deprecated version, and a not-yet-ready version, and this was true. In some cases there were multiple layers of deprecated versions. Every major infrastructure system perpetually had a big program running to get all its users migrated to the new version.

I ran into the deprecated/not-ready paradox twice while setting up the global load balancer.

First, the load balancing strategy is in this state. In true awful Google fashion, it defaulted to creating my load balancer using the classic strategy, which is deprecated, and suddenly I got a ton of notifications to migrate to the new strategy. Why didn't it just default to creating new balancers on the new strategy? Hilariously the migration process can't be done in one step, you have to go through a four-step process of preparing, doing a % migration, etc. I literally laughed out loud when I encountered this. It's just pure distilled Googliness.

Second, I needed to set up an edge SSL certificate. The load balancer setup had a nice simple flow that created one for me with a couple clicks. Great! Except to get the SSL certificate signed, I had to prove that I owned the anukari.com domain by pointing the A record to my Google public IP. Which is impossible, because it is a commercial website serving user traffic, and changing the A record to point to an IP that currently doesn't have an SSL cert will bring the site down for 24-72 hours while Google verifies my domain ownership.

I poked around a bit and found that Google's SSL system has an alternative way to prove domain ownership, via a CNAME record instead of the A record. This would let me prove ownership without bringing the site down. Then Google could provision the SSL cert and I could cut over my DNS A record with zero downtime. Great! Except in true Google fashion, this feature is only available for the new certificate system, and not the "classic" system that works well with global load balancers.

Google Cloud's solution to this is absolutely comical. There's a special kind of object you create that can be used to "map" new SSL certificates into the classic system and make them available to the global load balancer. Of course there is no point-and-click GUI for this stuff, so you have to run a bunch of extremely obscure gcloud commands to set up the mappings and link them to the load balancer. Again this is just so Googly it was completely comical. When I was inside Google these kinds of things kind of made sense, but I can't believe Google does this stuff to their external customers. Misery loves company, I guess? :)

All of the pain-in-the-butt setup aside, now that I have the web service running on Google Cloud, I am quite happy. It performs substantially better than on Railway for about the same price, and I no longer have to worry about my host screwing me over in an emergency by blocking me from pushing whatever software I want to my container. The trade-off is that the configuration is WAY more complex, but I can live with that.

Railway.com knows better than you

Captain's Log: Stardate 79429.5

[EDIT1] Railway responded to this post in an HN comment. Quoting from their reply, "I wished the majority of our userbase knew better than us, but the reality is they don't."

[EDIT2] Railway's founder replied in a separate HN comment.

[EDIT3] Railway added an option to bypass the checks.

[EDIT4] I wrote a post about how this website is now on Google Cloud.

My honeymoon period with Railway.com for hosting this website is over.

I recently ran into a bug in my own code that caused some data corruption in my DB. The consequences of the bug were that a user who purchased Anukari could get emailed a license key that had already been sent to another user. This is really bad! Fortunately I noticed it quickly, and the fix was simple. I got the code prepared and went to do an emergency release to push out the fix.

It's worth noting that the bug had been around for a while, but it was latent until a couple days ago. So the fix was not a rollback, which would typically be the ideal way to fix a production issue quickly. The only solution was to roll forward with a patch.

Imagine my surprise when, during my emergency rollout, the release failed with:

==================================================================
SECURITY VULNERABILITIES DETECTED
==================================================================
Railway cannot proceed with deployment due to security
vulnerabilities in your project's dependencies.

Keep in mind that I'm super stressed about the fact that there is a severe and urgent problem with my app in production, which is causing me problems in real time. I've dropped everything else I was doing, and have scrambled to slap together a safe fix. I want to go to bed.

And my hosting provider is saying, "you are not allowed to push out your urgent fix, because we see that your app contains a far less urgent problem." There is no button that says "I understand, proceed anyway." Railway knows best.

I'll get back to the actual security vulnerabilities Railway detected, but let's set those aside for the moment and talk about the fact that Railway has just intentionally blocked me from pushing out a release, based on their assessment of the security risk. How can that possibly make sense? Railway cannot know how urgent my release is. Railway cannot know whether the security vulnerabilities they detected in my app are even exploitable. They're just looking at Node package versions marked as vulnerable.

The most ridiculous part of this is that the current production version of my app that I was trying to replace depended on the same vulnerable package versions as the patched version I was trying to push out to fix my urgent bug. So in fact the release added zero additional security risk to what was already serving.

Okay so what were the security vulnerabilities that were so dangerous that Railway.com's nanny system engaged to block me from pushing out an urgent fix? They cited two HIGH risk CVEs: this one and this one. They also cited a MEDIUM risk CVE but I'm not going to cover that.

The HIGH risk CVEs are both DOS issues. Attackers can craft payloads that send the server into an infinite loop, sucking up CPU and denying service.

Do I want to fix those CVEs for my service? Absolutely! Would I like my hosting provider to shoot me an email if they detect that my app has those vulnerabilities? Yes, please!

Do I want to be blocked from pushing urgent fixes to my app when there's zero evidence that either of those CVEs are being exploited for my app in any way? I think not.

I want to push out my fix, go to bed, and then come back the next day and upgrade package versions and fix the CVEs.

Now, look, was it a huge deal to upgrade those packages? Not really. In this case I was lucky that it was a straightforward upgrade. But anyone who's used a package manager knows about dependency hell, where trying to upgrade one package leads to a cascade of other dependencies needing to be updated, possibly with conflicts in those dependencies. And even if there is no dependency hell, any time package versions are changed, some degree of testing is warranted to make sure nothing was broken.

These are things that I did not want to even remotely think about during an urgent incident late in the evening, especially not for package versions that were already live in production. It took a stressful experience and added a huge amount of additional frustration.

Railway sometimes uses the mottos, "Ship without friction," and "Let the engineers ship." This felt like friction. I was literally not allowed to ship.

I complained about this to Railway, and they basically said they need to do this to protect neighboring apps from problems caused by my app. I was under the impression that this was the purpose of containers, but what do I know.

I'm guessing Railway is having issues with apps on their free tier wasting CPU resources, costing them money, and found some free-tier apps succumb to these infinite loop CVEs, wasting free vCPUs. But I'm not on the free tier -- I pay for what I use.

I've been a huge fan of Railway for hosting this site. It really is simple and fast, and that's been perfect for such a simple site. I don't need a lot of fancy features, and Railway has let me deploy my code easily without worrying about the details. But this security overreach gives me pause, so this is a great time to look at other hosting providers.

I talked to the founder of Railway's competitor Northflank about this issue, and his reply was, "We would never block you from deploying if we can build or access a valid container image." To me that's the right answer.

Bird's eye view

When I worked at Google, for a long time I was involved with production reliability/safety. I wasn't an SRE, but I worked on services that were in the critical path for ads revenue. Incidents often cost us untold millions of dollars, so reliability was a huge concern.

I wrote and reviewed many dozens of postmortem analyses, and I learned a lot about the typical ways that systems fail. I was in 24x7 oncall rotations for my entire career at Google, and during the latter years I was oncall as an Incident Commander inside ads, running the responses to (very) large incidents.

Google's ads systems were mostly pretty mature, and contained lots of safety checks, etc. I always found it really fascinating to see the ways that a serious outage would slip through all the defenses.

One pattern I came to recognize was that when deciding on the action items / follow-ups for an incident postmortem, the go-to reaction was always to add another safety check of some kind. This makes sense! Something went wrong, and it slipped past our defenses, so more defenses should help.

However, I came to view adding more and more defenses as an anti-pattern. There's a baseline level of safety checks that makes a system safer, but if you are not careful, adding more safety checks actually increases risk, because the safety checks themselves start to become the cause of new problems. There's a great word for this, by the way: iatrogenesis.

I'll give an example. I worked in (anti) ad fraud, and we had systems that would block ad requests that looked fraudulent or scammy. Bugs in these systems were extraordinarily dangerous, because in the worst case we could turn off all advertising revenue globally. (Yes, that is a thing that happened. More than once. I have stories. Buy me a beer sometime...)

These systems accumulated many safety checks over the years. A typical thing would be to monitor the amount of revenue being blocked for fraud reasons, and alert a human if that amount grew too quickly.

But in some cases, these safety checks were allowed to automatically disable the system as a fail-safe, if things looked bad enough. The problem is... there's not really any such thing as "failing safe" here. For a spam filter, failing closed means revenue loss, and failing open means allowing fraud to get through.

So imagine that a script kiddie sits down and writes a bash loop to wget an ad click URL 10,000 times per second. The spam filtering systems trivially detect this as bot traffic, and filter it out so that advertisers aren't paying for these supposed clicks.

But maybe some safety check system sees that this trivial spam filter has suddenly started blocking hundreds of thousands of dollars of revenue. That's crazy! The filter must have gone haywire! We better automatically disable the filter to stop the bleeding!

Uh oh. Now those script kiddie requests are being treated as real ad traffic and billed to advertisers. We have ourselves here a major incident, caused by the safety check that was meant to protect us, created as the result of a previous postmortem analysis.

I'm not saying that all safety checks are bad. You certainly need some. But what I am saying is that if you allow extra safety checks to become the knee-jerk reaction to incidents, you will eventually end up with a complex mess where the safety checks themselves become the source of future incidents. You build a Rube Goldberg Safety Machine, but it doesn't work. The marble flies off the rails.

It's really attractive to just keep adding safety checks, but what is actually required is to take a step back, look at the system holistically, and think about simple ways to make it safer. Generally after a few basic safety checks are in place, further safety comes from redesigning the system itself. Or even better, rethinking the problem that you're trying to solve altogether. Maybe there's a completely different approach that side-steps an entire category of reliability issues.
