Railway.com knows better than you
Captain's Log: Stardate 79429.5
[EDIT1] Railway responded to this post in an HN comment. Quoting from their reply, "I wished the majority of our userbase knew better than us, but the reality is they don't."
[EDIT2] Railway's founder replied in a separate HN comment.
[EDIT3] Railway added an option to bypass the checks.
[EDIT4] I wrote a post about how this website is now on Google Cloud.
My honeymoon period with Railway.com for hosting this website is over.
I recently ran into a bug in my own code that caused some data corruption in my DB. The consequences of the bug were that a user who purchased Anukari could get emailed a license key that had already been sent to another user. This is really bad! Fortunately I noticed it quickly, and the fix was simple. I got the code prepared and went to do an emergency release to push out the fix.
It's worth noting that the bug had been around for a while, but it was latent until a couple days ago. So the fix was not a rollback, which would typically be the ideal way to fix a production issue quickly. The only solution was to roll forward with a patch.
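To illustrate the class of guard that prevents this kind of double-send bug, here's a minimal sketch of claiming a license key atomically. The schema, table, and function names are entirely hypothetical — this is not the actual Anukari backend code — but the idea is that the claim happens via a conditional UPDATE, so a key can never be handed to two buyers:

```python
import sqlite3

# Hypothetical schema/names -- a sketch of guarding against sending the
# same license key to two buyers, not the real backend.
def claim_key(conn, purchase_email):
    row = conn.execute(
        "SELECT id, key FROM license_keys"
        " WHERE claimed_by IS NULL ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        raise RuntimeError("no unclaimed keys; refuse to re-send one")
    key_id, key = row
    # The "AND claimed_by IS NULL" makes the claim conditional: if a
    # concurrent request grabbed the same row first, rowcount is 0 and
    # we fail loudly instead of double-sending the key.
    updated = conn.execute(
        "UPDATE license_keys SET claimed_by = ?"
        " WHERE id = ? AND claimed_by IS NULL",
        (purchase_email, key_id),
    ).rowcount
    conn.commit()
    if updated != 1:
        raise RuntimeError("key claimed concurrently; try again")
    return key
```

The point of the sketch is just that failing loudly when no unclaimed key exists is far better than silently reusing one.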
Imagine my surprise when, during my emergency rollout, the release failed with:
==================================================================
SECURITY VULNERABILITIES DETECTED
==================================================================
Railway cannot proceed with deployment due to security
vulnerabilities in your project's dependencies.
Keep in mind that I'm super stressed about the fact that there is a severe and urgent problem with my app in production, which is causing me problems in real time. I've dropped everything else I was doing, and have scrambled to slap together a safe fix. I want to go to bed.
And my hosting provider is saying, "you are not allowed to push out your urgent fix, because we see that your app contains a far less urgent problem." There is no button that says "I understand, proceed anyway." Railway knows best.
I'll get back to the actual security vulnerabilities Railway detected, but let's ignore that for the moment and talk about the fact that Railway has just intentionally blocked me from pushing out a release, based on their assessment of the security risk. But how can that possibly make sense? Railway cannot know how urgent my release is. Railway cannot know whether the security vulnerabilities they detected in my app are even exploitable. They're just looking at node package versions marked as vulnerable.
The most ridiculous part of this is that the current production version of my app that I was trying to replace depended on the same vulnerable package versions as the patched version I was trying to push out to fix my urgent bug. So in fact the release added zero additional security risk to what was already serving.
Okay so what were the security vulnerabilities that were so dangerous that Railway.com's nanny system engaged to block me from pushing out an urgent fix? They cited two HIGH risk CVEs: this one and this one. They also cited a MEDIUM risk CVE but I'm not going to cover that.
The HIGH risk CVEs are both DOS issues. Attackers can craft payloads that send the server into an infinite loop, sucking up CPU and denying service.
Do I want to fix those CVEs for my service? Absolutely! Would I like my hosting provider to shoot me an email if they detect that my app has those vulnerabilities? Yes, please!
Do I want to be blocked from pushing urgent fixes to my app when there's zero evidence that either of those CVEs is being exploited against my app in any way? I think not.
I want to push out my fix, go to bed, and then come back the next day and upgrade package versions and fix the CVEs.
Now, look, was it a huge deal to upgrade those packages? Not really. In this case I was lucky that it was a straightforward upgrade. But anyone who's used a package manager knows about dependency hell, where trying to upgrade one package leads to a cascade of other dependencies needing to be updated, possibly with conflicts in those dependencies. And even if there is no dependency hell, any time package versions are changed, some degree of testing is warranted to make sure nothing was broken.
These are things that I did not want to even remotely think about during an urgent incident late in the evening, especially not for package versions that were already live in production. It took a stressful experience and added a huge amount of additional frustration.
Railway sometimes uses the mottos, "Ship without friction," and "Let the engineers ship." This felt like friction. I was literally not allowed to ship.
I complained about this to Railway, and they basically said they need to do this to protect neighboring apps from problems caused by my app. I was under the impression that this was the purpose of containers, but what do I know.
I'm guessing Railway has had issues with apps on their free tier wasting CPU resources, costing them money, and found that some free-tier apps succumbed to these infinite-loop CVEs, wasting free vCPUs. But I'm not on the free tier -- I pay for what I use.
I've been a huge fan of Railway for hosting this site. It really is simple and fast, and that's been perfect for such a simple site. I don't need a lot of fancy features, and Railway has let me deploy my code easily without worrying about the details. But this security overreach gives me pause, so this is a great time to look at other hosting providers.
I talked to the founder of Railway's competitor Northflank about this issue, and his reply was, "We would never block you from deploying if we can build or access a valid container image." To me that's the right answer.
Bird's eye view
When I worked at Google, for a long time I was involved with production reliability/safety. I wasn't an SRE, but I worked on services that were in the critical path for ads revenue. Incidents often cost us untold millions of dollars, so reliability was a huge concern.
I wrote and reviewed many dozens of postmortem analyses, and I learned a lot about the typical ways that systems fail. I was in 24x7 oncall rotations for my entire career at Google, and during the latter years I was oncall as an Incident Commander inside ads, running the responses to (very) large incidents.
Google's ads systems were mostly pretty mature, and contained lots of safety checks, etc. I always found it really fascinating to see the ways that a serious outage would slip through all the defenses.
One pattern I came to recognize was that when deciding on the action items / follow-ups for an incident postmortem, the go-to reaction was always to add another safety check of some kind. This makes sense! Something went wrong, and it slipped past our defenses, so more defenses should help.
However, I came to view adding more and more defenses as an anti-pattern. There's a baseline level of safety checks that makes a system safer, but if you are not careful, adding more safety checks actually increases risk, because the safety checks themselves start to be the cause of new problems. There's a great word for this, by the way: iatrogenesis.
I'll give an example. I worked in (anti) ad fraud, and we had systems that would block ad requests that looked fraudulent or scammy. Bugs in these systems were extraordinarily dangerous, because in the worst case we could turn off all advertising revenue globally. (Yes, that is a thing that happened. More than once. I have stories. Buy me a beer sometime...)
These systems accumulated many safety checks over the years. A typical thing would be to monitor the amount of revenue being blocked for fraud reasons, and alert a human if that amount grew too quickly.
But in some cases, these safety checks were allowed to automatically disable the system as a fail-safe, if things looked bad enough. The problem is... there's not really any such thing as "failing safe" here. For a spam filter, failing closed means revenue loss, and failing open means allowing fraud to get through.
So imagine that a script kiddie sits down and writes a bash loop to wget an ad click URL 10,000 times per second. The spam filtering systems trivially detect this as bot traffic, and filter it out so that advertisers aren't paying for these supposed clicks.
But maybe some safety check system sees that this trivial spam filter has suddenly started blocking hundreds of thousands of dollars of revenue. That's crazy! The filter must have gone haywire! We better automatically disable the filter to stop the bleeding!
Uh oh. Now those script kiddie requests are being treated as real ad traffic and billed to advertisers. We have ourselves here a major incident, caused by the safety check that was meant to protect us, created as the result of a previous postmortem analysis.
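The failure mode is easy to demonstrate with a toy model. To be clear, this is a made-up simulation with invented numbers, not anything resembling the actual systems: a filter blocks bot clicks until a "safety check" decides the filter is blocking too much revenue and disables it, at which point the bot traffic gets billed.

```python
def run_click_stream(clicks, blocked_alert_threshold):
    """Toy model: clicks is a list of (revenue, is_bot) pairs."""
    billed = 0.0
    blocked = 0.0
    filter_enabled = True
    for revenue, is_bot in clicks:
        if filter_enabled and is_bot:
            blocked += revenue
            # The "safety check": the filter is suddenly blocking a lot
            # of revenue, so assume it went haywire and disable it.
            if blocked > blocked_alert_threshold:
                filter_enabled = False  # fails open, not "safe"
        else:
            billed += revenue
    return billed, blocked

# A script kiddie's all-bot click stream:
bot_storm = [(1.0, True)] * 100

# With no auto-disable, none of the bot traffic gets billed...
assert run_click_stream(bot_storm, float("inf"))[0] == 0.0
# ...but with the auto-disable "safety check", most of it does.
billed, blocked = run_click_stream(bot_storm, 10.0)
```

The "safety check" converts a non-event (a filter correctly doing its job) into a billing incident.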
I'm not saying that all safety checks are bad. You certainly need some. But what I am saying is that if you allow extra safety checks to become the knee-jerk reaction to incidents, you will eventually end up with a complex mess where the safety checks themselves become the source of future incidents. You build a Rube Goldberg Safety Machine, but it doesn't work. The marble flies off the rails.
It's really attractive to just keep adding safety checks, but what is actually required is to take a step back, look at the system holistically, and think about simple ways to make it safer. Generally after a few basic safety checks are in place, further safety comes from redesigning the system itself. Or even better, rethinking the problem that you're trying to solve altogether. Maybe there's a completely different approach that side-steps an entire category of reliability issues.
Working better on some Radeon chips
Captain's Log: Stardate 79013.9
The issue with Radeon
As discussed in a previous post, I've been fighting with Radeon mobile chips, specifically the gfx90c. The problem originally presented with a user who had both an NVIDIA and a Radeon chip, and even though they were using the NVIDIA chip for Anukari, somehow in the 0.9.6 release something changed that caused the Radeon drivers to crash internally (i.e., the drivers did not return an error code; they simply aborted the process).
I'd like to eventually offer official support for Radeon chips. That's still likely a ways off, but at the very least I don't want things crashing. Anukari is extremely careful about how it interacts with the GPU, and when a particular GPU is not usable, it should preferably just pick a different GPU, or at the very least show a helpful message in the GUI explaining the situation.
Unfortunately it was difficult to debug this issue remotely. The user was kind enough to run an instrumented binary that confirmed that Anukari was calling clBuildProgram() with perfectly valid arguments, and it was simply aborting. I really needed to run Anukari under a debugger to learn more.
So I found out what laptop my bug-reporting user had, and ordered an inexpensive used Lenovo Ideapad 5 on eBay. I've had to buy a lot of testing hardware, and I've saved thousands of dollars by buying it all second-hand or refurbished. In this case it did take two attempts, as the first Ideapad 5 I received was super broken. But the second one works just fine.
Investigation
After getting the laptop set up and running Anukari under the MSVC debugger, I immediately saw debug output like this just prior to the driver crash:
LLVM ERROR: Cannot select: 0x1ce8fdea678:
ch = store<ST2[%arrayidx_v41102(align=4)+2](align=2),
trunc to i16> 0x1ce8fe462a8, 0x1ce8fde8fb8, 0x1ce8fe470b8,
undef:i32
0x1ce8fde8fb8: f32,ch = load<LD2[%1(addrspace=1)(align=8)+294]
(align=2)> 0x1ce8fdd7638, 0x1ce8fde8b80, [...]
First of all, I want to call out AMD on their exceptionally shoddy driver implementation. It's just absurd that they'd allow a compilation error internal to the driver to abort the whole process. Clearly in this case clBuildProgram() should return CL_BUILD_PROGRAM_FAILURE, and the program log (the compiler error text) should be filled with something helpful, at a minimum, the raw LLVM output, but preferably something more readable. This is intern-level code, in a Windows kernel driver. Wow.
After reading through this carefully, all I could really make of it was that LLVM was unable to find a machine instruction to read data from this UpdateEntitiesArguments struct in addrspace=7 and write it to memory in addrspace=1. From context, I could guess that addrspace=1 is private (thread) memory, and addrspace=7 is whatever memory the kernel arguments are stored in. I had a harder time understanding why it couldn't find such an instruction. I thought maybe it had to do with an alignment problem, but wasn't sure.
This struct contains a number of fields, and I couldn't tell from the error which field was the problem. So I just used a brute-force approach and commented out most of the kernel code, and added code back in slowly. It compiled fine until I uncommented a line of code like float x = arguments.field[i]. I did some checking to ensure that field was aligned in a sane way, and after confirming that, I came to the conclusion that the gfx90c chip simply does not have an instruction for loading memory from addrspace=7 with a dynamic offset. In other words, the gfx90c appears to lack the ability to address arrays in argument memory with a non-constant offset.
Which, as far as I can tell, means that the gfx90c really doesn't support OpenCL properly. Every other OpenCL implementation I've used can do this, including NVIDIA, Intel Iris, Apple, and even newer Radeon chips like the gfx1036. I don't see anything in the OpenCL specification that would indicate that this is a limitation.
But even assuming that it's somehow within specs for an OpenCL implementation not to support this feature, obviously aborting in the driver is completely unreasonable behavior. Again, this is a really shoddy implementation, and when people ask about why Anukari doesn't yet officially support Radeon chips, this is the kind of reason that I point to. The drivers are buggy, and worse they are inconsistent across the hardware.
The (very simple) workaround
Anyway, I have very good (performance) reasons for storing some small constant-size arrays (with dynamic indexes) in kernel arguments, but those reasons really apply more to the CUDA backend. So I made some simple changes to Anukari to store these small arrays in constant device memory, and the gfx90c implementation now works just fine.
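For the curious, the shape of the change looks roughly like the following. This is an OpenCL C sketch with made-up names — it is not Anukari's actual kernel code — just to illustrate the pattern that crashed versus the one that works:

```c
/* Before: a small fixed-size array passed by value in the kernel
 * argument struct, indexed with a non-constant offset.  On gfx90c the
 * driver's compiler aborts the process when it hits the dynamic index
 * into argument memory (addrspace=7). */
typedef struct { float field[8]; } UpdateEntitiesArguments;

__kernel void update_entities_bad(UpdateEntitiesArguments arguments) {
  int i = (int)(get_global_id(0) % 8);  /* dynamic index */
  float x = arguments.field[i];         /* LLVM ERROR on gfx90c */
  /* ... use x ... */
}

/* After: the same array moved into constant device memory, which every
 * chip tested so far can index dynamically. */
__kernel void update_entities_ok(__constant const float *field) {
  int i = (int)(get_global_id(0) % 8);
  float x = field[i];                   /* fine on gfx90c */
  /* ... use x ... */
}
```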
Given that I recently upgraded my primary workstation to a very new AMD Ryzen CPU, I now have two Radeon test chips: the gfx90c in the Ideapad 5, and the gfx1036 that's built into my Ryzen. The Anukari GPU code appears to work flawlessly on both, though it doesn't perform all that well on either. Next up will be doing more testing of the Vulkan graphics, which have also been a pain point in the past on Radeon chips.
Multichannel, ASIO, Radeon, and randomization
Captain's Log: Stardate 79000.1
Whoa, it's been way too long since I updated the devlog. Here goes!
2025 MIDI Innovation Awards
Really quickly: Anukari is an entry in the 2025 MIDI Innovation Awards, and I would really appreciate your vote. You can vote on this page by entering your email and then navigating to the Software Prototypes/Non-Commercial Products category and scrolling way down to find Anukari. You have to pick 3 products to vote in that category. (I wish I could link to the vote page directly, but alas, it's not built that way.)
The prize for winning would be a shared booth for Anukari at the NAMM trade show, which would be a big deal for getting the word out.
Multichannel I/O Support
A while back, Joe Williams from CoSTAR LiveLab reached out to me asking if Anukari had multichannel output support. Evidently the UK government is investing in the arts, which, to an American, is a pretty (literally) foreign concept. One of the labs working on promoting live performance is LiveLab, and they have a big 28-channel Ambisonic dome. Joe saw Anukari and thought it would be cool to create an instrument with 28 mics outputting to those 28 speaker channels.
I'd received several requests for multichannel I/O, but hadn't yet prioritized the work. The LiveLab use case is really cool, though, and Anukari will be featured in a public exhibit later this month, so I decided to prioritize the multichannel work.
Anukari now supports 50x50 input/output channels. In the standalone app, this is really simple: you just enable however many channels your interface supports, and then inside Anukari you assign each audio input exciter or mic to the channels you want.
It also works for the plugin, but how you utilize multichannel I/O is very DAW-dependent. Testing the new feature was kind of a pain in the butt, because I have about 15 DAWs for testing, and multichannel is a bit of an advanced feature, so I ended up watching a zillion tutorial videos. Every DAW approaches it a bit differently, and the UX is generally somewhat buried since it's a niche feature. But it works everywhere, and it is extremely cool to be able to map a bunch of mics to their own DAW tracks and give them independent effects chains and so on.
Behind the scenes, it was really important to me that the multichannel support did not impact performance, especially when it was not in use. I'm very happy to say I achieved this goal. When you're not using multichannel I/O, there is zero performance impact. And even in 50x50 mode the impact is very low. Anukari is well-suited for multichannel I/O since each mic is tapping into the same physics simulation at different points/angles, so none of the physics computations have to be repeated/duplicated. Really the only overhead is copying additional buffers into and out of the GPU. On the Windows CUDA backend, that's a single DMA memcpy, which is very fast. And on the macOS Metal backend, it's unified memory, so no overhead at all. All that remains is the CPU-CPU copy into the DAW audio buffers, which is very, very fast.
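Conceptually, the per-block routing is nothing more than the following. This is a Python sketch with invented names, not the actual CUDA/Metal code, which does these copies on buffers coming off the GPU:

```python
def route_mic_buffers(mic_buffers, channel_map, num_channels, block_size):
    """Copy each mic's rendered block into its assigned output channel.

    mic_buffers: dict of mic name -> list of samples for this block
    channel_map: dict of mic name -> output channel index
    """
    out = [[0.0] * block_size for _ in range(num_channels)]
    for mic, samples in mic_buffers.items():
        ch = channel_map[mic]
        for i, s in enumerate(samples):
            out[ch][i] += s  # sum if two mics share one channel
    return out
```

Since every mic taps the same physics simulation, the simulation cost is paid once, and this copy is the only per-channel overhead.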
I look forward to posting about the LiveLab exhibit once it happens.
ASIO Support
It's a pretty big oversight that the Windows version of the Anukari Beta launched without ASIO support. I'm not quite sure how I missed this important feature, but I've added it now.
I think I always assumed it was there, but when using JUCE the ASIO support is not enabled by default because you need to get a countersigned agreement from Steinberg to use their headers to integrate with ASIO. I already had a signed agreement with them for the VST3 support, but ASIO is a completely separate legal agreement and so I went through the steps to get that as well.
ASIO support makes the standalone app perform much better (in terms of latency) for people with ASIO compatible sound interfaces.
AMD Radeon Crashes
Officially speaking, Anukari explicitly does not support AMD Radeon hardware. This is a bit of a long story, which at some point I will write about in more detail. But the short version is that the Radeon drivers are incredibly inconsistent across the Radeon hardware lineup, which makes it extremely difficult to offer full support. For some Radeon users, Anukari works perfectly, and for others it is unstable, glitchy, or crashes, in many different unique ways.
The story I'll write about for this devlog entry, though, is the extremely frustrating case that I solved for users that have both an AMD Radeon and an NVIDIA graphics card in the same machine. This is actually a common situation, because many (most? all?) AMD Ryzen CPUs include integrated Radeon graphics on the CPU die. So for example there are a lot of laptops that come with an NVIDIA graphics card, but also have a sort of "vestigial" Radeon in the CPU that is normally not used for anything.
In the past, Anukari just worked for users with this configuration, since when it detects multiple possible GPUs to use for the simulation, it would automatically select the CUDA one as the default. However in the 0.9.6 release, Anukari began crashing instantly at startup for these users.
This was pretty confusing, because I have comprehensive fuzz and golden tests that exercise all the physics backends (CUDA, OpenCL, Metal). These tests abuse the simulation to an extreme extent, and I run them under various debugging/lint tools to make sure that there are no GPU memory errors, etc. And across my NVIDIA, macOS, and Intel Iris chips, they all work perfectly.
Luckily I had a user who was extremely generous with their time to help me debug the issue. I sent them instrumented Anukari binaries, and eventually was able to pinpoint that it was crashing inside the clBuildProgram() call.
Now, you might think that what I mean is that clBuildProgram() was returning an error code, and I was somehow not handling it. No, Anukari is extremely robust about error checking. I mean it was crashing inside the driver, and clBuildProgram() was never returning at all because the process aborted. This is with perfectly valid arguments to the function. So, obviously, this is a horrible bug in the AMD drivers. Even if the textual content of the kernel has e.g. a syntax error, clearly clBuildProgram() should return an error code rather than crash.
The really fun part is that I've only seen this crash on the hardware identifying as gfx90c. On other Radeons, this does not happen (though some of them fail in other ways). This is what I mean about the AMD drivers being extremely inconsistent.
Now, as to why this crash happened at startup, it's because during device discovery Anukari was compiling the physics kernel, and any device where compilation failed would be assumed incompatible and omitted from the list of possible backends. I added this feature after encountering other broken OpenCL implementations like the Microsoft OpenCL™, OpenGL®, and Vulkan® Compatibility Pack which is an absolute disaster.
So the workaround for now is that Anukari no longer does a test compilation to detect bad backends. This resolves the issue, although if the user manually chooses the Radeon backend on gfx90c it will unrecoverably crash Anukari.
Longer-term, given the Radeon driver bugs, I doubt I'll ever be able to fully support gfx90c, but I ordered a cheap used laptop off eBay with that chip in it so that I can at least narrow down what OpenCL code is causing the driver to crash. I know that it's something the driver doesn't like about the OpenCL code because it did not always crash, and the only difference in the meantime has been some improvements to that code. Hopefully I can find a workaround to avoid the driver bug, but if not I might add a rule in Anukari to ignore all gfx90c chips.
(Side-note: actually the first used laptop with a gfx90c chip that I bought off eBay was bluescreening at boot, so I had to buy a second one. These inexpensive Radeon laptops are really bad.)
Not all hope is lost for Radeon support. I recently upgraded my main development machine, and the Ryzen CPU I bought has an on-die Radeon, and it works flawlessly with Anukari. So maybe what I will be able to do one day is create an allow-list for Radeon devices that work correctly without driver issues. Sigh. It is so much easier with NVIDIA and Apple.
Parameter Randomization
Unlike the features above, this one hasn't been released yet, but I recently completed work to allow parameters to be randomized.
For Anukari this turned out to be a bit of a design challenge, since the sliders that are used to edit parameters are a bit complex already. The tricky bit is that if the user has a bunch of entities selected, the slider edits them all. And if the parameter values for each entity vary, the slider turns into a "range editor" which can stretch/squeeze/slide the range of values.
So the randomize button needs to handle both the "every selected object has the same parameter value" and "the parameter varies" scenarios. For the first scenario with a singleton value, it's simple: pressing the button just picks a random value across the full range of the parameter and assigns it to all the objects.
But for the "range editor" scenario, what you really want is for the randomize button to pick different random values for each entity, within the range that you have chosen. There's one tricky issue here, which is that it is very normal for the user to want to mash the randomize button repeatedly until they get a result they like. This will result in the range of values shrinking each time (since it's very unlikely that the new random values will have the same range as before, and the range can only be smaller)!
So the slider needs to remember the original range when the user started mashing the randomize button, and to reuse that original range for each randomization. This allows button mashing without having the range shrink to nothing. It's important, though, that this remembered range is forgotten when the user adjusts the slider manually, so that they can choose a new range to randomize within.
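Here's a small Python sketch of that "remembered range" behavior. The names are invented and it only models the range-editor case, not the singleton case described above:

```python
import random

class RangeSliderSketch:
    """Models randomize-button mashing without the range collapsing.

    Not Anukari's actual slider code -- just the remembered-range idea.
    """

    def __init__(self, values):
        self.values = list(values)    # one value per selected entity
        self.remembered_range = None  # survives repeated randomization

    def set_range(self, lo, hi):
        """Manual adjustment: rescale values and forget the old range."""
        self.remembered_range = None
        old_lo, old_hi = min(self.values), max(self.values)
        span = (old_hi - old_lo) or 1.0
        self.values = [lo + (v - old_lo) / span * (hi - lo)
                       for v in self.values]

    def randomize(self, rng=random):
        """Pick a new random value per entity within the remembered
        range, so mashing the button doesn't shrink the range."""
        if self.remembered_range is None:
            self.remembered_range = (min(self.values), max(self.values))
        lo, hi = self.remembered_range
        self.values = [rng.uniform(lo, hi) for _ in self.values]
```

No matter how many times randomize() is called, the values stay within the range that existed when the mashing started; a manual set_range() resets that memory.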
Another kind of weird case is when the slider is currently in singleton mode, meaning that all the entities have the same parameter value, and the user wants to spread them out randomly over a range. This could be done by deselecting the group of entities, selecting just one of them, changing its value, then reselecting the whole group, which would put the slider into range mode. But that's awfully annoying.
I ended up adding a feature where you can now right click on a singleton slider, and it will automatically be split into a range slider. The lower/upper values for the range will be just slightly below/above the singleton value, and the values will be randomly distributed inside that range. So now you can just right click to split, adjust the range, and mash the randomize button.
The Audio Units logo and the Audio Units symbol are trademarks of Apple Computer, Inc.
VST is a trademark of Steinberg Media Technologies GmbH, registered in Europe and other countries.