Railway.com knows better than you
Captain's Log: Stardate 79429.5
[EDIT1] Railway responded to this post in an HN comment. Quoting from their reply, "I wished the majority of our userbase knew better than us, but the reality is they don't."
[EDIT2] Railway's founder replied in a separate HN comment.
My honeymoon period with Railway.com for hosting this website is over.
I recently ran into a bug in my own code that caused some data corruption in my DB. The consequences of the bug were that a user who purchased Anukari could get emailed a license key that had already been sent to another user. This is really bad! Fortunately I noticed it quickly, and the fix was simple. I got the code prepared and went to do an emergency release to push out the fix.
It's worth noting that the bug had been around for a while, but it was latent until a couple days ago. So the fix was not a rollback, which would typically be the ideal way to fix a production issue quickly. The only solution was to roll forward with a patch.
Imagine my surprise when, during my emergency rollout, the release failed with:
==================================================================
SECURITY VULNERABILITIES DETECTED
==================================================================
Railway cannot proceed with deployment due to security
vulnerabilities in your project's dependencies.
Keep in mind that I'm super stressed about the fact that there is a severe and urgent problem with my app in production, which is causing me problems in real time. I've dropped everything else I was doing, and have scrambled to slap together a safe fix. I want to go to bed.
And my hosting provider is saying, "you are not allowed to push out your urgent fix, because we see that your app contains a far less urgent problem." There is no button that says "I understand, proceed anyway." Railway knows best.
I'll get back to the actual security vulnerabilities Railway detected, but let's ignore that for the moment and talk about the fact that Railway has just intentionally blocked me from pushing out a release, based on their assessment of the security risk. But how can that possibly make sense? Railway cannot know how urgent my release is. Railway cannot know whether the security vulnerabilities they detected in my app are even exploitable. They're just looking at node package versions marked as vulnerable.
The most ridiculous part of this is that the current production version of my app that I was trying to replace depended on the same vulnerable package versions as the patched version I was trying to push out to fix my urgent bug. So in fact the release added zero additional security risk to what was already serving.
Okay so what were the security vulnerabilities that were so dangerous that Railway.com's nanny system engaged to block me from pushing out an urgent fix? They cited two HIGH risk CVEs: this one and this one. They also cited a MEDIUM risk CVE but I'm not going to cover that.
The HIGH risk CVEs are both DoS (denial-of-service) issues. Attackers can craft payloads that send the server into an infinite loop, sucking up CPU and denying service.
Do I want to fix those CVEs for my service? Absolutely! Would I like my hosting provider to shoot me an email if they detect that my app has those vulnerabilities? Yes, please!
Do I want to be blocked from pushing urgent fixes to my app when there's zero evidence that either of those CVEs are being exploited for my app in any way? I think not.
I want to push out my fix, go to bed, and then come back the next day and upgrade package versions and fix the CVEs.
Now, look, was it a huge deal to upgrade those packages? Not really. In this case I was lucky that it was a straightforward upgrade. But anyone who's used a package manager knows about dependency hell, where trying to upgrade one package leads to a cascade of other dependencies needing to be updated, possibly with conflicts in those dependencies. And even if there is no dependency hell, any time package versions are changed, some degree of testing is warranted to make sure nothing was broken.
These are things that I did not want to even remotely think about during an urgent incident late in the evening, especially not for package versions that were already live in production. It took a stressful experience and added a huge amount of additional frustration.
Railway sometimes uses the mottos "Ship without friction" and "Let the engineers ship." This felt like friction. I was literally not allowed to ship.
I complained about this to Railway, and they basically said they need to do this to protect neighboring apps from problems caused by my app. I was under the impression that this was the purpose of containers, but what do I know.
I'm guessing Railway is having issues with apps on their free tier wasting CPU resources, costing them money, and found that some free-tier apps had succumbed to these infinite-loop CVEs, wasting free vCPUs. But I'm not on the free tier -- I pay for what I use.
I've been a huge fan of Railway for hosting this site. It really is simple and fast, and that's been perfect for such a simple site. I don't need a lot of fancy features, and Railway has let me deploy my code easily without worrying about the details. But this security overreach gives me pause, so this is a great time to look at other hosting providers.
I talked to the founder of Railway's competitor Northflank about this issue, and his reply was, "We would never block you from deploying if we can build or access a valid container image." To me that's the right answer.
Bird's eye view
When I worked at Google, for a long time I was involved with production reliability/safety. I wasn't an SRE, but I worked on services that were in the critical path for ads revenue. Incidents often cost us untold millions of dollars, so reliability was a huge concern.
I wrote and reviewed many dozens of postmortem analyses, and I learned a lot about the typical ways that systems fail. I was in 24x7 oncall rotations for my entire career at Google, and during the latter years I was oncall as an Incident Commander inside ads, running the responses to (very) large incidents.
Google's ads systems were mostly pretty mature, and contained lots of safety checks, etc. I always found it really fascinating to see the ways that a serious outage would slip through all the defenses.
One pattern I came to recognize was that when deciding on the action items / follow-ups for an incident postmortem, the go-to reaction was always to add another safety check of some kind. This makes sense! Something went wrong, and it slipped past our defenses, so more defenses should help.
However, I came to view adding more and more defenses as an anti-pattern. There's a baseline level of safety checks that makes a system safer, but if you are not careful, adding more safety checks actually increases risk, because the safety checks themselves start to be the cause of new problems. There's a great word for this, by the way: iatrogenesis.
I'll give an example. I worked in (anti) ad fraud, and we had systems that would block ad requests that looked fraudulent or scammy. Bugs in these systems were extraordinarily dangerous, because in the worst case we could turn off all advertising revenue globally. (Yes, that is a thing that happened. More than once. I have stories. Buy me a beer sometime...)
These systems accumulated many safety checks over the years. A typical thing would be to monitor the amount of revenue being blocked for fraud reasons, and alert a human if that amount grew too quickly.
But in some cases, these safety checks were allowed to automatically disable the system as a fail-safe, if things looked bad enough. The problem is... there's not really any such thing as "failing safe" here. For a spam filter, failing closed means revenue loss, and failing open means allowing fraud to get through.
So imagine that a script kiddie sits down and writes a bash loop to wget an ad click URL 10,000 times per second. The spam filtering systems trivially detect this as bot traffic, and filter it out so that advertisers aren't paying for these supposed clicks.
But maybe some safety check system sees that this trivial spam filter has suddenly started blocking hundreds of thousands of dollars of revenue. That's crazy! The filter must have gone haywire! We better automatically disable the filter to stop the bleeding!
Uh oh. Now those script kiddie requests are being treated as real ad traffic and billed to advertisers. We have ourselves here a major incident, caused by the safety check that was meant to protect us, created as the result of a previous postmortem analysis.
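Just to make the shape of the failure concrete, here's a toy sketch of that kind of auto-disabling safety check, in C++. All the names are invented and this resembles no real system; it only illustrates how "fail safe" quietly becomes "fail open."

// Deliberately simplified illustration of the anti-pattern -- invented names,
// not any real system.
#include <atomic>

struct SpamFilterGuard {
    std::atomic<bool> filterEnabled{true};

    // Updated by monitoring: dollars of ad revenue blocked per minute.
    double blockedRevenuePerMinute = 0.0;

    // Above this rate, the check assumes the filter itself must be broken.
    double alarmThresholdPerMinute = 100000.0;

    // Called periodically by a watchdog job.
    void checkHealth() {
        if (blockedRevenuePerMinute > alarmThresholdPerMinute) {
            // "The filter must have gone haywire" -- but maybe it is just
            // doing its job against a flood of bot clicks. Disabling it here
            // fails open: the bot traffic is now billed to advertisers.
            filterEnabled = false;
        }
    }
};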
I'm not saying that all safety checks are bad. You certainly need some. But what I am saying is that if you allow extra safety checks to become the knee-jerk reaction to incidents, you will eventually end up with a complex mess where the safety checks themselves become the source of future incidents. You build a Rube Goldberg Safety Machine, but it doesn't work. The marble flies off the rails.
It's really attractive to just keep adding safety checks, but what is actually required is to take a step back, look at the system holistically, and think about simple ways to make it safer. Generally after a few basic safety checks are in place, further safety comes from redesigning the system itself. Or even better, rethinking the problem that you're trying to solve altogether. Maybe there's a completely different approach that side-steps an entire category of reliability issues.
Preparing for the open (paid) Beta
Captain's Log: Stardate 78674.7
It's been a while since I wrote a devlog update, but I promise it's not for a lack of stuff to talk about. Rather, it's the opposite: I've been so busy with Anukari work I haven't had time to write up a devlog entry!
Anukari is feeling extremely stable at this point, and I'm up to about 85 factory presets. I did a pass through all the existing presets and made sure that all of them have about 2 DAW host automation parameters, which made them dramatically more fun to play with. And, in an exciting development, I've contracted with someone awesome to make more presets, along with tutorial videos and a few other things.
With someone else working on presets and tutorials, I've been free the last week or two to work on preparing for the open Beta. This will be the first public (non-invitation) release of Anukari, which is extremely exciting. While there is still a little work left in the Anukari product itself, mostly revolving around implementing the free trial mode, adding some first-time startup screens, etc., what I've been working on recently is getting the website ready.
The biggest thing was figuring out how I want to do payments. I toyed with a bunch of possibilities, but in the end I settled on using Shopify. The biggest part of my rationale is that, at least in the US, Shopify has become so ubiquitous that I think when I direct my customers to the Shopify checkout flow, they'll be very comfortable with it, and will probably trust Shopify with their credit card number much more than if I implemented CC# collection myself.
But also, using Shopify saves me from a bunch of other tedious work, like implementing admin dashboards for looking at order history, issuing refunds, and so on. The basic integration with Shopify really only took a couple of days.
Overall I'm happy enough with Shopify. But of course I have complaints. My biggest gripe with Shopify has to do with its developer ecosystem. It seems to me that a large part of their success is their "partner program," which is a Shopify-supported system for software developers to offer their services to businesses that want Shopify storefronts. So far so good -- this makes a ton of sense, as most local/small businesses that use Shopify are not going to have software engineers to do this stuff.
However, my gripe has to do with the app ecosystem. Shopify developers can create apps that add helpful features to a store, which shop owners can then install. So for example, there's an app that adds a warning when too many of one item are added to the cart. Many of the apps are extremely simple stuff like that. Which sounds fine, right?
It is not fine. The problem is that since developers sell these Shopify apps on Shopify's app store, Shopify gets a cut. And all the apps I've seen have a recurring subscription payment. The simple 10-lines-of-code app I described above is $6/month. So, because Shopify gets a cut of that $6/month, they are hugely disincentivized to add these kinds of simple features to the core platform. They are also disincentivized to make the core platform easier to use in general, because then maybe users would solve their own problems without paying $6/month to solve them.
Obviously for me this is not such a big problem, since I'm a software engineer and can mostly do these things myself. Though, I'd really prefer not to have to learn any more about Shopify's reprehensible template system than I have already. But for local/small business owners, Shopify's app store is just going to nickel-and-dime them to death. I can easily see a specialty business having to pay subscriptions for a handful of apps (many of which cost a lot more than $6/month) just to do the basic things they need.
Anyway, that's enough ranting about Shopify. I have things pretty much working, so I shouldn't complain too much.
Writing bad crypto code
Captain's Log: Stardate 78089.3
Today I tried to slam out as much work as I could on the lower-level C++ code to do all the license activation stuff. Since I got all the PITA stuff done yesterday to get OpenSSL, libcurl, etc. working, today was actually fairly productive. The C++ code now has APIs to call the license API server to register a new license key, save the signed key to disk, check if the key is still valid, and so on.
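For concreteness, here's a rough sketch of what that API surface might look like. Every identifier below is invented for illustration -- the real Anukari code isn't public -- and it assumes a simple HTTPS license server.

// Hypothetical interface only; all names are invented.
#include <optional>
#include <string>

namespace licensing {

struct SignedLicense {
    std::string licenseKey;   // the key the user entered
    std::string machineId;    // identifies this particular install
    std::string signature;    // server's signature over the fields above
};

// Contacts the license API server to activate `key` for this machine.
// Returns the signed license on success, std::nullopt on failure.
std::optional<SignedLicense> activateLicense(const std::string& key);

// Persists the signed license to the user's application data directory.
bool saveLicenseToDisk(const SignedLicense& license);

// Verifies the stored signature locally (e.g. against an embedded public
// key) and checks that the license hasn't expired or been revoked.
bool isLicenseValid(const SignedLicense& license);

}  // namespace licensing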
I find all this anti-piracy stuff pretty annoying/frustrating, since none of the work makes the plugin more fun to use. And one thing I want to make VERY sure of is that it never screws over someone who has paid for the software. So I have been thinking a lot about testing the license activation code. I wrote a few unit tests, but eventually got to the HTTP parts, and mocking out the HTTP responses just didn't feel great.
The solution I came to was to just write a bunch of integration tests that have the C++ license activation/verification code call the production license API server. I added some hidden/test-only APIs that allow the integration tests to reset the server state, and created a test-only license key. So the tests actually go through all the real flows a user will go through, and make sure the C++ code and the API server talk to one another correctly, and that the API database reflects the correct state, etc.
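As a sketch of what one of these tests might look like (assuming a GoogleTest-style framework and the invented licensing API from the sketch above -- the reset helper and test key here are stand-ins, not the real endpoints):

// Hypothetical integration test; names are invented for illustration.
#include <gtest/gtest.h>
#include <string>

// Stand-in for the hidden, test-only server API that wipes any prior
// activations of the dedicated test license key.
bool testOnlyResetServerState(const std::string& testKey);

TEST(LicenseIntegrationTest, ActivateThenVerifyRoundTrip) {
    ASSERT_TRUE(testOnlyResetServerState("TEST-ONLY-LICENSE-KEY"));

    // Exercise the real client code against the real API server.
    auto license = licensing::activateLicense("TEST-ONLY-LICENSE-KEY");
    ASSERT_TRUE(license.has_value());

    EXPECT_TRUE(licensing::saveLicenseToDisk(*license));
    EXPECT_TRUE(licensing::isLicenseValid(*license));
}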
I'm quite happy with this testing solution, because it gives me a ton of confidence that the client/server interactions all work the way they should. And since nothing is mocked, it's a really faithful test of what the actual binary will do.
Tomorrow I'll start on the C++ GUI for entering the license key and so on.