Way more detail than you ever wanted to know about the development of the Anukari 3D Physics Synthesizer

This website is now on Google Cloud

Captain's Log: Stardate 79462.5

Following up on my last post about issues with Railway for hosting this website, as the title of this post says, it's now running on Google Cloud.

Railway's simplicity was wonderful. Getting a basic web service up and running was incredibly easy, and the UX was clear and straightforward. Railway doesn't try to do everything, which means it isn't massively complicated. And there's really just one way to do each of the things it does, so you don't waste time comparing a bunch of options.

But in the end, I'm simply never going to tolerate my infrastructure provider telling me what software I'm allowed to run. Railway has since walked this back a bit and now allows their software-version audit checks to be bypassed, but I've learned something about their philosophy as an infrastructure provider, and I want nothing to do with it.

Trying Northflank Hosting

At first I thought maybe I'd just move to one of Railway's direct competitors such as Northflank or Fly.io. I tried Northflank first, and it simply did not work. I got the basic web service configured, which was very easy, though a little harder than Railway because Northflank has far more features to navigate. But in the absolute most basic configuration, external requests to my service were returning HTTP 503 about 30% of the time.

I emailed Northflank's support about the 503 issue, and the CEO replied (I'm not sure whether that's encouraging or worrying). He suggested that I bind my service to 0.0.0.0 instead of the default Node setup, which was binding to a specific interface. That kind of made sense to me: if there are multiple interfaces and requests may come in on any of them, obviously I need to listen on all of them. I made the change, but it didn't help.
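For reference, the change amounts to passing an explicit host to listen(). A minimal sketch (this is not my actual server, just the shape of the change):

import http from 'node:http';

const port = Number(process.env.PORT ?? 3000);

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'content-type': 'text/plain' });
  res.end('ok\n');
});

// Bind to all interfaces so the platform's proxy can reach the container
// no matter which interface it routes traffic to.
server.listen(port, '0.0.0.0', () => {
  console.log(`listening on 0.0.0.0:${port}`);
});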

I decided to do some debugging on my own, and from what I can tell, Northflank's proxy infrastructure was not working correctly. I did an HTTP load test by running autocannon locally in my container, and even under extremely heavy load my Node server was returning 100% HTTP 200s. I then used Northflank's command line tool to forward my HTTP port from the container to my workstation, and ran autocannon on my workstation. Again, this produced 100% HTTP 200s. Finally I ran autocannon against the public Northflank address for the server, and it was about 80% HTTP 503s. My guess is that some aspect of Northflank's internal proxy configuration just got set up wrong. The 503 responses included a textual error message that appears to come from Envoy, which I believe is the proxy that Northflank uses internally.
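The test itself was nothing fancy; it was the equivalent of the following (autocannon also has a programmatic API, shown here, though a one-line CLI invocation does the same thing). The URL is a placeholder; I pointed it at localhost inside the container, at the port-forwarded address, and at the public address in turn:

import autocannon from 'autocannon';

async function main() {
  // Placeholder target; swap in the container-local, forwarded, or public URL.
  const url = process.env.TARGET_URL ?? 'http://localhost:3000';
  const result = await autocannon({ url, connections: 100, duration: 30 });
  // Anything outside the 2xx range counts against the server/proxy under test.
  console.log('2xx:', result['2xx'], 'non-2xx:', result.non2xx);
}

main();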

Anyway, I wrote up a detailed explanation of the testing I did as well as the results and sent an email to Northflank's support. They never replied.

After a few days of no help whatsoever from Northflank, I shut down my services and terminated my account. Even if they had replied at that point with a fix, I probably would have abandoned them anyway -- I'd be very scared that even if we got it working, something like this would happen again and I'd be hung out to dry.

Google Cloud Run

At this point I had become fed up with the "easy" hosting solutions and decided to just cave in and use a real hosting provider.

I had tried Google Cloud Run maybe a year or two ago, when it was pretty new, and it was just too rough to use. But it has gotten a lot better since then, and is fairly easy to set up. Certainly it is not Railway-level easy, but getting a basic Node web app serving on Cloud Run is pretty dang straightforward.

The flow to get an automatic GitHub push-triggered release system running is super easy. Like all the other services that do this, you basically just authenticate with GitHub, tell it which repo and branch to deploy from, and let buildpacks do the rest.
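As I understand it, for a Node app the buildpack mostly keys off package.json: it installs dependencies and runs the start script. Something like this is enough for it to figure out how to build and run a service (illustrative, not my actual manifest; the names are made up):

{
  "name": "example-web",
  "engines": { "node": "20.x" },
  "scripts": {
    "start": "node server.js"
  }
}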

However, literally every other aspect of using Google Cloud is way, way more complicated than with something like Railway. Cloud Run is only easy if you are OK with running in a single region. The moment you want to run in two regions, all the easy point-and-click setup falls apart and you have to write a bunch of super complex YAML configuration instead. And setting up global load balancing is an extraordinarily complex process.
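To give a flavor of what that means in practice: each additional region is its own deploy, and to put a global load balancer in front of them you create a serverless NEG per region and wire each one into the balancer's backend service. Roughly (from memory, with placeholder names, so treat this as a sketch):

# Deploy the same service in each region.
gcloud run deploy web --image=IMAGE_URL --region=us-central1
gcloud run deploy web --image=IMAGE_URL --region=europe-west1

# One serverless NEG per region, pointing at the Cloud Run service.
gcloud compute network-endpoint-groups create web-neg-us \
    --region=us-central1 \
    --network-endpoint-type=serverless \
    --cloud-run-service=web

# ...repeat for each region, then add every NEG as a backend of the
# load balancer's backend service.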

I had flashbacks to when I worked at Google. We used to joke endlessly that every single infrastructure project had two versions: a deprecated one and a not-yet-ready one. And it was true; in some cases there were multiple layers of deprecated versions. Every major infrastructure system perpetually had a big program running to get all its users migrated to the new version.

I ran into the deprecated/not-ready paradox twice while setting up the global load balancer.

First, the load balancing strategy is in this state. In true awful Google fashion, the console defaulted to creating my load balancer using the classic strategy, which is deprecated, and I immediately got a ton of notifications telling me to migrate to the new strategy. Why didn't it just default to creating new balancers on the new strategy? Hilariously, the migration can't be done in one step; you have to go through a four-step process of preparing, shifting a percentage of traffic over, and so on. I literally laughed out loud when I encountered this. It's just pure distilled Googliness.

Second, I needed to set up an edge SSL certificate. The load balancer setup had a nice simple flow that created one for me with a couple of clicks. Great! Except to get the SSL certificate signed, I had to prove that I own the anukari.com domain by pointing its A record at my new Google public IP. Which is a non-starter, because this is a commercial website serving user traffic, and pointing the A record at an IP that doesn't yet have a working SSL cert would take the site down for the 24-72 hours Google says domain verification can take.

I poked around a bit and found that Google's SSL system has an alternative way to prove domain ownership: via a CNAME record instead of the A record. This would let me prove ownership without bringing the site down. Then Google could provision the SSL cert, and I could cut my DNS A record over with zero downtime. Great! Except, in true Google fashion, this feature is only available in the new certificate system, not the "classic" system that works well with global load balancers.

Google Cloud's solution to this is absolutely comical. There's a special kind of object you create that "maps" new SSL certificates into the classic system and makes them available to the global load balancer. Of course there is no point-and-click GUI for any of this, so you have to run a bunch of extremely obscure gcloud commands to set up the mappings and link them to the load balancer. Again, this is just so Googly. When I was inside Google these kinds of things sort of made sense, but I can't believe Google does this stuff to its external customers. Misery loves company, I guess? :)
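For anyone who hits the same wall, the commands were roughly the following (reconstructed from memory with placeholder names, so double-check against the Certificate Manager docs rather than treating this as a recipe):

# Prove ownership of the domain via a CNAME record instead of the A record.
gcloud certificate-manager dns-authorizations create example-dns-auth \
    --domain="anukari.com"
# The describe output contains the CNAME record to add at your DNS host.
gcloud certificate-manager dns-authorizations describe example-dns-auth

# Issue a Google-managed certificate backed by that DNS authorization.
gcloud certificate-manager certificates create example-cert \
    --domains="anukari.com" \
    --dns-authorizations=example-dns-auth

# The "map" object that exposes the new-style cert to the load balancer.
gcloud certificate-manager maps create example-cert-map
gcloud certificate-manager maps entries create example-cert-map-entry \
    --map=example-cert-map \
    --certificates=example-cert \
    --hostname="anukari.com"

# Finally, attach the map to the load balancer's HTTPS proxy.
gcloud compute target-https-proxies update example-https-proxy \
    --certificate-map=example-cert-map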

All of the pain-in-the-butt setup aside, now that the web service is running on Google Cloud, I am quite happy. It performs substantially better than it did on Railway for about the same price, and I no longer have to worry about my host screwing me over in an emergency by blocking me from pushing whatever software I want to my container. The trade-off is that the configuration is WAY more complex, but I can live with that.

Railway.com knows better than you

Captain's Log: Stardate 79429.5

[EDIT1] Railway responded to this post in an HN comment. Quoting from their reply, "I wished the majority of our userbase knew better than us, but the reality is they don't."

[EDIT2] Railway's founder replied in a separate HN comment.

[EDIT3] Railway added an option to bypass the checks.

[EDIT4] I wrote a post about how this website is now on Google Cloud.

My honeymoon period with Railway.com for hosting this website is over.

I recently ran into a bug in my own code that caused some data corruption in my DB. The consequences of the bug were that a user who purchased Anukari could get emailed a license key that had already been sent to another user. This is really bad! Fortunately I noticed it quickly, and the fix was simple. I got the code prepared and went to do an emergency release to push out the fix.

It's worth noting that the bug had been around for a while, but it was latent until a couple days ago. So the fix was not a rollback, which would typically be the ideal way to fix a production issue quickly. The only solution was to roll forward with a patch.

Imagine my surprise when, during my emergency rollout, the release failed with:

==================================================================
SECURITY VULNERABILITIES DETECTED
==================================================================
Railway cannot proceed with deployment due to security
vulnerabilities in your project's dependencies.

Keep in mind that I'm super stressed about the fact that there is a severe and urgent problem with my app in production, which is causing me problems in real time. I've dropped everything else I was doing, and have scrambled to slap together a safe fix. I want to go to bed.

And my hosting provider is saying, "you are not allowed to push out your urgent fix, because we see that your app contains a far less urgent problem." There is no button that says "I understand, proceed anyway." Railway knows best.

I'll get back to the actual security vulnerabilities Railway detected, but let's ignore that for the moment and talk about the fact that Railway has just intentionally blocked me from pushing out a release, based on their assessment of the security risk. But how can that possibly make sense? Railway cannot know how urgent my release is. Railway cannot know whether the security vulnerabilities they detected in my app are even exploitable. They're just looking at node package versions marked as vulnerable.

The most ridiculous part of this is that the current production version of my app that I was trying to replace depended on the same vulnerable package versions as the patched version I was trying to push out to fix my urgent bug. So in fact the release added zero additional security risk to what was already serving.

Okay so what were the security vulnerabilities that were so dangerous that Railway.com's nanny system engaged to block me from pushing out an urgent fix? They cited two HIGH risk CVEs: this one and this one. They also cited a MEDIUM risk CVE but I'm not going to cover that.

The HIGH-risk CVEs are both denial-of-service (DoS) issues: attackers can craft payloads that send the server into an infinite loop, eating CPU and denying service.

Do I want to fix those CVEs for my service? Absolutely! Would I like my hosting provider to shoot me an email if they detect that my app has those vulnerabilities? Yes, please!

Do I want to be blocked from pushing urgent fixes to my app when there's zero evidence that either of those CVEs are being exploited for my app in any way? I think not.

I want to push out my fix, go to bed, and then come back the next day and upgrade package versions and fix the CVEs.

Now, look, was it a huge deal to upgrade those packages? Not really. In this case I was lucky that it was a straightforward upgrade. But anyone who's used a package manager knows about dependency hell, where trying to upgrade one package leads to a cascade of other dependencies needing to be updated, possibly with conflicts in those dependencies. And even if there is no dependency hell, any time package versions are changed, some degree of testing is warranted to make sure nothing was broken.

These are things that I did not want to even remotely think about during an urgent incident late in the evening, especially not for package versions that were already live in production. It took a stressful experience and added a huge amount of additional frustration.

Railway sometimes uses the mottos, "Ship without friction," and "Let the engineers ship." This felt like friction. I was literally not allowed to ship.

I complained about this to Railway, and they basically said they need to do this to protect neighboring apps from problems caused by my app. I was under the impression that this was the purpose of containers, but what do I know.

I'm guessing Railway is having issues with apps on their free tier wasting CPU, costing them money, and found that some free-tier apps had succumbed to these infinite-loop CVEs, burning free vCPUs. But I'm not on the free tier -- I pay for what I use.

I've been a huge fan of Railway for hosting this site. It really is simple and fast, and that's been perfect for such a simple site. I don't need a lot of fancy features, and Railway has let me deploy my code easily without worrying about the details. But this security overreach gives me pause, so this is a great time to look at other hosting providers.

I talked to the founder of Railway's competitor Northflank about this issue, and his reply was, "We would never block you from deploying if we can build or access a valid container image." To me that's the right answer.

Bird's-eye view

When I worked at Google, for a long time I was involved with production reliability/safety. I wasn't an SRE, but I worked on services that were in the critical path for ads revenue. Incidents often cost us untold millions of dollars, so reliability was a huge concern.

I wrote and reviewed many dozens of postmortem analyses, and I learned a lot about the typical ways that systems fail. I was in 24x7 oncall rotations for my entire career at Google, and during the latter years I was oncall as an Incident Commander inside ads, running the responses to (very) large incidents.

Google's ads systems were mostly pretty mature, and contained lots of safety checks, etc. I always found it really fascinating to see the ways that a serious outage would slip through all the defenses.

One pattern I came to recognize was that when deciding on the action items / follow-ups for an incident postmortem, the go-to reaction was always to add another safety check of some kind. This makes sense! Something went wrong, and it slipped past our defenses, so more defenses should help.

However, I came to view adding more and more defenses as an anti-pattern. There's a baseline level of safety checks that makes a system safer, but if you're not careful, adding further checks actually increases risk, because the safety checks themselves start to be the cause of new problems. There's a great word for this, by the way: iatrogenesis.

I'll give an example. I worked in (anti) ad fraud, and we had systems that would block ad requests that looked fraudulent or scammy. Bugs in these systems were extraordinarily dangerous, because in the worst case we could turn off all advertising revenue globally. (Yes, that is a thing that happened. More than once. I have stories. Buy me a beer sometime...)

These systems accumulated many safety checks over the years. A typical thing would be to monitor the amount of revenue being blocked for fraud reasons, and alert a human if that amount grew too quickly.

But in some cases, these safety checks were allowed to automatically disable the system as a fail-safe, if things looked bad enough. The problem is... there's not really any such thing as "failing safe" here. For a spam filter, failing closed means revenue loss, and failing open means allowing fraud to get through.

So imagine that a script kiddie sits down and writes a bash loop to wget an ad click URL 10,000 times per second. The spam filtering systems trivially detect this as bot traffic, and filter it out so that advertisers aren't paying for these supposed clicks.

But maybe some safety check system sees that this trivial spam filter has suddenly started blocking hundreds of thousands of dollars of revenue. That's crazy! The filter must have gone haywire! We better automatically disable the filter to stop the bleeding!

Uh oh. Now those script-kiddie requests are being treated as real ad traffic and billed to advertisers. We have ourselves a major incident, caused by the very safety check that was meant to protect us, a check created as the result of a previous postmortem analysis.
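To make the failure mode concrete, here's a deliberately toy sketch of the kind of guardrail I mean (purely hypothetical; the real systems looked nothing like this):

// Hypothetical guardrail: auto-disable the spam filter if it starts
// blocking "too much" revenue. The check assumes that a big spike in
// blocked revenue means the filter is broken -- but a big spike is also
// exactly what a trivial bot flood looks like when the filter is working.
const BLOCKED_REVENUE_LIMIT_PER_HOUR = 100_000; // made-up threshold

function shouldDisableFilter(blockedRevenueLastHour: number): boolean {
  if (blockedRevenueLastHour > BLOCKED_REVENUE_LIMIT_PER_HOUR) {
    // "Fail safe" by turning the filter off. For a spam filter there is
    // no safe direction to fail: this converts bot traffic into billed clicks.
    return true;
  }
  return false;
}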

I'm not saying that all safety checks are bad. You certainly need some. But what I am saying is that if you allow extra safety checks to become the knee-jerk reaction to incidents, you will eventually end up with a complex mess where the safety checks themselves become the source of future incidents. You build a Rube Goldberg Safety Machine, but it doesn't work. The marble flies off the rails.

It's really attractive to just keep adding safety checks, but what is actually required is to take a step back, look at the system holistically, and think about simple ways to make it safer. Generally after a few basic safety checks are in place, further safety comes from redesigning the system itself. Or even better, rethinking the problem that you're trying to solve altogether. Maybe there's a completely different approach that side-steps an entire category of reliability issues.

Preparing for the open (paid) Beta

Captain's Log: Stardate 78674.7

It's been a while since I wrote a devlog update, but I promise it's not for a lack of stuff to talk about. Rather, it's the opposite: I've been so busy with Anukari work I haven't had time to write up a devlog entry!

Anukari is feeling extremely stable at this point, and I'm up to about 85 factory presets. I did a pass through all the existing presets and made sure that each of them has a couple of DAW host automation parameters, which made them dramatically more fun to play with. And, in an exciting development, I've contracted with someone awesome to make more presets, along with tutorial videos and a few other things.

With someone else working on presets and tutorials, I've been free for the last week or two to prepare for the open Beta. This will be the first public (non-invitation) release of Anukari, which is extremely exciting. While there is still a little work left in the Anukari product itself, mostly revolving around implementing the free trial mode, adding some first-time startup screens, and so on, what I've been working on recently is getting the website ready.

The biggest thing was figuring out how I want to do payments. I toyed with a bunch of possibilities, but in the end I settled on using Shopify. The biggest part of my rationale is that, at least in the US, Shopify has become so ubiquitous that I think when I direct my customers to the Shopify checkout flow, they'll be very comfortable with it, and will probably trust Shopify with their credit card number much more than if I implemented CC# collection myself.

But also, using Shopify saves me from a bunch of other tedious work, like implementing admin dashboards for looking at order history, issuing refunds, and so on. The basic integration with Shopify really only took a couple of days.
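I won't document the integration itself here, but to give a sense of its size, the glue is mostly small pieces like this: verifying Shopify's webhook signature before trusting an order notification. This is an illustrative sketch of one common integration point, not necessarily how mine is wired up:

import { createHmac, timingSafeEqual } from 'node:crypto';

// Shopify signs each webhook with HMAC-SHA256 over the raw request body,
// base64-encoded in the X-Shopify-Hmac-Sha256 header.
function isValidShopifyWebhook(rawBody: Buffer, hmacHeader: string, secret: string): boolean {
  const digest = createHmac('sha256', secret).update(rawBody).digest('base64');
  const expected = Buffer.from(digest);
  const received = Buffer.from(hmacHeader);
  return expected.length === received.length && timingSafeEqual(expected, received);
}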

Overall I'm happy enough with Shopify. But of course I have complaints. My biggest gripe has to do with Shopify's developer ecosystem. It seems to me that a large part of their success is their "partner program," a Shopify-supported system for software developers to offer their services to businesses that want Shopify storefronts. So far so good -- this makes a ton of sense, as most local/small businesses that use Shopify are not going to have software engineers on hand to do this stuff.

Specifically, my gripe is with the app ecosystem. Shopify developers can create apps that add helpful features to a store, which shop owners can then install. For example, there's an app that adds a warning to the cart when too many of one item are added. Many of the apps are extremely simple stuff like that. Which sounds fine, right?

It is not fine. The problem is that since developers sell these Shopify apps on Shopify's app store, Shopify gets a cut. And all the apps I've seen have a recurring subscription payment. The simple 10-lines-of-code app I described above is $6/month. So, because Shopify gets a cut of $6/month, they are hugely disincentivized to add these kinds of simple features to the core platform. They are also disincentivized to make the core platform easier to use in general, because then maybe users would solve their own problems without paying $6/month to solve them.

Obviously for me this is not such a big problem, since I'm a software engineer and can mostly do these things myself. Though, I'd really prefer not to have to learn any more about Shopify's reprehensible template system than I have already. But for local/small business owners, Shopify's app store is just going to nickel-and-dime them to death. I can easily see a specialty business having to pay subscriptions for a handful of apps (many of which cost a lot more than $6/month) just to do the basic things they need.

Anyway, that's enough ranting about Shopify. I have things pretty much working, so I shouldn't complain too much.
