Way more detail than you ever wanted to know about the development of the Anukari 3D Physics Synthesizer

This website is now on Google Cloud

Captain's Log: Stardate 79462.5

Following up on my last post about issues with Railway for hosting this website, as the title of this post says, it's now running on Google Cloud.

Railway's simplicity was wonderful. Getting a basic web service up and running was incredibly easy, and the UX was clear and straightforward. Railway doesn't try to do everything, which means it isn't massively complicated. And there's really just one way to do each of the things it does, so you don't waste time comparing a bunch of options.

But in the end, I am simply never going to tolerate my infrastructure provider telling me what software I'm allowed to run. Railway has since walked this back a bit and now allows their software version audit checks to be bypassed, but I've learned something about their philosophy as an infrastructure provider, and I want nothing to do with it.

Trying Northflank Hosting

At first I thought maybe I'd just move to one of Railway's direct competitors such as Northflank or Fly.io. I tried Northflank first, and it simply did not work. I got the basic web service configured, which was very easy, though a little harder than Railway because Northflank has far more features to navigate. But in the absolute most basic configuration, external requests to my service were returning HTTP 503 about 30% of the time.

I emailed Northflank's support about the 503 issue, and the CEO replied (I am not sure if this is encouraging or worrying). He suggested that I should bind my service to 0.0.0.0 instead of the default Node setup, which was binding to a specific interface. This kind of made sense to me; if there are multiple interfaces and requests may come in on any of them, obviously I need to listen on all of them. I made this change, but it didn't help.
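For reference, here is roughly what that change looks like in a plain Node HTTP server. This is just a minimal sketch, not my actual server code, and the port handling is a placeholder; the relevant part is the host argument to listen().

```typescript
// Minimal sketch of binding a Node HTTP server to all interfaces (0.0.0.0).
// Not my actual server code; the port handling is a placeholder.
import { createServer } from "node:http";

const server = createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("ok\n");
});

const PORT = Number(process.env.PORT ?? 8080);

// Passing "0.0.0.0" as the host makes Node accept connections on every
// interface, instead of whatever single interface a framework default picks.
server.listen(PORT, "0.0.0.0", () => {
  console.log(`listening on 0.0.0.0:${PORT}`);
});
```

It was a sensible thing to rule out, but as noted, it didn't change the 503 behavior at all.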

I decided to do some debugging on my own, and from what I can tell, Northflank's proxy infrastructure was not working correctly. I did an HTTP load test by running autocannon locally in my container, and even under extremely heavy load my Node server was returning 100% HTTP 200s. I then used Northflank's command line tool to forward my HTTP port from the container to my workstation, and ran autocannon on my workstation. Again, this produced 100% HTTP 200s. Finally I ran autocannon against the public Northflank address for the server, and it was about 80% HTTP 503s. My guess is that some aspect of Northflank's internal proxy configuration just got set up wrong. The 503 responses included a textual error message that appears to come from Envoy, which I believe is the proxy that Northflank uses internally.
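For anyone who wants to reproduce this kind of layered test, the gist was just running autocannon against each entry point and comparing the status-code breakdown. Here's a rough sketch using autocannon's programmatic API; the URLs are placeholders for the in-container address, the port forwarded to my workstation, and the public hostname:

```typescript
// Sketch of the layered load test: same tool, three different entry points.
// The URLs below are placeholders, not the real addresses I used.
import autocannon from "autocannon";

const targets = [
  "http://localhost:3000",          // run inside the container itself
  "http://localhost:8080",          // container port forwarded to my workstation
  "https://my-service.example.com", // public address behind the provider's proxy
];

async function main() {
  for (const url of targets) {
    const result = await autocannon({ url, connections: 50, duration: 10 });
    // autocannon buckets responses by status class; if 2xx is ~100% locally
    // but non-2xx spikes only on the public address, the proxy layer is suspect.
    console.log(url, "2xx:", result["2xx"], "non-2xx:", result.non2xx);
  }
}

main().catch(console.error);
```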

Anyway, I wrote up a detailed explanation of the testing I did as well as the results and sent an email to Northflank's support. They never replied.

After a few days of no help whatsoever from Northflank, I shut down my services and terminated my account. Even if they had replied at that point with a fix, I probably would have abandoned them anyway -- I'd be very scared that even if we got it working, something like this would happen again and I'd be hung out to dry.

Google Cloud Run

At this point I had become fed up with the "easy" hosting solutions and decided to just cave in and use a real hosting provider.

I had tried Google Cloud Run maybe a year or two ago, when it was pretty new, and it was just too rough to use. But it has gotten a lot better since then, and is fairly easy to set up. Certainly it is not Railway-level easy, but getting a basic Node web app serving on Cloud Run is pretty dang straightforward.

The flow to get an automatic GitHub push-triggered release system running is super easy. Like all the other services that do this, you basically just authenticate with GitHub, tell it which repo/branch to deploy from, and let buildpacks do the rest.

However, literally every other aspect of using Google Cloud is way, way more complicated than with something like Railway. Cloud Run is only easy if you are OK with running in a single region. The moment you want to run in two regions, all the easy point-and-click setup falls apart and you have to write a bunch of super complex YAML configuration instead. And setting up global load balancing is an extraordinarily complex process.

I had flashbacks to back when I worked at Google. We used to joke endlessly about how every single infrastructure project had a deprecated version, and a not-yet-ready version, and this was true. In some cases there were multiple layers of deprecated versions. Every major infrastructure system perpetually had a big program running to get all its users migrated to the new version.

I ran into the deprecated/not-ready paradox twice while setting up the global load balancer.

First, the load balancing strategy is in this state. In true awful Google fashion, it defaulted to creating my load balancer using the classic strategy, which is deprecated, and suddenly I got a ton of notifications to migrate to the new strategy. Why didn't it just default to creating new balancers on the new strategy? Hilariously, the migration can't be done in one step; you have to go through a four-step process of preparing, doing a % migration, etc. I literally laughed out loud when I encountered this. It's just pure distilled Googliness.

Second, I needed to set up an edge SSL certificate. The load balancer setup had a nice simple flow that created one for me with a couple clicks. Great! Except to get the SSL certificate signed, I had to prove that I owned the anukari.com domain by pointing the A record to my Google public IP. Which is impossible, because it is a commercial website serving user traffic, and changing the A record to point to an IP that currently doesn't have an SSL cert will bring the site down for 24-72 hours while Google verifies my domain ownership.

I poked around a bit and found that Google's SSL system has an alternative way to prove domain ownership, via a CNAME record instead of the A record. This would allow me to prove domain ownership without bringing the site down. Then Google could provision the SSL cert and I could cut over my DNS A record with zero downtime. Great! Except in true Google fashion, this feature is only available for the new certificate system, and not the "classic" system that works well with global load balancers.
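As an aside, before asking Google to issue against the CNAME, it's worth confirming the record has actually propagated. A tiny Node check like the sketch below does the trick; note that the record name and expected target here are made-up placeholders, not the actual values Google's certificate flow hands you.

```typescript
// Sketch: confirm the DNS-authorization CNAME is visible before requesting
// the cert. The record name and expected target are placeholders; use the
// values the certificate setup flow actually gives you.
import { resolveCname } from "node:dns/promises";

const RECORD = "_acme-challenge.anukari.com";                  // placeholder
const EXPECTED_SUFFIX = ".authorize.certificatemanager.goog";  // placeholder

async function checkCname(): Promise<void> {
  const targets = await resolveCname(RECORD);
  const ok = targets.some((t) => t.endsWith(EXPECTED_SUFFIX));
  console.log(ok ? "CNAME is live:" : "CNAME missing or wrong:", targets);
}

checkCname().catch((err) => console.error("lookup failed:", err.message));
```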

Google Cloud's solution to this is absolutely comical. There's a special kind of object you create that can be used to "map" new SSL certificates into the classic system and make them available to the global load balancer. Of course there is no point-and-click GUI for this stuff, so you have to run a bunch of extremely obscure gcloud commands to set up the mappings and link them to the load balancer. Again this is just so Googly it was completely comical. When I was inside Google these kinds of things kind of made sense, but I can't believe Google does this stuff to their external customers. Misery loves company, I guess? :)

All of the pain-in-the-butt setup aside, now that I have the web service running on Google Cloud, I am quite happy. It performs substantially better than it did on Railway for about the same price, and I no longer have to worry about my host screwing me over in an emergency by blocking me from pushing whatever software I want to my container. The trade-off is that the configuration is WAY more complex, but I can live with that.

