Abandoning resend.com for email
Captain's Log: Stardate 79603
I mentioned in my last post that I've begun working on a marketing plan for Anukari. I'm feeling pretty optimistic that it will be possible to get a steady trickle of sales going, and I also think it's possible that something more than a trickle could happen at some point. Fundamentally I just want enough sales to fund my continued development of Anukari, but I'd obviously be open to more sales, which could lead to further interesting possibilities.
Either way, I like to plan for success, and so one thing I've been working on is fixing a few small customer support headaches, so that they don't become big headaches in the event of lots of sales.
Long term the biggest thing will always be making sure Anukari performs well and is compatible with as many machines as possible. That is feeling pretty good right now.
Now that the product is not generating so many support requests, the next biggest support issue is with account emails failing to be delivered to users. The chief issue here is that occasionally a user won't receive the email containing their license key after they purchase Anukari.
This is immensely frustrating for the customer. The normal flow is that when they check out via Shopify, they receive a receipt from Shopify, and that email says "you'll get a second email with the license key." Obviously I'd prefer for these to be one email, but that's not really possible at this point, and it's not that big of a deal.
But some users get the email from Shopify, and then never get the second email. I am the only customer support person, so if someone buys Anukari right after I go to bed, even if they email to complain, they won't get their license key for at least 8 hours. Awful! Imagine it's Friday night, you're stoked to finally have a few hours to play with some new VSTs, you spend your hard-earned money on one, and then no license key comes until tomorrow. Ugh. I'd be super annoyed. (That said, evidently my customers are saints, because they have all been super understanding.)
Email is a tricky thing. Spam filters exist, and also things like IP reputation are really complicated. I have worked on spam/fraud prevention, so I know how difficult some of these problems are.
When email delivery fails, my first question has always been "did you check your spam folder?" Curiously, the missing email has almost never been there.
The main reason I've had emails fail to be delivered is that Resend, my email provider, simply chose not to send them as requested, due to its global email suppression list. This is a list of email addresses, shared across all Resend accounts, that have hard-bounced or been flagged by providers. If a customer's email somehow ends up on this list, Resend will not attempt delivery at all. Resend's help page says that a user marking an email as spam can be grounds for global suppression. That means that if another Resend customer sends a spammy email to someone, it can block you from sending email to them later.
I've contacted Resend support about this several times. At one point I was told that I could remove addresses from the global suppression list myself. At other times, the Resend customer support folks had to remove the address for me. The suppression GUI seems to have changed a couple of times; right now it doesn't look like Resend users can request removal from the list at all. In my case, the GUI stated that the address that failed delivery had already been removed from the suppression list; however, when I retried sending the email, it was suppressed again.

Over the course of several contacts with Resend support about this global suppression list, it became clear that Resend sees it as a feature, not a bug. And again, I understand that they have a difficult problem to solve: maintaining good deliverability across many users, some of whom are surely spammers trying to abuse Resend's platform.
The thing is: to me, Resend has exactly one job, and that is to attempt to deliver email when I send it. And sometimes, Resend simply chooses not to even attempt delivery. On purpose.
The email addresses that Resend has suppressed are totally normal: everyday people, typically on GMail accounts, nothing weird. Shopify's emails invariably get delivered without issue, and my replies to their support requests, sent from GMail, also arrive perfectly. Perhaps there was a one-time delivery hiccup, or another Resend user sent them some spam. Either way, Resend is being super aggressive about suppression in a way that impacts all of its users.
IMO the global suppression list is an incredibly bad design, especially with how heavy-handed Resend is being in populating it. I can verify firsthand that it causes plenty of grief. It's a difficult problem, but there are many approaches to managing reputation for shared email IPs. This one is no bueno.
So after several complaints to Resend without a solution, I migrated anukari.com completely off Resend, to Postmark. Postmark has a very good reputation for high deliverability, which again, is the only thing I actually want from an email provider. As far as I can tell, Postmark does not have a global email suppression list, but instead has per-account per-stream lists, which makes way more sense. Here's hoping that this leads to fewer frustrated customers!
Railway.com knows better than you
Captain's Log: Stardate 79429.5
[EDIT1] Railway responded to this post in an HN comment. Quoting from their reply, "I wished the majority of our userbase knew better than us, but the reality is they don't."
[EDIT2] Railway's founder replied in a separate HN comment.
[EDIT3] Railway added an option to bypass the checks.
[EDIT4] I wrote a post about how this website is now on Google Cloud.
My honeymoon period with Railway.com for hosting this website is over.
I recently ran into a bug in my own code that caused some data corruption in my DB. The consequences of the bug were that a user who purchased Anukari could get emailed a license key that had already been sent to another user. This is really bad! Fortunately I noticed it quickly, and the fix was simple. I got the code prepared and went to do an emergency release to push out the fix.
It's worth noting that the bug had been around for a while, but it was latent until a couple days ago. So the fix was not a rollback, which would typically be the ideal way to fix a production issue quickly. The only solution was to roll forward with a patch.
Imagine my surprise when, during my emergency rollout, the release failed with:
==================================================================
SECURITY VULNERABILITIES DETECTED
==================================================================
Railway cannot proceed with deployment due to security
vulnerabilities in your project's dependencies.
Keep in mind that I'm super stressed about the fact that there is a severe and urgent problem with my app in production, which is causing me problems in real time. I've dropped everything else I was doing, and have scrambled to slap together a safe fix. I want to go to bed.
And my hosting provider is saying, "you are not allowed to push out your urgent fix, because we see that your app contains a far less urgent problem." There is no button that says "I understand, proceed anyway." Railway knows best.
I'll get back to the actual security vulnerabilities Railway detected, but let's ignore that for the moment and talk about the fact that Railway has just intentionally blocked me from pushing out a release, based on their assessment of the security risk. But how can that possibly make sense? Railway cannot know how urgent my release is. Railway cannot know whether the security vulnerabilities they detected in my app are even exploitable. They're just looking at node package versions marked as vulnerable.
The most ridiculous part is that the production version of my app that I was trying to replace depended on the very same vulnerable package versions as the patched build I was pushing out to fix my urgent bug. So the release added zero additional security risk over what was already serving.
Okay so what were the security vulnerabilities that were so dangerous that Railway.com's nanny system engaged to block me from pushing out an urgent fix? They cited two HIGH risk CVEs: this one and this one. They also cited a MEDIUM risk CVE but I'm not going to cover that.
The HIGH risk CVEs are both DoS issues. Attackers can craft payloads that send the server into an infinite loop, sucking up CPU and denying service.
Do I want to fix those CVEs for my service? Absolutely! Would I like my hosting provider to shoot me an email if they detect that my app has those vulnerabilities? Yes, please!
Do I want to be blocked from pushing urgent fixes to my app when there's zero evidence that either of those CVEs are being exploited for my app in any way? I think not.
I want to push out my fix, go to bed, and then come back the next day and upgrade package versions and fix the CVEs.
Now, look, was it a huge deal to upgrade those packages? Not really. In this case I was lucky that it was a straightforward upgrade. But anyone who's used a package manager knows about dependency hell, where trying to upgrade one package leads to a cascade of other dependencies needing to be updated, possibly with conflicts in those dependencies. And even if there is no dependency hell, any time package versions are changed, some degree of testing is warranted to make sure nothing was broken.
These are things that I did not want to even remotely think about during an urgent incident late in the evening, especially not for package versions that were already live in production. It took a stressful experience and added a huge amount of additional frustration.
Railway sometimes uses the mottos, "Ship without friction," and "Let the engineers ship." This felt like friction. I was literally not allowed to ship.
I complained about this to Railway, and they basically said they need to do this to protect neighboring apps from problems caused by my app. I was under the impression that this was the purpose of containers, but what do I know.
I'm guessing Railway is having issues with apps on their free tier wasting CPU resources, costing them money, and found that some free-tier apps succumbed to these infinite-loop CVEs, wasting free vCPUs. But I'm not on the free tier -- I pay for what I use.
I've been a huge fan of Railway for hosting this site. It really is simple and fast, and that's been perfect for such a simple site. I don't need a lot of fancy features, and Railway has let me deploy my code easily without worrying about the details. But this security overreach gives me pause, so this is a great time to look at other hosting providers.
I talked to the founder of Railway's competitor Northflank about this issue, and his reply was, "We would never block you from deploying if we can build or access a valid container image." To me that's the right answer.
Bird's eye view
When I worked at Google, for a long time I was involved with production reliability/safety. I wasn't an SRE, but I worked on services that were in the critical path for ads revenue. Incidents often cost us untold millions of dollars, so reliability was a huge concern.
I wrote and reviewed many dozens of postmortem analyses, and I learned a lot about the typical ways that systems fail. I was in 24x7 oncall rotations for my entire career at Google, and during the latter years I was oncall as an Incident Commander inside ads, running the responses to (very) large incidents.
Google's ads systems were mostly pretty mature, and contained lots of safety checks, etc. I always found it really fascinating to see the ways that a serious outage would slip through all the defenses.
One pattern I came to recognize was that when deciding on the action items / follow-ups for an incident postmortem, the go-to reaction was always to add another safety check of some kind. This makes sense! Something went wrong, and it slipped past our defenses, so more defenses should help.
However I came to view adding more and more defenses as an anti-pattern. There's a baseline level of safety checks that make a system safer, but if you are not careful, adding more safety checks actually increases risk, because the safety checks themselves start to be the cause of new problems. There's a great word for this, by the way, iatrogenesis.
I'll give an example. I worked in (anti) ad fraud, and we had systems that would block ad requests that looked fraudulent or scammy. Bugs in these systems were extraordinarily dangerous, because in the worst case we could turn off all advertising revenue globally. (Yes, that is a thing that happened. More than once. I have stories. Buy me a beer sometime...)
These systems accumulated many safety checks over the years. A typical thing would be to monitor the amount of revenue being blocked for fraud reasons, and alert a human if that amount grew too quickly.
But in some cases, these safety checks were allowed to automatically disable the system as a fail-safe, if things looked bad enough. The problem is... there's not really any such thing as "failing safe" here. For a spam filter, failing closed means revenue loss, and failing open means allowing fraud to get through.
So imagine that a script kiddie sits down and writes a bash loop to wget an ad click URL 10,000 times per second. The spam filtering systems trivially detect this as bot traffic, and filter it out so that advertisers aren't paying for these supposed clicks.
But maybe some safety check system sees that this trivial spam filter has suddenly started blocking hundreds of thousands of dollars of revenue. That's crazy! The filter must have gone haywire! We better automatically disable the filter to stop the bleeding!
Uh oh. Now those script kiddie requests are being treated as real ad traffic and billed to advertisers. We have ourselves here a major incident, caused by the safety check that was meant to protect us, created as the result of a previous postmortem analysis.
I'm not saying that all safety checks are bad. You certainly need some. But what I am saying is that if you allow extra safety checks to become the knee-jerk reaction to incidents, you will eventually end up with a complex mess where the safety checks themselves become the source of future incidents. You build a Rube Goldberg Safety Machine, but it doesn't work. The marble flies off the rails.
It's really attractive to just keep adding safety checks, but what is actually required is to take a step back, look at the system holistically, and think about simple ways to make it safer. Generally after a few basic safety checks are in place, further safety comes from redesigning the system itself. Or even better, rethinking the problem that you're trying to solve altogether. Maybe there's a completely different approach that side-steps an entire category of reliability issues.
Working better on some Radeon chips
Captain's Log: Stardate 79013.9
The issue with Radeon
As discussed in a previous post, I've been fighting with Radeon mobile chips, specifically the gfx90c. The problem originally presented with a user who had both an NVIDIA and a Radeon chip, and even though they were using the NVIDIA chip for Anukari, something changed in the 0.9.6 release that caused the Radeon drivers to crash internally (i.e., the drivers did not return an error code; they were simply aborting the process).
I'd like to eventually offer official support for Radeon chips. That's still likely a ways off, but at the very least I don't want things crashing. Anukari is extremely careful about how it interacts with the GPU, and when a particular GPU is not usable, it should (preferred) simply pick a different GPU, or at the very least, show a helpful message in the GUI explaining the situation.
Unfortunately it was difficult to debug this issue remotely. The user was kind enough to run an instrumented binary that confirmed that Anukari was calling clBuildProgram() with perfectly valid arguments, and it was simply aborting. I really needed to run Anukari under a debugger to learn more.
So I found out what laptop my bug-reporting user had, and ordered an inexpensive used Lenovo Ideapad 5 on eBay. I've had to buy a lot of testing hardware, and I've saved thousands of dollars by buying it all second-hand or refurbished. In this case it did take two attempts, as the first Ideapad 5 I received was super broken. But the second one works just fine.
Investigation
After getting the laptop set up and running Anukari under the MSVC debugger, I immediately saw debug output like this just prior to the driver crash:
LLVM ERROR: Cannot select: 0x1ce8fdea678:
ch = store<ST2[%arrayidx_v41102(align=4)+2](align=2),
trunc to i16> 0x1ce8fe462a8, 0x1ce8fde8fb8, 0x1ce8fe470b8,
undef:i32
0x1ce8fde8fb8: f32,ch = load<LD2[%1(addrspace=1)(align=8)+294]
(align=2)> 0x1ce8fdd7638, 0x1ce8fde8b80, [...]
First of all, I want to call out AMD on their exceptionally shoddy driver implementation. It's just absurd that they'd allow a compilation error internal to the driver to abort the whole process. Clearly in this case clBuildProgram() should return CL_BUILD_PROGRAM_FAILURE, and the program log (the compiler error text) should be filled with something helpful, at a minimum, the raw LLVM output, but preferably something more readable. This is intern-level code, in a Windows kernel driver. Wow.
After reading through this carefully, all I could really make of it was that LLVM was unable to find a machine instruction to read data from this UpdateEntitiesArguments struct in addrspace=7 and write it to memory in addrspace=1. From context, I could guess that addrspace=1 is private (thread) memory, and addrspace=7 is whatever memory the kernel arguments are stored in. I had a harder time understanding why it couldn't find such an instruction. I thought maybe it had to do with an alignment problem, but wasn't sure.
This struct contains a number of fields, and I couldn't tell from the error which field was the problem. So I just used a brute-force approach and commented out most of the kernel code, and added code back in slowly. It compiled fine until I uncommented a line of code like float x = arguments.field[i]. I did some checking to ensure that field was aligned in a sane way, and after confirming that, I came to the conclusion that the gfx90c chip simply does not have an instruction for loading memory from addrspace=7 with a dynamic offset. In other words, the gfx90c appears to lack the ability to address arrays in argument memory with a non-constant offset.
Which, as far as I can tell, means that the gfx90c really doesn't support OpenCL properly. Every other OpenCL implementation I've used can do this, including NVIDIA, Intel Iris, Apple, and even newer Radeon chips like the gfx1036. I don't see anything in the OpenCL specification that would indicate that this is a limitation.
But even assuming that it's somehow within specs for an OpenCL implementation not to support this feature, obviously aborting in the driver is completely unreasonable behavior. Again, this is a really shoddy implementation, and when people ask about why Anukari doesn't yet officially support Radeon chips, this is the kind of reason that I point to. The drivers are buggy, and worse they are inconsistent across the hardware.
The (very simple) workaround
Anyway, I have very good (performance) reasons for storing some small constant-size arrays (with dynamic indexes) in kernel arguments, but those reasons really apply more to the CUDA backend. So I made some simple changes to Anukari to store these small arrays in constant device memory, and the gfx90c implementation now works just fine.
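For illustration, here's a CPU-side C sketch of the two code shapes. The struct and field names are hypothetical stand-ins for the real kernel code, and on the GPU the "after" array would live in __constant memory rather than behind a plain pointer; the point is only the indexing pattern that tripped up the gfx90c.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's argument struct.
 * On the gfx90c, dynamically indexing an array inside a by-value
 * kernel argument (argument memory, addrspace=7) is what LLVM could
 * not select an instruction for. */
typedef struct {
    float field[8];  /* small constant-size array */
    int   count;
} UpdateEntitiesArguments;

/* BEFORE: the array lives inside the by-value argument struct and is
 * indexed with a non-constant offset. Fine on NVIDIA, Intel Iris,
 * Apple, and gfx1036; aborts inside the gfx90c driver. */
float sum_from_arguments(UpdateEntitiesArguments args) {
    float sum = 0.0f;
    for (int i = 0; i < args.count; i++)
        sum += args.field[i];  /* dynamic offset into argument memory */
    return sum;
}

/* AFTER: the array is moved out of the argument struct into constant
 * device memory (modeled here as a plain pointer), which every tested
 * chip handles. */
float sum_from_constant_mem(const float *field, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; i++)
        sum += field[i];  /* dynamic offset into __constant memory */
    return sum;
}
```

Both functions compute the same thing; only where the array lives changes, which is why the workaround was cheap on the OpenCL side.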
Given that I recently upgraded my primary workstation to a very new AMD Ryzen CPU, I now have two Radeon test chips: the gfx90c in the Ideapad 5, and the gfx1036 that's built into my Ryzen. The Anukari GPU code appears to work flawlessly on both, though it doesn't perform all that well on either. Next up will be more testing of the Vulkan graphics, which have also been a pain point on Radeon chips in the past.
The Audio Units logo and the Audio Units symbol are trademarks of Apple Computer, Inc.
VST is a trademark of Steinberg Media Technologies GmbH, registered in Europe and other countries.