The chaos monkey lives
In the last couple of days I finally got around to building the "chaos monkey" that I've wanted to have for a long time. The chaos monkey is a script that randomly interacts with the Anukari GUI with mouse and keyboard events, sending them rapidly and with intent to cause crashes.
I first heard about the idea of a chaos monkey from Netflix, who have a system that randomly kills datacenter jobs. This is a really good idea, because you never actually know that you have N+1 redundancy until one of the N jobs/servers/datacenters actually goes down. Too many times I have seen systems that supposedly had N+1 redundancy die when just one cluster failed, because nobody had tested this, and surprise, the configuration somehow actually depends on all the clusters being up. Netflix has the chaos monkey, and at Google we had DiRT testing, where we simulated things like datacenter failures on a regular basis.
But the "monkey" concept goes back to 1983 with Apple testing MacPaint. Wikipedia claims that the Apple Macintosh didn't have enough resources to do much testing, so Steve Capps wrote the Monkey program which automatically generated random mouse and keyboard inputs. I read a little bit about the original Monkey and it's funny how little has changed since then. They had the problem that it only ran for around 20 minutes at first, because it would always end up finding the application quit menu. I had the same problem, and Anukari now has a "monkey mode" which disables a few things like the quit menu, but also dangerous things like saving files, etc.
The Anukari chaos monkey is decently sophisticated at this point. It generates all kinds of random mouse and keyboard inputs, including weird horrible stuff like randomly adding modifiers and pressing keys during a mouse drag. It knows how to move and resize the window (since resizing has been a source of crashes in the past). It knows about all the hotkeys that Anukari supports, and presses them all rapidly. I really hate watching it work because it's just torturing the software.
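To give a rough idea of what "random inputs with modifiers and keys during a drag" can look like in code, here is a minimal sketch. It is not the actual Anukari script: the `Event` type, the `MODIFIERS` and `HOTKEYS` lists, the probabilities, and the seeded generator are all assumptions made for illustration. It only produces a plan of events as plain data; a separate executor would translate each one into OS-level input (for example with pyautogui calls such as `moveTo`, `mouseDown`, `keyDown`, and `keyUp`).

```python
import random
from dataclasses import dataclass, field

# Hypothetical placeholders, not Anukari's real hotkey list.
MODIFIERS = ["shift", "ctrl", "alt"]
HOTKEYS = ["delete", "space", "z", "a"]
KINDS = ["move", "drag", "key", "resize"]


@dataclass
class Event:
    kind: str                                         # one of KINDS
    args: tuple = ()                                  # coordinates, key name, or window size
    modifiers: list = field(default_factory=list)     # modifiers held during the event
    keys_during: list = field(default_factory=list)   # keys pressed in the middle of a drag


def random_point(rng, width, height):
    """Pick a random pixel inside a width x height area."""
    return (rng.randrange(width), rng.randrange(height))


def random_event(rng, width, height):
    """Build one random event, sometimes with random modifiers held down."""
    kind = rng.choice(KINDS)
    mods = [m for m in MODIFIERS if rng.random() < 0.2]
    if kind == "move":
        return Event("move", random_point(rng, width, height), mods)
    if kind == "drag":
        start = random_point(rng, width, height)
        end = random_point(rng, width, height)
        # The nasty part: zero to two hotkeys pressed while the button is down.
        keys = [rng.choice(HOTKEYS) for _ in range(rng.randrange(3))]
        return Event("drag", start + end, mods, keys)
    if kind == "key":
        return Event("key", (rng.choice(HOTKEYS),), mods)
    # "resize": a new random window size.
    return Event("resize", (rng.randrange(400, 1600), rng.randrange(300, 1000)), mods)


def generate(seed, count, width=1280, height=720):
    """Generate a reproducible list of random events from a seed."""
    rng = random.Random(seed)
    return [random_event(rng, width, height) for _ in range(count)]
```

Seeding the generator is a choice made for this sketch: the same seed yields the same event sequence, which would make it possible to replay a run that ended in a crash.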
The chaos monkey has already found a couple of crashes and several less painful bugs, which I have fixed. One of the crashes was something I completely didn't expect, and didn't think was possible, having to do with keyboard hotkey events deleting entities while a slider was being dragged to edit the parameters of those entities. I never would have tested this manually because I didn't think it could happen.
The chaos monkey is pretty simple. The biggest challenge was just keeping it from wreaking havoc on my workstation. I'm using pyautogui, which generates OS-level input events, meaning that the events will get sent to whatever window is active. So at the start, if Anukari crashed, the chaos monkey would start torturing e.g. VSCode or Chrome or something. It was horrible, and a couple of times it got loose and went crazy. It also figured out how to send OS-level hotkeys to open the task manager, etc.
The main safety protection I ended up implementing is that prior to each mouse or keyboard event, the script uses the win32 APIs to query the window under the mouse and verifies that it belongs to Anukari. There's some fiddly stuff here, like figuring out whether a window has the same process ID as Anukari (some pop-up menus don't have Anukari as a parent window), and some special handling for file browser menus, which don't even share the process ID. But overall I've gotten it to the point where I have let it run for hours on my desktop without worry.
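A check like this can be sketched roughly as follows. This is an illustration under assumptions, not the real script: `make_guard`, `win32_guard`, and the `is_special_window` hook are hypothetical names, and the special handling for file browser windows is left as a caller-supplied function because the post doesn't say how it works. The pywin32 calls used in `win32_guard` (`win32gui.GetCursorPos`, `win32gui.WindowFromPoint`, `win32process.GetWindowThreadProcessId`) are real, and are imported lazily so the rest of the sketch also loads on a non-Windows machine.

```python
def make_guard(allowed_pids, cursor_pos, window_at, pid_of,
               is_special_window=lambda hwnd: False):
    """Return a function that says whether it is safe to send the next event.

    cursor_pos()  -> (x, y) of the mouse
    window_at(pt) -> window handle under that point, or 0/None if there is none
    pid_of(hwnd)  -> process ID that owns the window
    is_special_window(hwnd) -> hypothetical hook for windows that belong to the
        app but live in another process (e.g. file browser dialogs)
    """
    def safe_to_send():
        hwnd = window_at(cursor_pos())
        if not hwnd:
            return False
        if pid_of(hwnd) in allowed_pids:
            return True
        return bool(is_special_window(hwnd))
    return safe_to_send


def win32_guard(app_pid):
    """Wire the guard to the real win32 APIs (Windows only, needs pywin32)."""
    import win32gui
    import win32process

    return make_guard(
        allowed_pids={app_pid},
        cursor_pos=win32gui.GetCursorPos,
        window_at=win32gui.WindowFromPoint,
        # GetWindowThreadProcessId returns (thread_id, process_id).
        pid_of=lambda hwnd: win32process.GetWindowThreadProcessId(hwnd)[1],
    )
```

The event loop would call the guard before every single event and stop (or pause) as soon as it returns `False`, so a crash of the target application can't leave the monkey typing into whatever window ends up in front.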
The longest Anukari has run with the chaos monkey so far is about 10 hours, with no crashes. Other things looked good too; for example, it doesn't leak memory. I have a few more ideas on how to make the chaos monkey even more likely to catch bugs, but for now I'm pretty satisfied.
Here's a quick video of the chaos monkey interacting with Anukari. Note that during the periods where the mouse isn't doing anything, it's mashing hotkeys like crazy. I'm starting to feel much more confident about Anukari's stability.
Writing bad crypto code
Captain's Log: Stardate 78089.3
Today I tried to slam out as much work as I could on the lower-level C++ code to do all the license activation stuff. Since I got all the PITA stuff done yesterday to get OpenSSL, libcurl, etc. working, today was actually fairly productive. The C++ code now has APIs to call the license API server to register a new license key, save the signed key to disk, check if the key is still valid, and so on.
I find all this anti-piracy stuff pretty annoying/frustrating, since none of the work makes the plugin more fun to use. And one particular worry I have is that I want to make VERY sure that it never screws over someone who has paid for the software. So I have been thinking a lot about testing the license activation code. I wrote a few unit tests, but eventually got to the HTTP parts, and mocking out the HTTP responses just didn't feel great.
The solution I came to was to actually just write a bunch of integration tests that have the C++ license activation/verification code call the production license API server. I added some hidden/test-only APIs that allow the integration tests to reset the server state, and created a test-only license key. So the tests actually go through all the real flows a user will go through and make sure the C++ code and the API server talk to one another correctly, and that the API database reflects the correct state, etc.
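The general shape of one such test is: reset the server state for the test key, run the flow, and check the server state after each step. The sketch below shows that shape in Python purely for illustration. The real tests are C++ code calling the production server; the `InMemoryLicenseServer` class, its method names, and the test key are all hypothetical, and the in-memory class exists only so the sketch runs without a network.

```python
class InMemoryLicenseServer:
    """Hypothetical stand-in with the same shape as a license API client.

    In a real integration test, each method would be an HTTP call to the
    license server; `reset` corresponds to a hidden, test-only endpoint.
    """

    def __init__(self):
        self._registered = set()

    def reset(self, key):
        # Test-only: forget everything about this key.
        self._registered.discard(key)

    def register(self, key):
        self._registered.add(key)

    def is_registered(self, key):
        return key in self._registered


def activation_round_trip(server, key):
    """Run reset -> register -> cleanup, recording the state seen at each step."""
    observed = []
    server.reset(key)
    observed.append(("after reset", server.is_registered(key)))
    server.register(key)
    observed.append(("after register", server.is_registered(key)))
    server.reset(key)  # leave the server clean for the next test
    observed.append(("after cleanup", server.is_registered(key)))
    return observed


result = activation_round_trip(InMemoryLicenseServer(), "TEST-ONLY-KEY")
```

Starting every test with a reset means a test that failed halfway through can't poison the next run, which matters more than usual when the server on the other end is a shared, long-lived one.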
I'm quite happy with this testing solution, because it gives me a ton of confidence that the client/server interactions all work the way they should. And since nothing is mocked, it's a really faithful test of what the actual binary will do.
Tomorrow I'll start on the C++ GUI for entering the license key and so on.
Prepping website for pre-alpha
Captain's Log: Stardate 78026.5
I've been having a lot of fun working on the website, so I'm going with that and focusing on getting a lot of web stuff done.
Auth is working pretty well now, and I'm starting to get set up to send the various account lifecycle emails. I have a bit of a start on some of the ancillary auth stuff like password resets and so forth, and have figured out how React form actions work, which let me clean up all the various auth forms. Form fields are validated server-side and the errors are automatically shown inline in the form client-side. It's all very nice and simple. (I continue to be impressed and pleased with React and Next.js.)
I've also done some annoying grunt work like setting up CSRF protections that automatically work on all forms.
Given how fast this is all going, I probably will just charge forward on the website until it's basically ready for a closed pre-alpha, in terms of providing the API endpoints for product key registration, and the GUI for managing product keys. This isn't strictly necessary at this point, but it certainly won't hurt to test out the web stuff during the pre-alpha, and like I said, I'm having fun building it so why not?