The chaos monkey lives
In the last couple of days I finally got around to building the "chaos monkey" that I've wanted to have for a long time. The chaos monkey is a script that randomly interacts with the Anukari GUI with mouse and keyboard events, sending them rapidly and with intent to cause crashes.
I first heard about the idea of a chaos monkey from Netflix, who have a system that randomly kills datacenter jobs. This is a really good idea, because you never actually know that you have N+1 redundancy until one of the N jobs/servers/datacenters actually goes down. Too many times I have seen systems that supposedly had N+1 redundancy die when just one cluster failed, because nobody had tested this, and surprise, the configuration somehow actually depends on all the clusters being up. Netflix has the chaos monkey, and at Google we had DiRT testing, where we simulated things like datacenter failures on a regular basis.
But the "monkey" concept goes back to 1983 with Apple testing MacPaint. Wikipedia claims that the Apple Macintosh didn't have enough resources to do much testing, so Steve Capps wrote the Monkey program which automatically generated random mouse and keyboard inputs. I read a little bit about the original Monkey and it's funny how little has changed since then. They had the problem that it only ran for around 20 minutes at first, because it would always end up finding the application quit menu. I had the same problem, and Anukari now has a "monkey mode" which disables a few things like the quit menu, but also dangerous things like saving files, etc.
The Anukari chaos monkey is decently sophisticated at this point. It generates all kinds of random mouse and keyboard inputs, including weird horrible stuff like randomly adding modifiers and pressing keys during a mouse drag. It knows how to move and resize the window (since resizing has been a source of crashes in the past). It knows about all the hotkeys that Anukari supports, and presses them all rapidly. I really hate watching it work because it's just torturing the software.
The chaos monkey has already found a couple of crashes and several less painful bugs, which I have fixed. One of the crashes was something I completely didn't expect: keyboard hotkey events deleting entities while a slider was being dragged to edit those entities' parameters. I never would have tested this manually because I didn't think it was even possible.
The chaos monkey is pretty simple. The biggest challenge was just keeping it from wreaking havoc on my workstation. I'm using pyautogui, which generates OS-level input events, meaning that the events get sent to whatever window is active. So at the start, if Anukari crashed, the chaos monkey would start torturing e.g. VSCode or Chrome or something. It was horrible, and a couple of times it got loose and went crazy. It also figured out how to send OS-level hotkeys to open the task manager, etc.
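For a flavor of what that torture loop looks like, here's a minimal sketch of the kind of random-event generation involved, using pyautogui. The window rectangle, hotkey list, and probabilities are placeholders for illustration, not Anukari's real bindings:

```python
import random
import time

import pyautogui

# Illustrative only: the real script knows Anukari's actual hotkeys and
# window geometry. pyautogui sends OS-level events, so whatever window
# currently has focus receives them.
HOTKEYS = ["delete", "ctrl+z", "ctrl+c", "ctrl+v", "r", "s"]

def random_point(region):
    left, top, width, height = region
    return (random.randint(left, left + width - 1),
            random.randint(top, top + height - 1))

def random_event(region):
    roll = random.random()
    if roll < 0.4:
        # Plain click somewhere in the window.
        pyautogui.click(*random_point(region))
    elif roll < 0.7:
        # Drag, sometimes holding a modifier and mashing a key mid-drag.
        start, end = random_point(region), random_point(region)
        pyautogui.moveTo(*start)
        pyautogui.mouseDown()
        if random.random() < 0.5:
            pyautogui.keyDown("shift")
            pyautogui.press(random.choice("abcdefg"))
            pyautogui.keyUp("shift")
        pyautogui.moveTo(*end, duration=0.1)
        pyautogui.mouseUp()
    else:
        # Mash a hotkey.
        pyautogui.hotkey(*random.choice(HOTKEYS).split("+"))

if __name__ == "__main__":
    window_region = (100, 100, 1280, 720)  # placeholder for Anukari's window rect
    while True:
        random_event(window_region)
        time.sleep(0.01)
```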
The main safety protection I ended up implementing is that, prior to each mouse or keyboard event, the script uses the win32 APIs to query the window under the mouse and verifies that it's Anukari. There's some fiddly stuff here, like figuring out whether a window has the same process ID as Anukari (some pop-up menus don't have Anukari as a parent window), and some special handling for file browser menus, which don't even share the process ID. But overall I've gotten it to the point where I have let it run for hours on my desktop without worry.
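Roughly, the per-event safety check looks something like this. It's a sketch using pywin32 and psutil; the file-dialog allowance shown here (matching on a host process name) is a made-up stand-in for the real special-casing, which is fiddlier:

```python
import psutil
import win32gui
import win32process

def pid_under_mouse():
    """Return the process ID that owns the window under the cursor."""
    x, y = win32gui.GetCursorPos()
    hwnd = win32gui.WindowFromPoint((x, y))
    _, pid = win32process.GetWindowThreadProcessId(hwnd)
    return pid

def safe_to_send(anukari_pid):
    """Only send an event if the window under the mouse belongs to Anukari.

    The real check also walks parent windows and special-cases the OS file
    browser dialog, which runs under a different process ID; the
    "explorer.exe" match below is just a hypothetical example of that.
    """
    pid = pid_under_mouse()
    if pid == anukari_pid:
        return True
    try:
        return psutil.Process(pid).name().lower() == "explorer.exe"
    except psutil.NoSuchProcess:
        return False
```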
The longest Anukari has run with the chaos monkey so far is about 10 hours with no crashes. Other things looked good too; for example, it doesn't leak memory. I have a few more ideas on how to make the chaos monkey even more likely to catch bugs, but for now I'm pretty satisfied.
Here's a quick video of the chaos monkey interacting with Anukari. Note that during the periods where the mouse isn't doing anything, it's mashing hotkeys like crazy. I'm starting to feel much more confident about Anukari's stability.
Complications from custom 3D models
While working on the new 3D model assets for Anukari, one "little" TODO that I filed away was to make the mouse interactions with the 3D objects work correctly in the presence of custom models. This includes things like mouse-picking or box-dragging to select items.
In the first revision, because the 3D models were fixed, I simply hard-coded each entity type's hitbox via a cylinder or sphere, which worked great. However, with custom 3D models this is no longer tenable. The old hitboxes did not necessarily correspond to the shape or size of the custom models. This became really obvious and annoying quite quickly as we began to change the shapes of the models.
This was one of those problems that seems simple, but spirals into something much more complicated.
My first idea was to use Google Filament's mouse picking feature, which renders entity IDs to a hidden buffer and then samples that buffer to determine which entity was clicked. This has the advantage of being pixel-perfect, but it uses GPU resources to render the ID buffer, and it also requires the GUI thread to wait a couple frames for the Renderer thread to do the rendering and picking. Furthermore, it doesn't help at all with box-dragging, as this method is not capable of selecting entities that are visually obscured by closer entities. But in the end, the real killer for this approach was the requirement for the GUI thread to wait on the Renderer thread to complete a frame or two. This is doable, but architecturally problematic, due to the plans that I have to simplify the mutex situation for the underlying data model -- but that's a story for another day.
My second idea was to write some simple code to read the model geometry and approximate it with a sphere, box, or cylinder, and use my existing intersection code based on that shape. But I really, really don't want to find myself rewriting the mouse-picking code again in 3 months, and I decided that this approach just isn't good enough -- some 3D models would have clickable areas that were substantially different from their visual profile.
So finally I decided to just bite the bullet and use the Bullet Physics library for collision handling. It supports raycasting, which I use for mouse-picking, and then I use generalized convex collision detection for frustum picking. The documentation for Bullet sucks really hard, but with some help from ChatGPT it wasn't too bad to get up and running. The code now approximates the 3D model geometry with a simplified 42-vertex convex hull, which is extremely fast for the collision methods I need, and approximates even weird shapes quite well (I tried using the full un-approximated 3D geometry for pixel-perfect picking, but it was too slow). I'm very happy with the results, and it seems that pretty much any 3D model someone can come up with will work well with mouse interactions.
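Anukari talks to Bullet's C++ API directly, but the idea is easy to show with the pybullet bindings: register each model's (simplified) vertex cloud as a convex collision shape, then cast a ray from the camera through the mouse position. The mesh and ray endpoints below are toy placeholders:

```python
import pybullet as p

# Headless Bullet world used purely for collision queries, no simulation step.
p.connect(p.DIRECT)

def add_pickable(vertices, position):
    """Register an entity's hull for picking; returns Bullet's body id.

    Passing raw vertices with GEOM_MESH makes Bullet build a convex hull
    around them; Anukari pre-simplifies the hull to a small vertex count.
    """
    shape = p.createCollisionShape(p.GEOM_MESH, vertices=vertices)
    return p.createMultiBody(baseMass=0,
                             baseCollisionShapeIndex=shape,
                             basePosition=position)

def pick(ray_from, ray_to):
    """Cast the mouse ray; return (body id, hit position) or None."""
    hit = p.rayTest(ray_from, ray_to)[0]
    return None if hit[0] < 0 else (hit[0], hit[3])

# Toy example: a unit tetrahedron standing in for a real 3D model.
verts = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
body_to_entity = {add_pickable(verts, position=[0, 0, 0]): "body_1"}
print(pick(ray_from=[0.2, 0.2, 5.0], ray_to=[0.2, 0.2, -5.0]))
```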
The things that made this a week-long job rather than a 2-day job were the ancillary complications. The main issue is that while the old hard-coded hitboxes were fixed at compile-time, the new convex hull hitboxes are only known by the Renderer thread, and can dynamically change when the user changes the 3D skin preset. This introduced weird dependencies between parts of the codebase that formerly did not depend on the Renderer. I ended up solving this problem by creating an abstract EntityPicker interface which the Renderer implements, so at least the new dependencies are only on that interface rather than the Renderer itself.
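A rough sketch of the shape of that interface, with illustrative method names rather than Anukari's actual API:

```python
from abc import ABC, abstractmethod

class EntityPicker(ABC):
    """Geometry queries that used to be compile-time constants.

    The Renderer implements this, since only it knows the convex hulls for
    the currently loaded 3D skin; other code depends only on the interface.
    """

    @abstractmethod
    def pick(self, ray_origin, ray_direction):
        """Return the entity id hit by a mouse ray, or None."""

    @abstractmethod
    def pick_frustum(self, frustum_planes):
        """Return the entity ids whose hulls intersect a drag-box frustum."""

    @abstractmethod
    def aabb(self, entity_id):
        """Return the entity's world-space (min, max) bounding box."""
```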
An example here is when the user copies and pastes a group of entities. The data model code that does this has a bunch of interesting logic to figure out where the new entities should go, in order to avoid overlapping them with any existing entities. It's a tricky problem because we want them to go as close as possible to where the user is looking, but have to progressively fall back to worse locations if the best locations are not available. Anyway, this requires being able to query the AABBs of existing entities, which is now geometry-dependent.
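In simplified terms, that placement logic reduces to something like the following sketch, assuming the EntityPicker interface above and a best-to-worst list of candidate offsets (both hypothetical; the real candidate ordering is smarter about where the user is looking):

```python
def aabbs_overlap(a, b):
    """Each AABB is ((min_x, min_y, min_z), (max_x, max_y, max_z))."""
    (a_min, a_max), (b_min, b_max) = a, b
    return all(a_min[i] <= b_max[i] and b_min[i] <= a_max[i] for i in range(3))

def translate_aabb(aabb, offset):
    lo, hi = aabb
    return (tuple(lo[i] + offset[i] for i in range(3)),
            tuple(hi[i] + offset[i] for i in range(3)))

def place_pasted_group(group_aabb, candidate_offsets, picker, existing_ids):
    """Return the first candidate offset whose AABB overlaps nothing existing.

    candidate_offsets is ordered best-to-worst (closest to where the user is
    looking first); picker is the EntityPicker sketched above.
    """
    existing = [picker.aabb(e) for e in existing_ids]
    for offset in candidate_offsets:
        moved = translate_aabb(group_aabb, offset)
        if not any(aabbs_overlap(moved, other) for other in existing):
            return offset
    return candidate_offsets[-1]  # everything overlapped; accept the last resort
```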
Another example is when creating new entities. This is similar to copying and pasting, but the requirement to have the entity go near where the user clicked the mouse is less flexible. A final example is rotating entities, where the rotational sphere radius needs to be known, as well as the desired diameter for the "rotation cage" that appears around the entity to indicate it's being rotated.
Anyway, it took a few days but finally I think I have all of these use-cases working correctly. Fortunately I had unit tests for all this stuff, so that helped a lot. This is a pretty nice milestone, since I think this is the last "heavy lift" for the new 3D model configurability feature.
As usual, there are still a few fiddly details that I need to address. The biggest one is that it's a little slow when you delete 1,000 entities. This is an edge case, but it is noticeable and irritates me. I think I know what I'll do to speed it up, but we'll see.
The new warp alignment optimizer
As I mentioned in the previous post, I've been working on writing a better algorithm for optimizing the way that entities are aligned to GPU warps (or as Apple calls them, SIMD-groups). For the sake of conversation, let's assume that each GPU threadgroup is 1024 threads, and those are broken up into 32 warps, each of which has 32 threads. (These happen to be actual numbers from both my NVIDIA chip and my Apple M1.)
Each warp shares an instruction pointer. This is why Apple's name for them makes sense: each SIMD-group is kind of like a 32-data-wide SIMD unit. In practice these are some pretty sophisticated SIMD processors, because they can do instruction masking, allowing each thread to take different branches. But the way this works for something like "if (x) y else z" is that the SIMD unit executes the instructions for BOTH y and z on every thread, but y is masked out (has no effect) for threads that take the z branch, and z is masked out for threads that take the y branch. This is not a huge deal if y and z are simple computations, but if each branch has dozens of instructions, you have to wait for both branches to execute serially, which is slow.
Note that this penalty is only paid if there are actually threads that take both branches. If all the threads take the same branch, no masking is needed and things are fast. This is the key thing: at runtime, putting computations with similar branch flows in the same warp is much faster than mixing computations with divergent branch flows.
For Anukari, the most obvious way to do this is to group entities by type. Put sensors in a warp with other sensors, put bodies in a warp with other bodies, etc. In total, including sub-types, Anukari has 11 entity types. This means that for small instruments, we can easily sort each entity type into its own warp, and get huge speedups. This is really the main advantage of porting from OpenCL to CUDA: just like Apple, NVIDIA artificially limits OpenCL to just 8 32-thread warps (256 threads). If you want to use the full 32 32-thread warps (1024 threads), you have to port to the hardware's native language. And we really do want that, because with OpenCL's 8 warps, we're much more likely to have to double up two entity types in one warp. Having 32 warps gives us a ton of flexibility.
The Algorithm
So we have 11 entity types going into 32 buckets. This is easy until we consider that an instrument may have hundreds of one entity type, and zero of another, or that instruments might have 33 of one entity type, which doesn't fit into a single bucket, etc. What we have here is an optimization problem. It is related to the bin-packing problem, but it's more complicated than the vanilla bin-packing problem because there are additional constraints, like the fact that we can break groups of entities into sub-groups if needed, and the fact that it needs to run REALLY fast because this happens on the audio thread whenever we need to write entities to buffers.
I'm extremely happy with the solution I ended up with. First, we simplify the problem:
- Each entity type is grouped together into a contiguous unit, which might internally have padding but will never be separated by another entity type.
- We do not consider reordering entity types: there is a fixed order that we put them in and that's it. This order is hand-chosen so that the most expensive entities are not adjacent to one another, and thus are unlikely to end up merged into the same warp.
A quick definition: a warp's "occupancy" will be the number of distinct types of entities that have been laid out within that warp. So if a warp contains some sensors, and some LFOs, its occupancy would be 2.
The algorithm then is as follows:
- Pretend we have infinite warps, and generate a layout that would be optimal. Basically, assign each entity type enough warps such that the maximum occupancy is 1. (This means that some warps might be right-padded with no-op entities.)
- If the current layout fits into the actual number of warps the hardware has, we are done.
- If not, look for any cases where dead space can be removed without increasing any warp's occupancy. (On the first iteration, there won't be any.)
- Increment the maximum allowable occupancy by 1, and merge together any adjacent warps that, after being merged, will not exceed this occupancy level.
- Go back to step 2.
That's it! This is a minimax optimizer: it tries to minimize the maximum occupancy of any warp. It does this via the maximum allowable occupancy watermark. It tries all possible merges that would stay within a given occupancy before trying the next higher occupancy.
There are a couple of tricks to make this efficient, but the main one is that at the start of the algorithm, we make a conservative guess as to what the maximum occupancy will have to be. This way, if the solution requires occupancy 11 (say there are 1000 bodies and 1 of each remaining entity type, so the last warp has to contain all 11 types), we don't have to waste time merging things for occupancy 2, 3, 4, ... 11. It turns out that most of the time it's quite easy to guess within 1-2 occupancy levels below the true value. I wrote a fuzz test for the algorithm, and across 5,000 random entity distributions the worst case was 5 optimizer iterations, and that's rare. Anyway, it's plenty fast enough.
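Here's a compact sketch of the whole loop in Python. It's illustrative, not Anukari's actual code (which is C++ and runs on the audio thread): the warp and threadgroup sizes are the example numbers from above, the entity type names are stand-ins, each warp is modeled as a multiset of entity types, the starting occupancy is 1 rather than the conservative guess just described, and the raise is a placeholder for however instruments that simply don't fit would really be handled:

```python
import math
from collections import Counter

WARP_SIZE = 32   # threads per warp (example number from above)
MAX_WARPS = 32   # warps per threadgroup (example number from above)

def initial_layout(entity_counts):
    """Step 1: pretend warps are infinite, one entity type per warp.

    Warps holding fewer than WARP_SIZE entities are implicitly right-padded
    with no-op entries.
    """
    warps = []  # each warp is a Counter {entity_type: count}, total <= WARP_SIZE
    for etype, count in entity_counts:          # fixed, hand-chosen type order
        for _ in range(math.ceil(count / WARP_SIZE)):
            take = min(count, WARP_SIZE)
            warps.append(Counter({etype: take}))
            count -= take
    return warps

def merge_adjacent(warps, max_occupancy):
    """Greedily merge adjacent warps that fit and stay within max_occupancy."""
    merged = [warps[0]]
    for warp in warps[1:]:
        candidate = merged[-1] + warp
        if (sum(candidate.values()) <= WARP_SIZE
                and len(candidate) <= max_occupancy):  # occupancy == distinct types
            merged[-1] = candidate
        else:
            merged.append(warp)
    return merged

def optimize_layout(entity_counts):
    warps = initial_layout(entity_counts)
    max_occupancy = 1   # the real optimizer starts from a conservative guess
    while len(warps) > MAX_WARPS:
        if max_occupancy >= len(entity_counts):
            raise ValueError("instrument does not fit in one threadgroup")
        max_occupancy += 1
        warps = merge_adjacent(warps, max_occupancy)
    return warps

# Toy example: lots of bodies, a handful of everything else.
demo = [("body", 900), ("spring", 40), ("sensor", 3), ("lfo", 2)]
for i, warp in enumerate(optimize_layout(demo)):
    print(i, dict(warp))
```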
The solutions the optimizer produces are excellent. In cases where there's a perfect solution available, it always gets it, because that's what it tries first. And in typical cases where compromise is needed, it usually finds solutions that are as good as what I could come up with manually.
Results
That's all well and good, but does it work? Yes. Sadly I don't have a graph to share, because this doesn't help with any of my microbenchmarks; those are all tiny instruments for which the old optimizer worked fine.
But for running huge complex instruments, it is an ENORMOUS speedup, often up to 2x faster with the new optimizer. For example, for the large instrument in this demo video, it previously was averaging about 90% of the latency budget, with very frequent buffer overruns (the red clip lines in the GPU meter). That instrument is now completely usable with no overruns at all, averaging maybe 40% of the latency budget. Other benchmark instruments show even better gains, with one that never went below 100% of the latency budget before now at about 40%.
This opens up a TON more possibilities in terms of complex instruments. I think at this point, at least on Windows with NVIDIA hardware, I am completely satisfied with the performance. Apple with Metal is almost there but still needs just a tiny bit more work for me to be satisfied.