New optimization: mixing mics on GPU
Captain's Log: Stardate 78115.4
The pre-alpha has been very helpful so far. I'm getting a lot of info about DAW compatibility, which looks like something I'll need to address on a DAW-by-DAW basis. Fortunately a lot of DAWs have generous free trials, so I might be able to get things debugged without buying all the DAWs in existence. Anyway, I expected this, and it's just a matter of working through all the issues.
Sadly, the reports of Mac performance have not been good. I expected this a bit less. I think what happened is that I got Mac performance to an OK place (not amazing yet, since that has to wait for the Metal port), but then I did not do enough testing after adding the MPE / instanced voice support. I knew that this slowed things down, but on Windows it was tolerable. It seems that on Mac the performance hit was harder.
The good news is that I have an idea to solve this particular performance problem, and not only will it improve performance on both Windows and Mac, but it should also make things faster than they were before MPE was added.
The key issue with MPE/instancing is that my naive implementation ended up copying a lot more audio sample data between the CPU and GPU, which is slow. The problem is that there's a trade-off between mapping exactly the data that needs to be copied, which means a lot of map calls, and copying unneeded data via fewer, bigger maps. The way the audio sample data is strided has a big impact here: certain striding patterns waste very little copy bandwidth, but at the cost of way too many map calls.
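To make the trade-off concrete, here's a rough back-of-the-envelope sketch. The layout and numbers are made up for illustration, not the engine's actual ones, but they show the shape of the problem:

```cpp
// Hypothetical layout: numInstances x numMics channels strided through GPU
// memory, with only a subset of channels actually in use on a given block.
#include <cstdio>

int main() {
    const int numInstances = 64;    // assumed instance count
    const int numMics      = 4;     // assumed mics per instance
    const int activeChans  = 24;    // channels actually needed this block
    const int blockSamples = 128;   // samples per audio block
    const int bytesPerChan = blockSamples * sizeof(float);

    // Strategy A: map exactly what we need -> zero waste, many map calls.
    int mapsA  = activeChans;
    int bytesA = activeChans * bytesPerChan;

    // Strategy B: one big map over the whole strided region -> one call,
    // but we pay for every channel whether it is in use or not.
    int mapsB  = 1;
    int bytesB = numInstances * numMics * bytesPerChan;

    printf("A: %d maps, %d bytes copied\n", mapsA, bytesA);
    printf("B: %d map, %d bytes copied (%.1fx the useful data)\n",
           mapsB, bytesB, (double)bytesB / bytesA);
}
```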
Anyway, the reason for this predicament is that the final microphone audio mixing is done on the CPU, so every microphone's sample data for every instrument instance has to be copied off the GPU. And even if an instance/mic is not in use, we still might copy its data because it sits in between other data we mapped. You might be able to see where this is going...
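In other words, the naive CPU-side mixdown looks roughly like this (names and layout are hypothetical, just to show why every instance/mic channel has to come back to the host before anything can be summed):

```cpp
// Sketch only: mix every instance's mic channels into per-mic output buses,
// which requires all instance x mic sample data to already be on the CPU.
void mixOnCpu(const float* hostCopy,  // [instances][mics][samples], copied back
              float* micBus,          // [mics][samples] output buses
              int numInstances, int numMics, int blockSamples) {
    for (int mic = 0; mic < numMics; ++mic) {
        float* out = micBus + mic * blockSamples;
        for (int s = 0; s < blockSamples; ++s) out[s] = 0.0f;
        for (int inst = 0; inst < numInstances; ++inst) {
            const float* in =
                hostCopy + (inst * numMics + mic) * blockSamples;
            for (int s = 0; s < blockSamples; ++s) out[s] += in[s];
        }
    }
}
```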
The solution is to do the mixing on the GPU. This will dramatically decrease the GPU -> CPU copy bandwidth, and the mixing computation might even be faster because it will be done in parallel and with faster memory. I will have the GPU mix the data down to the absolute minimum that the CPU needs, so there will be zero wasted copy bandwidth. And the GPU will pack it into a format that requires only a single map call to copy. Best of both worlds.
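As a rough sketch of the idea (CUDA-style here purely for illustration; the real implementation will use whatever GPU API the engine already runs on, and its own layout), each GPU thread can own one (mic, sample) slot, sum across all instances in fast GPU memory, and write into a small packed buffer that the host pulls back with a single small transfer:

```cpp
// Hypothetical mixdown kernel: reduces [instances][mics][samples] down to a
// packed [mics][samples] buffer so the host copies only the mixed result.
__global__ void mixMicsKernel(const float* __restrict__ voices, // [instances][mics][samples]
                              float* __restrict__ mixed,        // [mics][samples]
                              int numInstances, int numMics, int blockSamples) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numMics * blockSamples) return;
    int mic    = idx / blockSamples;
    int sample = idx % blockSamples;

    float sum = 0.0f;
    for (int inst = 0; inst < numInstances; ++inst) {
        sum += voices[(inst * numMics + mic) * blockSamples + sample];
    }
    mixed[mic * blockSamples + sample] = sum;  // packed, contiguous output
}

// Host side afterwards: one small copy of the packed result, e.g.
// cudaMemcpy(hostMixed, mixed, numMics * blockSamples * sizeof(float),
//            cudaMemcpyDeviceToHost);
```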
There are some other details to consider, like peak detection (used to show mic signal levels in 3D), but I have worked out how to do those on the GPU in an efficient way.
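For example, per-mic peak detection can be a simple parallel max-|x| reduction that runs right after the mix, so the peaks ride along in the same packed buffer the host already copies back. Again, a hedged CUDA-style sketch, not the actual code:

```cpp
// Hypothetical peak-detection kernel: one block per mic, tree reduction of
// max(|sample|) in shared memory. Assumes blockDim.x is a power of two.
__global__ void micPeaksKernel(const float* __restrict__ mixed,  // [mics][samples]
                               float* __restrict__ peaks,        // [mics]
                               int blockSamples) {
    extern __shared__ float scratch[];   // one float per thread
    int mic = blockIdx.x;                // one block per mic
    int tid = threadIdx.x;

    // Each thread scans a strided slice of this mic's samples.
    float localMax = 0.0f;
    for (int s = tid; s < blockSamples; s += blockDim.x) {
        localMax = fmaxf(localMax, fabsf(mixed[mic * blockSamples + s]));
    }
    scratch[tid] = localMax;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            scratch[tid] = fmaxf(scratch[tid], scratch[tid + stride]);
        }
        __syncthreads();
    }
    if (tid == 0) peaks[mic] = scratch[0];
}
```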
I am pretty sure this will be a big speedup. I hope but don't know for sure that it will make Mac usable. Now to implement what I've drawn on the chalkboard...