Figured you might like to be able to see my todo list. I just moved it out of my notepad file and into trello, so everything is listed as just being added.
I don’t think he’ll mind if I share these tidbits from his victories. All of the below is written by him:
– moved main sim execution to separate thread
– moved most of the “ui reading the current gamestate” stuff to that thread too
– moved the generation of sync data to that thread as well
– moved the update-visual-object stuff outside of process-sim-step; still on the main thread of course but now it’s easier for you to profile it without sim stuff cluttering up the results
– various performance improvements, some of which are fairly significant
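A minimal sketch of the general shape of that change, in Python rather than the game's C#. Every name here (`SimThread`, `read_snapshot`, the snapshot fields) is hypothetical and not taken from the actual codebase. The worker thread runs the sim step and publishes a fresh snapshot, while the main thread only ever reads the latest one.

```python
import threading
import time


class SimThread:
    """Runs a stand-in sim step on a worker thread and publishes snapshots."""

    def __init__(self, step_seconds=0.1):
        self._step_seconds = step_seconds
        self._lock = threading.Lock()
        self._snapshot = {"step": 0, "ship_count": 0}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        step = 0
        while not self._stop.is_set():
            step += 1
            # Stand-in for real sim work: build a brand-new snapshot object.
            new_snapshot = {"step": step, "ship_count": 100 + step}
            with self._lock:
                self._snapshot = new_snapshot
            # Sleep until the next step, but wake early if asked to stop.
            self._stop.wait(self._step_seconds)

    def read_snapshot(self):
        # The main thread only grabs a reference; it never mutates the snapshot.
        with self._lock:
            return self._snapshot
```

Because each snapshot is built fresh and then swapped in under the lock, the reader never sees a half-updated state. That is one common way to structure this kind of split, not necessarily the one used here.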
So right now there’s some volatility in the visualization code, but it’s not causing problems yet since I think it’s mostly dealing in primitive types. It will need to be dealt with later but doesn’t appear pressing.
Performance-wise the sim’s burden on the main thread is almost gone during most frames. There are still some spikes when a lot of entities die at once, as that’s a lot of list removal work. I can actually optimize that a bit more, but with the sim-on-the-main-thread almost always under 2ms (and often way under that) I figured this was ok for now 🙂
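To illustrate why mass deaths make for expensive list work, here is a hedged Python sketch. The post does not say how the removal is actually done or how it would be optimized; the entity dictionaries and function names below are made up for the example. Removing entries one at a time shifts the rest of the list on every call, whereas compacting the survivors in a single pass touches each element once.

```python
def remove_dead_one_by_one(entities, dead_ids):
    """Naive version: every list.remove() scans and shifts the list again."""
    for entity in [e for e in entities if e["id"] in dead_ids]:
        entities.remove(entity)


def remove_dead_in_one_pass(entities, dead_ids):
    """Slide survivors toward the front, then truncate once at the end."""
    write = 0
    for entity in entities:
        if entity["id"] not in dead_ids:
            entities[write] = entity
            write += 1
    del entities[write:]
```

Both functions leave the same survivors in the same order; the second just does it in one sweep, which matters most when many entities die in the same step.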
Oh, I should also mention that the main thread per-frame heap alloc is also almost gone, and most of that was eliminated rather than just shifted elsewhere.
One Hour Later
I’m currently working on the other-thread heap alloc, via the deep-profiler and setting SimPlanningLoop.IsOnMainThreadForProfilingPurposes to true. It just makes everything execute sequentially, rather than farming it out.
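A switch like that can be sketched as follows, in Python rather than C#. The class and parameter names are hypothetical stand-ins, not the real `SimPlanningLoop`. When the flag is set, every job runs inline on the calling thread, so a profiler sees one sequential call tree; otherwise the jobs are farmed out to a pool.

```python
from concurrent.futures import ThreadPoolExecutor


class PlanningLoop:
    """Runs planning jobs either inline (for profiling) or on worker threads."""

    def __init__(self, run_on_calling_thread=False, workers=4):
        self.run_on_calling_thread = run_on_calling_thread
        self._workers = workers

    def run(self, jobs):
        if self.run_on_calling_thread:
            # Sequential path: same work, same order, no threads involved.
            return [job() for job in jobs]
        with ThreadPoolExecutor(max_workers=self._workers) as pool:
            futures = [pool.submit(job) for job in jobs]
            # Collect in submission order so both paths return the same list.
            return [future.result() for future in futures]
```

Keeping the results identical between the two paths is what makes the sequential mode trustworthy as a profiling stand-in.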
Another Hour Later
Just nuked the heap alloc of the short term planning threads (which do most of the work) from about 800k to about 40k; I can cut about 25k more, too, just needed to go now. The remainder looks like legitimate allocation on the partition list (used for find-everything-within-range checks), and that should in theory stop once the list is big enough that all future checks use less space.
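The "stops allocating once the list is big enough" behaviour can be sketched like this. It is a Python illustration with made-up names (`RangeQueryBuffer`, `grow_events`); the real partition list is not shown in the post. The backing storage is kept between queries and only grows when a query returns more hits than any earlier one.

```python
class RangeQueryBuffer:
    """Reusable result buffer: grows when needed, never shrinks."""

    def __init__(self):
        self._items = []       # backing storage, kept between queries
        self.count = 0         # number of valid results from the latest query
        self.grow_events = 0   # how many times the storage had to grow

    def fill(self, positions, center, radius):
        """Collect every 2D position within radius of center; return the hit count."""
        cx, cy = center
        radius_sq = radius * radius
        self.count = 0
        for pos in positions:
            dx, dy = pos[0] - cx, pos[1] - cy
            if dx * dx + dy * dy <= radius_sq:
                if self.count == len(self._items):
                    self._items.append(pos)   # the only path that grows storage
                    self.grow_events += 1
                else:
                    self._items[self.count] = pos
                self.count += 1
        return self.count

    def results(self):
        # Convenience copy for callers; a hot path would index _items directly.
        return self._items[: self.count]
```

After the largest query so far has run, later queries of the same size or smaller reuse the existing slots and `grow_events` stops climbing.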
The execution thread is still doing 300k+ per go, and I’ll take a look at that later. Not sure what the long term (AI) thread is doing heap-wise, haven’t seen it in the results yet.
One of the things that I recently discovered was that asset bundles for meshes seem cross-compatible for at least Windows and OSX, but shaders for sure – and possibly materials – are not. The shaders bit I knew; DirectX9 shaders aren’t going to work in OpenGL.
So that’s wound up with us having to have separate asset bundles for each of the three platforms, which is a minor inconvenience but no big deal beyond that.
What is a bigger deal is that we’ve been using uAudio in order to stream mp3s off disk, and that doesn’t work on OSX. It was supposed to, I’m pretty positive, but anyhow it’s busted.
For that, I’m just going to throw my hands up and move elsewhere. I wanted a more unified audio pipeline anyhow, so that I can do some proper music ducking with sound effects playback integration once that’s in. uAudio was already going to complicate that for me (note that that’s one of a few reasons there are no sound effects, despite there being music, in the initial alpha of the game).
At any rate, I’m going to create a separate asset bundle for audio in general, and my hope is that this one can be cross-platform, which would be really nice. It won’t make much difference to you, but it saves me time uploading to Steam, and it gives more common ground between the different builds. Right now there’s almost a gig of stuff that is OS-specific, which is annoying.
Some folks were complaining to me about not wanting to support mp3 on principle, so I guess that wins out. 😉 With things in asset bundles, they’ll be in a streamable intermediate format that I believe is ogg. This will mean that in order to get the soundtrack you have to actually get the soundtrack, though; you won’t be able to just find it in your folders in the game install anymore. That’s an unintended side effect, but it will improve performance during gameplay, so there is that…
My old approach of reading oggs directly off disk was just too low-performance and memory-eating. The asset bundles approach will help preserve speed, which is a big deal in my opinion.
Well… okay then! Keith already moved the simulation off the main thread in the course of just this morning. Unbelievable. On my main machine, a GT72VR GRE Dominator Pro (so GTX 1070 and latest i7), I’m sitting pegged between 180fps and 200fps during the opening battles with all the settings cranked up and 8x MSAA and so on. Almost no giant lag spikes – amazing!
What is kind of odd is that now there are some GUI.Repaint spikes every so often that take 35ms to execute, but they are only every few seconds at most from what it seems like. I’m not sure what the deal is there, but it’s something to do with Unity’s GUI internals. I’m not happy about it, but one long frame every few seconds is not something I was really aware of until I did the profiling and saw it. I wasn’t noticing it while playing.
Strangely, the occlusion culling spikes seem to have disappeared, too. I don’t really trust that, because if they disappear for random reasons then they can come back for random reasons, too. But it makes it no longer an imperative thing that I try to fix that in the next week… since right now there’s nothing available to even fix. But when it comes back… I’ll be ready and waiting. 😉
Anyhow… way to go, Keith! I’ll have to retest this on my mac later today and see how things feel now.
Okay, so… “funny” story?
I’ve been mostly running ubuntu in a Parallels VM on my Macbook. It’s worked well for the last number of years. However, now… well, it’s a combination of drivers and just other accumulated issues.
Basically, the short of it is that my linux install still works, kinda-sorta, but it’s buggy as all get out and randomly crashes and is constantly reporting errors in itself, steam, and various random programs.
This VM was perfectly capable of running Release Raptor, at least at a low framerate with a lot of quality settings off. It also ran a much earlier build of AI War 2.
However, now it literally gives a blue screen in the window that unity pops up when I try to start AI War 2. I think that this is because my VM is so hosed in general, and also because it’s running an older version of ubuntu than unity now supports… and which I can’t upgrade from due to lack of virtual disk space.
All of which is to say, while we DO have “day one alpha” linux support in the game, as promised, I am not presently sure if it works at all. I’m tired of VMs, and I work pretty much exclusively on laptops these days, so what I’ve decided to do is order a new Oryx Pro from system76, with ubuntu natively installed.
All my other machines are MSI, Asus, or Alienware at present, aside from my mac and a trio of old desktop machines in my attic. I don’t have room to run a bunch of desktop towers in my office, though, which is why I went the VM route in the first place.
I looked into getting ubuntu on my existing laptops, but it’s just one of those things that is so fraught because of lack of driver support from those vendors in a lot of cases (shakes fist). Trying to test the game on those would involve hours of actually setting up the ubuntu install, possibly corrupting my master boot records (thanks, Windows 10), and possibly still having a steaming mess of unsupported hardware.
I figured I should Do This Right, instead. The bad news? It may be another two weeks before that new linux machine actually reaches me. So if linux works out of the gate for you during alpha, then I’ll be overjoyed and that’s a relief. If there are problems, I can try to use your error messages to figure out what is up, but most likely I’ll need to have my own linux box before I can really get into it if it’s something complicated.
Hopefully it just works. But you know how that sort of thing can go.
I apologize for not being more on top of the linux build from the start, but it WAS working and it hadn’t occurred to me to test it lately and find that it was suddenly not. The way it dies instantly pretty much guarantees that it’s missing something in my particular install, and not our code doing something that breaks linux. My worry is more that if you get past the first issue on your machine (having a non-broken install and all that), you’ll instead run into something that’s broken in my part of the code.
Anyway, teething pains. Apologies again if it doesn’t work day 1, and fingers crossed that it will.
My Mac at present is an aging MacBook Pro, Retina, 15 inch, Early 2013. I’m still running 10.10.5, because I like to stay kind of out of date on that for testing purposes.
Worth pointing out for anyone not up on that hardware: the GPU is a measly Intel HD Graphics 4000 1024MB, and it’s running at 2880×1800. Tall order!
The GTX 1070 on my main windows machine is over 2600% faster than that little intel. Whew. Honestly that machine struggles REALLY hard to run Minecraft.
I also have a spare machine, this one running windows 8.1, that has a GTX 650M in it. I’m going to be benchmarking on that one, too – and it’s only about 145% better than the intel 4000. Still substantial, and it never had any trouble with Minecraft.
Anyway, overall from a first test with the various recent vis-layer performance improvements (but before those of the simulation thread today), I was pretty pleased with the results on my Mac. I had all the settings cranked all the way up, and while it was choppy it wasn’t a complete slideshow. That was at a high resolution, as much MSAA as it could muster, all the post-processing effects, all the special effects on max blast, full quality textures, etc.
The result wasn’t something I’d really want to play in terms of how it performed, but looking at that from a worst-case scenario I’m happy with how that was starting out. Cinth isn’t even done setting up quite all the ship LODs yet, so some of the vis layer was needlessly harsh, too.
I think some of the hitching was also due to music playback trying and failing to play, repeatedly. That may have been the biggest slowdown, I’m not sure.
Possibly if I get this updated to a newer version of OSX and see if it can run Metal, then I might be able to use the BC7 file format for textures, which right now is being transcoded to something else (possibly native uncompressed RGBA, which would really slow things down on the GPU bus).
For anyone already on a newer version of OSX, probably Metal support will “just work,” so long as your hardware supports it. I need to do a deeper dive into some of that, but I feel quite confident I can get this running in a way that is acceptably playable on my aging MacBook. It’s going to be one of those things that continually evolves over the next few months leading up to Early Access.
We’re headed for alpha of AI War 2 on Monday, and that’s for the people who want to get into the game “earlier than early,” which I’ll write more about later.
One of the things I wanted to talk about right now, though, is some hitching and jitter that will likely still be present to some degree in that first version. This is an issue that is cropping up this week due to two factors:
1. The simulation is becoming more complex, and thus more costly to run. It runs every 100ms, so that’s a hitch with an extra-long frame there.
2. The visualization code runs every frame, so usually 60-140 times per second on my machine, and so there’s a huge difference in the timing between the two sides. It’s only recently that this has become efficient enough for me to notice such a huge difference between the two groups of frames, though.
In other words, things were more balanced-out previously because we hadn’t finished certain optimizations AND because the simulation was still growing. Now the simulation has grown more and other optimizations are in, and so the jitter rears its head.
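The mismatch between the two rates is easy to see with a little arithmetic. The sketch below is Python, with a hypothetical function name and integer milliseconds to keep it exact. It steps through rendered frames and marks the ones that also have to carry a 100ms sim step.

```python
SIM_STEP_MS = 100  # from the post: the sim runs every 100ms


def frames_with_sim_step(frame_ms, total_frames):
    """Return the indices of rendered frames that also run a sim step."""
    accumulated = 0
    heavy_frames = []
    for frame in range(total_frames):
        accumulated += frame_ms
        if accumulated >= SIM_STEP_MS:
            accumulated -= SIM_STEP_MS
            heavy_frames.append(frame)
    return heavy_frames


# At 10ms per frame (100fps), one frame in ten carries the sim.
heavy = frames_with_sim_step(frame_ms=10, total_frames=30)
```

At 100fps only every tenth frame pays for the sim, so the faster the other nine frames get, the more that tenth one stands out as a hitch.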
I’m writing this so that hopefully nobody freaks out on Monday. 😉 It won’t take us long to resolve this, but that’s one of the casualties of being in “earlier than early.” This is part of how the hotdog is made that you get to see.
Last post, I wrote about some improvements made to the fixed-int math as well as RNG calculations for noncritical bits. Those help the jitters noticeably.
Overall what we’re finding, though, is that we simply need to move the main simulation off the main thread of the game (it’s already using a lot of worker threads, but now the core loop also needs to go).
That was already on Keith’s todo list, but it got bumped up in priority in the last day or so. I don’t think either of us has any idea if that will be ready by Monday or not, but for the sake of safety I’m going to just assume not.
The other thing that is causing periodic hitching is Unity’s own occlusion culling logic. It occasionally jumps in with a 28ms frame, once or twice every two seconds. That’s incredibly frustrating.
That said, I found something kinda-sorta similar when I was coding Release Raptor, and I implemented my own frustum culling solution that I had intended to port over to AI War 2, anyway. So that’s coming up soon, though likely not prior to Monday, either.
Occlusion culling, in this case, is basically just saying “hey, the camera isn’t pointed at me, so don’t send me to the GPU to render.” Unity has to calculate that sort of thing for every individual mesh that is rendered if I use their version. In my own version I can calculate that for things like entire squads. That cuts down the number of calculations enormously, and I can also calculate AABBs for the squads vastly faster than unity can for individual meshes.
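A rough sketch of the per-squad idea, in Python with hypothetical names; the actual Release Raptor culling code is not shown in the post. It builds one AABB around all the ships in a squad, then tests that single box against each view plane instead of testing every mesh. Planes are assumed to be given as `(nx, ny, nz, d)` with the inside being where `n·p + d >= 0`.

```python
def squad_aabb(ship_positions):
    """Axis-aligned bounding box (min corner, max corner) around a squad's ships."""
    xs, ys, zs = zip(*ship_positions)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def aabb_outside_plane(box_min, box_max, plane):
    """True if the whole box lies on the outside of the plane."""
    nx, ny, nz, d = plane
    # Pick the box corner farthest along the plane normal; if even that
    # corner is outside, the entire box is outside.
    px = box_max[0] if nx >= 0 else box_min[0]
    py = box_max[1] if ny >= 0 else box_min[1]
    pz = box_max[2] if nz >= 0 else box_min[2]
    return nx * px + ny * py + nz * pz + d < 0


def squad_visible(ship_positions, planes):
    """One test per squad: visible unless the box is fully outside some plane."""
    box_min, box_max = squad_aabb(ship_positions)
    return not any(aabb_outside_plane(box_min, box_max, p) for p in planes)
```

The test is conservative: a box near a corner of the view volume can be reported visible when it is not, which costs a little rendering but never hides something that should be drawn.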
That right there is a good example of the differences between generalized solutions (which must work in any game without knowledge of context) versus specialized solutions (which really wouldn’t work in general cases, but are incredibly efficient in certain specialized ones). It’s always about choosing a mixture of the two approaches, and right now the generalized approach is biting us in the rear more than I expected.
Particle System Updates!?
This one I don’t even know what it is, and I haven’t had time to investigate it yet. But every so often – I’m not even sure with what frequency – Unity is throwing a slow frame (again just under 30ms) where it is doing a “particle system update.”
Yes we’re using their shuriken particle system, but why it suddenly has to go through some sort of cleanup or… whatever it’s doing… I have no idea. So that’s another thing for me to investigate in the coming weeks.
In AI War Classic, the garbage collector running periodically was the biggest source of hitching. Thanks to our massive refactoring, we’re generating very little waste in AI War 2. There are some errant GC allocs that we’ll clean up later (you’d like an actual game first, we presume, heh), but those are all very minor and would fall under the banner of premature optimization right now.
At the moment, we’re keeping things clean simply by using clean coding practices and our hard-won knowledge of strange things that cause mono (but not .NET itself) to generate garbage (see: foreach statement).
Figured this was worth a writeup since some folks are going to be getting into the game soon, and the first question I would have is “what’s with this hitching?” Ironically, the worse your computer is, the less hitching you’ll have right now. 😉
But anyway, I figured an explanation of what it is as well as why it’s going to be going away very-soon was warranted. Cheers!
Spent a fair bit of time yesterday, and then this morning, looking at the speed of some of our most core alternate-math work. Floating-point math is too slow, and it can also give slightly different results on different computer architectures.
So back around 2009, I switched to fixed-int math in AI War on the advice of my friend Lars Bull. There weren’t any C# implementations at the time, so I wound up porting some professor’s Java implementation over for my purposes.
Over the years, Keith LaMothe (Arcen’s other programmer) and I have made a variety of improvements to it. We also had to re-implement a lot of trigonometry functions and things like square root, but that’s okay because performance stinks on those anyway in general. So we took a middle-ground between performance and precision.
Our versions give 100% reliable results, but they only give a certain number of significant digits – which is perfect for fixed-int math anyhow, because that inherently has a digit limit of (in our case) three places after the decimal.
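For anyone unfamiliar with the idea, here is a minimal fixed-point sketch in Python. It is not the actual FInt class: the `Fixed` name, its methods, and the choice of `math.isqrt` for the square root are all assumptions made for illustration. Values are stored as a whole number of thousandths, so every operation is plain integer math and gives the same answer on every machine.

```python
import math

SCALE = 1000  # three places after the decimal, as described in the post


class Fixed:
    """Hypothetical fixed-point number stored as an integer count of thousandths."""

    def __init__(self, raw):
        self.raw = raw

    @classmethod
    def from_int(cls, value):
        return cls(value * SCALE)

    def __add__(self, other):
        return Fixed(self.raw + other.raw)

    def __mul__(self, other):
        # The product carries SCALE twice, so divide once to get back to thousandths.
        return Fixed(self.raw * other.raw // SCALE)

    def __truediv__(self, other):
        # Pre-multiply by SCALE so the quotient is still in thousandths.
        return Fixed(self.raw * SCALE // other.raw)

    def sqrt(self):
        # Integer square root of raw * SCALE stays in thousandths (non-negative only).
        return Fixed(math.isqrt(self.raw * SCALE))
```

For example, `Fixed.from_int(2).sqrt()` holds a raw value of 1414, meaning 1.414, and 1 divided by 3 comes out as 333, meaning 0.333. Note that Python's `//` rounds toward negative infinity, which may differ from how a C# implementation handles negative values.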
Anyway, we haven’t really touched that code for years; we squeezed all the performance we could out of it long ago, and so we’ve been focused on multithreading, algorithm optimization, and a variety of other things.
Turns out that now things are SO efficient that some of the conversions and implicit-operators that we created for FInt (our fixed-int class) are now actually the bigger enemies again. There’s always a bottleneck somewhere, and these still aren’t the biggest one, but I managed to shave 2ms per frame off of a 12ms process just by working on the fixed-int math and on a new random number generator.
That’s very worth it! The only reason it was that high is that we’re talking about a lot of math calls in these particular frames: tens or hundreds of thousands, depending on what is happening.
Regarding random number generation, there’s always a war there between accuracy (how random is it, really) and speed. I’ve never been all that happy with System.Random in .NET, and in Mono the performance is even worse. Though benchmarks suggest that the Unity 3D version of it is actually more performant than either the basic .NET or Mono versions. So… way to go, guys!
Still – we’ve been using Mersenne Twister (algorithm by Makoto Matsumoto and Takuji Nishimura, with some credit to Topher Cooper and Marc Rieffel, and with a C# implementation by Akihilo Kramoto) since around 2009, and that’s both slightly faster and a lot more random. In the intervening years we made a few small tweaks to make it run even faster, but they’re incredibly minor and mostly don’t help much in the grand scheme. Still, every little bit, right?
The problem is, sometimes you need quality randomness, and sometimes you need ultrafast randomness. There are certain things (like delays on special effects appearing, or shots emitting) that really don’t need all that much quality. We’re talking about a delay that is on the order of a hundred or thousand ms that we’re calculating randomly, and there’s just only so much randomness that you can see there. It’s not like AI decisions, which need to be quality.
So today I ported in a version of Xorshift (discovered by George Marsaglia) based on an implementation by Colin Green. I implemented that as an alternative for places where we want superspeed and don’t care if we’re getting super-duper high quality randomness as an output.
Well – it delivers! As noted, between that and the FInt shifts that was 2ms off a 12ms process, which is huge in terms of percentages.
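To show why Xorshift is so cheap, here is the simplest 32-bit variant from Marsaglia's family, sketched in Python. This is not necessarily the variant that was ported, and the class and method names are made up for the example. Each new value costs three shifts and three XORs, with a mask standing in for 32-bit integer overflow.

```python
MASK32 = 0xFFFFFFFF


class Xorshift32:
    """Minimal 32-bit xorshift generator (shift triple 13, 17, 5)."""

    def __init__(self, seed):
        if seed & MASK32 == 0:
            raise ValueError("seed must be non-zero, or the state stays at zero forever")
        self.state = seed & MASK32

    def next_u32(self):
        x = self.state
        x ^= (x << 13) & MASK32   # mask emulates 32-bit wraparound
        x ^= x >> 17
        x ^= (x << 5) & MASK32
        self.state = x
        return x

    def next_int(self, low, high):
        """Value in [low, high); the slight modulo bias is fine for cosmetic delays."""
        return low + self.next_u32() % (high - low)


rng = Xorshift32(seed=1)
first = rng.next_u32()                 # 270369 for seed 1
delay_ms = rng.next_int(100, 1000)     # e.g. a random delay before an effect shows
```

The same seed always produces the same sequence, which is handy for cosmetic effects, though this sort of generator is a poor fit for anything where the quality of the randomness really matters.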