I took some time today to dig through the documentation on IDirect3DQuery9, and now have the naive version of occlusion queries successfully implemented in DirectQ. As I suspected, I was doing things the wrong way, and the information on what the right way was had been buried a little (although fortunately not as deeply as I had originally feared).
With the naive version we're basically issuing the query and testing for its result in the same frame (immediately after issue, in fact, so there is no delay to allow the query to flush). This requires completely flushing the command buffer and stalling the GPU while fetching the result, which lops a clean 100 FPS off ID1 timedemos and effectively wipes out the speed increases I had gained since the alpha release in complex scenes. Useless, in other words.
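To make that concrete, the naive pattern looks roughly like this (a sketch only: device, DrawEntityBBox and DrawEntity are placeholder names, but Issue/GetData and D3DGETDATA_FLUSH are the actual IDirect3DQuery9 calls):

    // naive occlusion test: issue the query and spin on its result in the
    // same frame; D3DGETDATA_FLUSH drains the command buffer, which is
    // exactly the GPU stall described above
    IDirect3DQuery9 *query = NULL;
    device->CreateQuery (D3DQUERYTYPE_OCCLUSION, &query);

    query->Issue (D3DISSUE_BEGIN);
    DrawEntityBBox (ent);    // 12-triangle bounding box
    query->Issue (D3DISSUE_END);

    DWORD visiblepixels = 0;

    while (query->GetData (&visiblepixels, sizeof (DWORD), D3DGETDATA_FLUSH) == S_FALSE)
        ;    // spin until the GPU catches up

    if (visiblepixels > 0) DrawEntity (ent);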
The next step is to implement the non-naive version. IDirect3DQuery9 is essentially set up as a mini state machine, but it does not provide a means of testing its state outside of directly querying the query (!), so I'm going to wrap it in either a struct or a class that will more or less manage that aspect of it for me.
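Something along these lines is what I have in mind (just a sketch to illustrate the idea; the struct name, the state enum and CheckResult are all made up on the spot and not final):

    // wrapper that tracks the query's state so we never have to ask the
    // query itself where it's at
    typedef enum {OQ_IDLE, OQ_ISSUED, OQ_PENDING} oqstate_t;

    struct d3d_occlusionquery_t
    {
        IDirect3DQuery9 *Query;
        oqstate_t State;
        DWORD VisiblePixels;

        void Begin (void)
        {
            if (State != OQ_IDLE) return;
            Query->Issue (D3DISSUE_BEGIN);
            State = OQ_ISSUED;
        }

        void End (void)
        {
            if (State != OQ_ISSUED) return;
            Query->Issue (D3DISSUE_END);
            State = OQ_PENDING;
        }

        // poll without flushing; only go back to idle when the GPU has
        // actually delivered a result
        bool CheckResult (void)
        {
            if (State != OQ_PENDING) return false;

            if (Query->GetData (&VisiblePixels, sizeof (DWORD), 0) == S_OK)
            {
                State = OQ_IDLE;
                return true;
            }

            return false;
        }
    };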
If this doesn't bear fruit, I have found an article on software occlusion which treats each occluding object as a frustum and then uses frustum culling to manage the occlusion. Obviously a far more complex setup (not least because occluding objects can have more than 4 sides); at the very least I would need to spin it off in its own thread and run it concurrently with the main render.
If you've turned on r_speeds 1 with the alpha release, you may have noticed that the counts are quite a lot lower than with GLQuake. What I did here was increment each count once per DrawPrimitive call rather than once per polygon rendered. With hindsight this was the wrong thing to do, and DirectQ r_speeds should be consistent with GLQuake (if for no other reason than to show the performance differences for any given set of values). I'm going to restore this to the way it should be for the beta.
I'll just make the excuse that I was interested in seeing how many Draw* API calls I had managed to save. ;)
UPDATE:
Aaaaahhhhh, success. Sweet sweet success.
Here's the deal:
- I currently only have occlusion queries on static entities. It would be slightly less trivial to put them on permanent entities, but I'm going to go for it anyway - it's not a deal-breaker if I fail though (they don't go on the viewmodel at all - for obvious reasons).
- It's probably not worth the bother putting them on temp entities. Most of these will be knight, vore or wizard spikes, or nails or bolts, and will therefore pass the tests anyway.
- Trivial entities don't get occlusion queries run against them at all. Running queries requires some state-changes (which I can batch up, but they're still there) plus drawing a bounding box for each entity tested. The bounding box requires 12 triangles, so I've set the "trivial mark" at 24 triangles in the model just to be sure.
- By their nature the results will lag a few frames behind what you see on screen (there's a rough sketch of how this works after this list). In practice you don't notice at all; there are no models suddenly popping into view or anything like that. The alternative (the naive implementation) is worse.
- I don't run occlusion queries at all during timedemos, so don't go looking for timedemo speedups here because you won't get them. There are two reasons: first, they actually slow timedemos down; and second, a timedemo runs so fast that by the time the results come in the scene has changed so much that they are well out of date.
- I currently don't have a cvar to disable them, but I'm inclined to add one.
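Roughly speaking, the per-entity logic each frame then boils down to something like this (a sketch with invented names - the entity_t fields, R_DrawBBox and so on - layered on the wrapper above; the real code also has the trivial-triangle-count and timedemo checks from the list):

    // per-entity, per-frame occlusion handling: results lag by however many
    // frames the GPU is behind, but we never block waiting for them
    void R_TestOcclusion (entity_t *ent)
    {
        d3d_occlusionquery_t *oq = ent->OcclusionQuery;

        // pick up last frame's (or an older) result if it has arrived
        if (oq->CheckResult ())
            ent->Occluded = (oq->VisiblePixels == 0);

        // only issue a new query once the previous one has been consumed
        if (oq->State == OQ_IDLE)
        {
            oq->Begin ();
            R_DrawBBox (ent);    // 12 triangles, color and depth writes off
            oq->End ();
        }

        // elsewhere, when drawing: if (!ent->Occluded) R_DrawAliasEntity (ent);
    }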
I now lock at 72 in those ne_tower scenes.
9 comments:
I wonder, since Quake models are fairly low poly anyways, if it's worth rendering the actual models instead of a bounding box / lower poly bounding mesh on a low res surface for occlusion testing... like in this example... combined with frustum culling maybe to exclude the entities that are not even in the player's POV anyways...
Congratz for reaching 72 fps on ne_tower btw! ;)
Just a quote from one of them GPU Gems articles I referred to earlier:
"29.6.1 CPU Overhead Too High
We need to account for the CPU overhead incurred when we send rendering requests to the GPU. If an object is easy to render, then it may be cheaper to simply render it than to draw the bounding box and perform the occlusion test. For example, rendering 12 triangles (a bounding box) and rendering around 200 triangles (or more!) takes almost the same time on today's GPUs. The CPU is the one that takes most of the time just by preparing the objects for rendering."
Not really. The specific model I wrote this to address was a torch that had 386 triangles and compressed to 831 verts. A frikkin' torch! And there were about 60 of them in the PVS (and frustum) in one scene...!
There actually is value in using the full model itself rather than a bbox if it was previously not occluded, but right now bboxes are so trivial to render (esp. with both color write and depth write switched off) that the gain would be marginal and the trade-off would be a heap of extra complexity in your renderer.
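To put a number on "trivial": the occlusion pass here is little more than a couple of render state changes wrapped around the box draws (a sketch; the actual batching in DirectQ differs):

    // disable color and depth writes around the bounding box draws so the
    // query proxies never touch the framebuffer or the depth buffer
    device->SetRenderState (D3DRS_COLORWRITEENABLE, 0);
    device->SetRenderState (D3DRS_ZWRITEENABLE, FALSE);

    // ... issue the queries and draw the 12-triangle boxes here ...

    device->SetRenderState (D3DRS_COLORWRITEENABLE,
        D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
        D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
    device->SetRenderState (D3DRS_ZWRITEENABLE, TRUE);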
...and believe me, the GPU Gems article can say what it likes but I got an extra 20 FPS out of this...
Reality beats theory any day. ;)
HA! Amen to that, preach on brotha! ;)
(thanks for satisfying my curiosity)
Some more explanation - I batch up alias models too. Those 60 torches? I draw all of them with a single API call. Achieving this means that I need to do matrix transforms in software on the CPU, so the CPU overhead of computing a bounding box is quite a bit lower.
Drawing the model and using that for the occlusion test would mean needing to break my batching so I'd lose frames from that.
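For illustration, the batching amounts to something like the following (an invented sketch, not the actual DirectQ code; all the names - TransformVerts, batchbuffer, batchvert_t and so on - are placeholders):

    // transform every instance's verts on the CPU into one big buffer,
    // then draw the whole lot with a single call
    int numverts = 0;

    for (int i = 0; i < numtorches; i++)
    {
        // software transform: model space to world space for this instance
        TransformVerts (&batchbuffer[numverts], torchmodel, torches[i]);
        numverts += torchmodel->numtris * 3;
    }

    device->DrawPrimitiveUP (D3DPT_TRIANGLELIST, numverts / 3,
        batchbuffer, sizeof (batchvert_t));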
The only real situation where the GPU Gems approach would apply is if there were only one of each model in a scene. It does happen, but it's rare enough, and the overhead of identifying that situation (and writing a second rendering path to deal with it) doesn't seem worth it.
In general I find articles like that good for some "how to" stuff, and maybe getting started, but the theory in them has a tendency to assume a situation that's already ideally set up for what they're covering.
Thanks for the suggestions anyway. :)
Sir,
For the non-programmers and general game players among us: when you write your blog posts, could you please give us a brief description of *what the things you write about and/or refer to actually are* - such as 'occlusion queries' and 'vertex buffers' - and how what you're doing with them here applies to us as game players?
For example, I didn't know what 'occlusion queries' actually were until I found this brief description of them: "Hardware Occlusion Queries (HOQ) have many uses. A HOQ is the process of determining whether one or more rendered objects are visible on a per-pixel level."
Brief descriptions like this one will make your blog a bit easier for the non-programmers among us to read and understand. Well... anyway, thank you for all the amazing, amazing work you're doing on this. I visit your blog site every day, and always look forward to reading about what you're working on for DirectQ. I'm looking forward to grabbing and playing the beta when it's released!
Sorry, sometimes I just get lost in my thoughts there. I also have a tendency to use posts as a platform for crystallizing ideas, which probably doesn't help either. ;)
These are some good tutorials explaining all these things...