by mh » Thu May 19, 2011 8:45 pm
Just getting rid of sequential poly on its own isn't enough; you also need to sort properly by texture AND lightmap to get as many polygons as possible between each state change (this assumes multitexture). That's critical; GLQuake is deeply criminal in calling glBegin/glEnd in tight inner loops in so many places and that's a huge perf drain. BGRA is critical, and so is moving your lightmap updates out of the main rendering loop. GLQuake also does the worst possible thing ever here:
- draw with a lightmap
- stall the pipeline so that the draw can finish before updating
- update a tiny subrect of it
- stall the pipeline so that the update can finish before drawing with it again
Several hundred or thousand times per frame. Again, multitexture path only. (Is there any point in even supporting a single texture path in a modern engine? If you've got a Voodoo 2 or better you've got multitexture. Dump the legacy, clean up your code, give yourself fewer headaches, less bugfixing and less code duplication.)
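A rough sketch in C of what "moving your lightmap updates out of the main rendering loop" can look like: relight on the CPU first, grow one dirty rectangle per lightmap page, then collect at most one upload per page before the first draw call. Every name here (`LM_MarkDirty`, `LM_CollectUploads`, `lm_upload_t`, `MAX_LM_PAGES`) is made up for illustration and is not GLQuake's, and the GL upload itself is only indicated in a comment.

```c
#include <stdbool.h>

#define MAX_LM_PAGES 64          /* assumed number of lightmap pages */

/* One pending dirty rectangle per lightmap page (hypothetical layout). */
typedef struct {
    bool dirty;
    int  x0, y0, x1, y1;         /* half-open: [x0,x1) x [y0,y1) */
} lm_dirty_t;

/* One upload to perform: page index plus the subrect to send. */
typedef struct {
    int page, x, y, w, h;
} lm_upload_t;

static lm_dirty_t lm_dirty[MAX_LM_PAGES];

/* Call while relighting surfaces on the CPU, BEFORE anything is drawn.
   Grows the page's dirty rectangle to cover the surface's lightmap rect. */
void LM_MarkDirty(int page, int x, int y, int w, int h)
{
    if (page < 0 || page >= MAX_LM_PAGES)
        return;

    lm_dirty_t *d = &lm_dirty[page];
    if (!d->dirty) {
        d->dirty = true;
        d->x0 = x;     d->y0 = y;
        d->x1 = x + w; d->y1 = y + h;
        return;
    }
    if (x < d->x0)     d->x0 = x;
    if (y < d->y0)     d->y0 = y;
    if (x + w > d->x1) d->x1 = x + w;
    if (y + h > d->y1) d->y1 = y + h;
}

/* Call once per frame, after all relighting and before the first draw.
   Writes one entry per dirty page and clears its dirty flag.
   Pages that do not fit in 'out' stay dirty for the next call.
   The caller would then issue one glTexSubImage2D per returned entry. */
int LM_CollectUploads(lm_upload_t *out, int max_out)
{
    int n = 0;
    for (int i = 0; i < MAX_LM_PAGES && n < max_out; i++) {
        lm_dirty_t *d = &lm_dirty[i];
        if (!d->dirty)
            continue;
        out[n].page = i;
        out[n].x = d->x0;
        out[n].y = d->y0;
        out[n].w = d->x1 - d->x0;
        out[n].h = d->y1 - d->y0;
        d->dirty = false;
        n++;
    }
    return n;
}
```

That way the worst case is one upload per lightmap page per frame, all done before drawing starts, instead of an update/draw ping-pong per surface. If the CPU-side copy is a full page, set GL_UNPACK_ROW_LENGTH to the page width when uploading a subrect, or just upload full-width rows.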
Any engine which follows GLQuake closely enough in this part of its code invariably ends up doing the very same thing, which is one of the reasons why some engines are known to get horrifically slow when dynamic lights go off. (Another reason is using GL_RGB for the lightmap format - OpenGL needs to convert this to BGRA in software before it can upload. Skip the conversion step and just use BGRA natively. And no, BGR isn't enough. Textures are 4 components, GPUs can't address in groups of 24 bits. GL_UNSIGNED_INT_8_8_8_8_REV instead of GL_UNSIGNED_BYTE will get you a direct DMA transfer, but you only notice the perf gain on really bad hardware as BGRA on its own removes 99% of the bottleneck in texture uploads on anything halfway decent.)
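For the BGRA point, a minimal sketch in C of building lightmap texels as packed 32-bit values laid out for GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV (blue in the low byte, alpha in the high byte). The function names and the plain `int` RGB input are assumptions for illustration; a real engine would feed this from its own light accumulation buffer.

```c
#include <stdint.h>

/* Clamp an accumulated light value to the 0..255 range. */
static uint32_t Clamp8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint32_t)v;
}

/* Pack one texel for GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV.
   With that format/type pair the first component (blue) sits in
   bits 0-7, then green, then red, and alpha in bits 24-31,
   i.e. the value reads as 0xAARRGGBB. */
uint32_t LM_PackBGRA(int r, int g, int b)
{
    return 0xFF000000u
         | (Clamp8(r) << 16)
         | (Clamp8(g) << 8)
         |  Clamp8(b);
}

/* Convert 'texels' interleaved r,g,b ints into packed BGRA texels
   (hypothetical helper). */
void LM_ConvertBlock(const int *rgb, uint32_t *dest, int texels)
{
    for (int i = 0; i < texels; i++)
        dest[i] = LM_PackBGRA(rgb[i * 3 + 0], rgb[i * 3 + 1], rgb[i * 3 + 2]);
}

/* The upload would then be something along the lines of:
     glTexSubImage2D(GL_TEXTURE_2D, 0, x, y, w, h,
                     GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, dest);
   against a texture created with a 4-component internal format. */
```

Packing whole 32-bit words rather than writing individual bytes also keeps the layout the same regardless of host byte order, since the _REV type is defined in terms of bits within the integer.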
Break away from that crap and you likely don't even need vertex arrays. They'll definitely get you more performance, but you may find that you're fast enough for your own needs without them.
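The sort by texture AND lightmap mentioned earlier can be sketched in C roughly like this: sort the visible surfaces on a (texture, lightmap) key so that each state change covers as many polygons as possible. The `drawsurf_t` struct and function names are hypothetical stand-ins, not GLQuake's `msurface_t`.

```c
#include <stdlib.h>

/* Minimal stand-in for a visible surface (hypothetical fields). */
typedef struct {
    int texture;    /* diffuse texture id */
    int lightmap;   /* lightmap page index */
    int surf_index; /* which world surface this refers to */
} drawsurf_t;

/* Order by texture first, then by lightmap within the same texture. */
static int CompareSurfs(const void *a, const void *b)
{
    const drawsurf_t *sa = a;
    const drawsurf_t *sb = b;

    if (sa->texture != sb->texture)
        return (sa->texture > sb->texture) - (sa->texture < sb->texture);
    return (sa->lightmap > sb->lightmap) - (sa->lightmap < sb->lightmap);
}

void SortSurfs(drawsurf_t *surfs, size_t count)
{
    qsort(surfs, count, sizeof *surfs, CompareSurfs);
}

/* Count how many texture/lightmap binds a draw loop would need if it
   only rebinds when the key changes from one surface to the next. */
int CountStateChanges(const drawsurf_t *surfs, size_t count)
{
    int changes = 0;
    for (size_t i = 0; i < count; i++) {
        if (i == 0 ||
            surfs[i].texture  != surfs[i - 1].texture ||
            surfs[i].lightmap != surfs[i - 1].lightmap)
            changes++;
    }
    return changes;
}
```

Running `CountStateChanges` before and after `SortSurfs` on a frame's surface list is a cheap way to see how much the sort is buying you.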
Another VBO-related trick. Put an entire brush model into a VBO. Sorted by texture and lightmap. Draw the entire thing in a coupla calls (you'll need glDrawElements for that). Don't bother with backface culling (SURF_PLANEBACK stuff); the GPU will do that for you anyway, and since the data is already on the GPU you're not saving bandwidth by culling surfaces. You do spend some extra on the vertex pipeline, and a handful of extra bytes for command buffer entries, but on balance it's about 1.5 to 2 times as fast in scenes with heavy use of brush models.
You should pick a cutoff point here where you fall back on the old way; something like models with fewer than 20 surfs in them worked out well in my own tests.
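A sketch in C of the load-time side of the brush model trick, under a few assumptions: surfaces are convex polygons stored as fans in one shared vertex array, they have already been sorted by texture and lightmap, and 16-bit indices are enough. It builds one triangle index list plus a batch per (texture, lightmap) run, so each batch becomes a single glDrawElements call. The struct and function names, and the `VBO_MIN_SURFS` constant, are illustrative only; the value 20 is just the cutoff quoted above.

```c
#include <stddef.h>

#define VBO_MIN_SURFS 20   /* below this, fall back on the old path */

/* Hypothetical minimal surface: a polygon fan in a shared vertex array. */
typedef struct {
    int texture, lightmap;
    int first_vert, num_verts;
} bsurf_t;

/* One draw call's worth of indices sharing the same state. */
typedef struct {
    int texture, lightmap;
    int first_index, num_indices;
} batch_t;

/* Returns nonzero if the model is big enough for the VBO path. */
int UseVBOPath(int num_surfs)
{
    return num_surfs >= VBO_MIN_SURFS;
}

/* 'surfs' must already be sorted by (texture, lightmap).
   'indices' needs room for 3 * (num_verts - 2) entries per surface,
   'batches' for up to num_surfs entries.
   Returns the batch count and stores the index count in *out_num_indices. */
int BuildBatches(const bsurf_t *surfs, int num_surfs,
                 unsigned short *indices, batch_t *batches,
                 int *out_num_indices)
{
    int nb = 0, ni = 0;

    for (int i = 0; i < num_surfs; i++) {
        const bsurf_t *s = &surfs[i];

        /* Start a new batch whenever texture or lightmap changes. */
        if (nb == 0 ||
            batches[nb - 1].texture  != s->texture ||
            batches[nb - 1].lightmap != s->lightmap) {
            batches[nb].texture     = s->texture;
            batches[nb].lightmap    = s->lightmap;
            batches[nb].first_index = ni;
            batches[nb].num_indices = 0;
            nb++;
        }

        /* Turn the fan into triangles: (0, v, v+1) for v = 1..n-2. */
        for (int v = 1; v + 1 < s->num_verts; v++) {
            indices[ni++] = (unsigned short)(s->first_vert);
            indices[ni++] = (unsigned short)(s->first_vert + v);
            indices[ni++] = (unsigned short)(s->first_vert + v + 1);
            batches[nb - 1].num_indices += 3;
        }
    }

    *out_num_indices = ni;
    return nb;
}

/* At draw time, with the vertex and index buffers bound, each batch is
   roughly: bind texture, bind lightmap, then
     glDrawElements(GL_TRIANGLES, b->num_indices, GL_UNSIGNED_SHORT,
                    (const void *)(b->first_index * sizeof(unsigned short)));
   No SURF_PLANEBACK test is needed; GPU face culling handles it. */
```

Since nothing here depends on the view, all of it can be done once at load time and the per-frame cost drops to a handful of binds and draw calls per brush model.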