Tuesday, March 18, 2008
The proof of the pudding...
Here's the entrance to the main central hall in E1M3 with some torch coronas shining. Notice how every visible torch has a corona which is (correctly) not depth tested, while the two torches off to the right (beside the gold key door) show none.
If the obscuring wall were a brush model, the coronas would still be correctly hidden; a traceline-based test would get that case wrong.
This gives the same end result as hardware occlusion queries, but it's faster and will work on a card that doesn't support them.
Posted by mhquake at 10:17 PM | 3 comments
Monday, March 17, 2008
Entity Occlusion
It's there and it works. Pretty damn beautifully, even if I must say so myself.
I was able to get the Z buffer updates down to 10 times per second, which gave another handy speed boost. I could have got away with 5, but the extra gain was marginal, so I decided to run with the extra accuracy.
Here's the code: a lot of it is engine-specific, but if you can pull anything useful from it, you're welcome to it.
float r_z_update_time = 0.0f;
#define Z_UPDATE_INTERVAL 0.1f
#define Z_UPDATE_SIZE 64
// software z buffer
float zBuf[Z_UPDATE_SIZE * Z_UPDATE_SIZE];
/*
==================
R_ProjectPoint
project a point from world co-ordinates to screen coordinates
==================
*/
void R_ProjectPoint (vec3_t vin, vec3_t vout)
{
float fvin[4] = {vin[0], vin[1], vin[2], 1};
float fvout[4];
float *mm = r_world_matrix;
float *mp = r_world_project;
// transform our points - fvin will hold the final transformation
fvout[0] = mm[0x0] * fvin[0] + mm[0x4] * fvin[1] + mm[0x8] * fvin[2] + mm[0xc] * fvin[3];
fvout[1] = mm[0x1] * fvin[0] + mm[0x5] * fvin[1] + mm[0x9] * fvin[2] + mm[0xd] * fvin[3];
fvout[2] = mm[0x2] * fvin[0] + mm[0x6] * fvin[1] + mm[0xa] * fvin[2] + mm[0xe] * fvin[3];
fvout[3] = mm[0x3] * fvin[0] + mm[0x7] * fvin[1] + mm[0xb] * fvin[2] + mm[0xf] * fvin[3];
fvin[0] = mp[0x0] * fvout[0] + mp[0x4] * fvout[1] + mp[0x8] * fvout[2] + mp[0xc] * fvout[3];
fvin[1] = mp[0x1] * fvout[0] + mp[0x5] * fvout[1] + mp[0x9] * fvout[2] + mp[0xd] * fvout[3];
fvin[2] = mp[0x2] * fvout[0] + mp[0x6] * fvout[1] + mp[0xa] * fvout[2] + mp[0xe] * fvout[3];
fvin[3] = mp[0x3] * fvout[0] + mp[0x7] * fvout[1] + mp[0xb] * fvout[2] + mp[0xf] * fvout[3];
// prevent division by 0
if (fvin[3] == 0.0) fvin[3] = 0.000001;
// normalize
fvin[0] /= fvin[3];
fvin[1] /= fvin[3];
fvin[2] /= fvin[3];
// map x and y to range 0..1, then scale to buffer dimensions
vout[0] = (fvin[0] * 0.5 + 0.5) * Z_UPDATE_SIZE;
vout[1] = (fvin[1] * 0.5 + 0.5) * Z_UPDATE_SIZE;
// scale to the depth range we're using
vout[2] = (fvin[2] * 0.25 + 0.75);
// move points outside the image into the image
if (vout[0] < 0) vout[0] = 0;
if (vout[0] >= Z_UPDATE_SIZE) vout[0] = Z_UPDATE_SIZE - 1;
if (vout[1] < 0) vout[1] = 0;
if (vout[1] >= Z_UPDATE_SIZE) vout[1] = Z_UPDATE_SIZE - 1;
}
/*
==================
R_ProjectBBox
project a bounding box from world coordinates to screen coordinates, then take a 2D
"bounding box of the bounding box" for use in the occlusion culling tests
==================
*/
void R_ProjectBBox (float *mins, float *maxs, float *minsout, float *maxsout)
{
int i;
// seed the extents with extreme values so any projected corner will replace them
minsout[0] = minsout[1] = minsout[2] = 999999999;
maxsout[0] = maxsout[1] = maxsout[2] = -999999999;
for (i = 0; i < 8; i++)
{
vec3_t bboxptin;
vec3_t bboxptout;
// get the correct corner to use
bboxptin[0] = (i & 1) ? mins[0] : maxs[0];
bboxptin[1] = (i & 2) ? mins[1] : maxs[1];
bboxptin[2] = (i & 4) ? mins[2] : maxs[2];
// project to screen
R_ProjectPoint (bboxptin, bboxptout);
// store min and max
if (bboxptout[0] < minsout[0]) minsout[0] = bboxptout[0];
if (bboxptout[1] < minsout[1]) minsout[1] = bboxptout[1];
if (bboxptout[2] < minsout[2]) minsout[2] = bboxptout[2];
if (bboxptout[0] > maxsout[0]) maxsout[0] = bboxptout[0];
if (bboxptout[1] > maxsout[1]) maxsout[1] = bboxptout[1];
if (bboxptout[2] > maxsout[2]) maxsout[2] = bboxptout[2];
}
}
int R_BoxInFrustum (vec3_t mins, vec3_t maxs);
void R_RunOccludeEntityTest (entity_t *ent, vec3_t mins, vec3_t maxs)
{
vec3_t screen_mins, screen_maxs;
int x;
int y;
R_ProjectBBox (mins, maxs, screen_mins, screen_maxs);
for (y = (int) screen_mins[1]; y <= (int) screen_maxs[1]; y++)
{
int p = y * Z_UPDATE_SIZE;
for (x = (int) screen_mins[0]; x <= (int) screen_maxs[0]; x++)
{
if (zBuf[p + x] > screen_mins[2])
{
// not occluded
ent->occluded = false;
return;
}
}
}
// occluded
ent->occluded = true;
}
void R_RunOcclusionTest (void)
{
int i;
entity_t *ent;
vec3_t mins, maxs;
if (!r_worldentity.model || !cl.worldmodel) return;
for (i = 0; i < cl_numvisedicts; i++)
{
ent = cl_visedicts[i];
// not occluded
ent->occluded = false;
switch (ent->model->type)
{
case mod_brush:
case mod_alias:
case mod_sprite:
// compute the entity's world-space bounding box
VectorAdd (ent->origin, ent->model->mins, mins);
VectorAdd (ent->origin, ent->model->maxs, maxs);
// do the bbox cull here
if (R_BoxInFrustum (mins, maxs) == FRUSTUM_OUTSIDE)
{
// outside the frustum; treat as occluded
ent->occluded = true;
}
else
{
// test for regular occlusion
R_RunOccludeEntityTest (ent, mins, maxs);
}
break;
default:
break;
}
}
}
void R_CaptureDepth (void)
{
texture_t *t;
extern texture_t *texturelist;
extern float r_farclip;
// accumulate update time always
r_z_update_time += r_frametime;
// don't update if it's not time to do so yet
if (r_z_update_time < Z_UPDATE_INTERVAL && r_framecount > 5) return;
// begin the timer again
r_z_update_time = 0;
// render at Z_UPDATE_SIZE x Z_UPDATE_SIZE in the bottom-right corner
// create the viewport for the capture
R_SetupGLViewport (vid.glwidth - (Z_UPDATE_SIZE * 2), Z_UPDATE_SIZE, Z_UPDATE_SIZE, Z_UPDATE_SIZE, r_refdef.fov_y, 4, r_farclip);
// store modelview and projection matrices for reuse
// fixme - do this in software to prevent a sync-wait
glGetFloatv (GL_MODELVIEW_MATRIX, r_world_matrix);
glGetFloatv (GL_PROJECTION_MATRIX, r_world_project);
// set up the depth range for the capture
// we can use a good chunk of the depth buffer here
glDepthFunc (GL_LEQUAL);
glDepthRange (0.5f, 1.0f);
glDepthMask (GL_TRUE);
// shut down everything we don't need for this
glDisable (GL_TEXTURE_2D);
glColorMask (GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
// base vertex arrays
vaEnableVertexArray (3);
for (t = texturelist; t; t = t->texturelist)
{
// get the texture chain
msurface_t *surf = t->texturechain;
// no surfs in use
if (!surf) continue;
// skip surf types that don't belong in the depth buffer
// (liquid surfs are skipped unless they're opaque)
if ((surf->flags & SURF_DRAWTURB) && !(surf->flags & SURF_DRAWOPAQUE)) continue;
if (surf->flags & SURF_DRAWSKY) continue;
// walk the chain
for (; surf; surf = surf->texturechain)
{
glpoly_t *p;
// draw polys here as we're sending some liquids through it too
for (p = surf->polys; p; p = p->next)
{
int i;
glvertex_t *v;
vaBegin (GL_TRIANGLE_FAN);
for (i = 0, v = p->verts; i < p->numverts; i++, v++)
vaVertex3fv (v->tv);
vaEnd ();
}
}
}
// done with the render
vaDisableArrays ();
// capture the depth buffer
// per the spec, this scales to a 0..1 range, irrespective of the actual depth range
// (http://www.opengl.org/documentation/specs/man_pages/hardcopy/GL/html/gl/readpixels.html)
// but this is a lie...
glReadPixels (vid.glwidth - (Z_UPDATE_SIZE * 2), Z_UPDATE_SIZE, Z_UPDATE_SIZE, Z_UPDATE_SIZE, GL_DEPTH_COMPONENT, GL_FLOAT, zBuf);
// bring stuff back up
glEnable (GL_TEXTURE_2D);
glColorMask (GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
// glColorMask leaves the current colour state undefined
glColor3f (1, 1, 1);
}
Posted by mhquake at 6:27 PM | 0 comments
Sunday, March 16, 2008
Premature Optimization
One of the golden rules of programming is to avoid premature optimization, yet with my Z buffer capture I have not only optimized prematurely, I deliberately set out to do so. This was for a number of reasons.
Occlusion queries are a fairly tried and trusted technique, but the current implementation requires a pipeline commit before you can read back the data. For multiple entities, where you only want entities tested against world geometry (i.e. not against other entities), that translates into multiple pipeline commits. On one of my test machines, the net result is that implementing hardware occlusion queries causes a drop from 230 FPS to 170. In many cases this is acceptable, as you win the cost back by skipping subsequent rendering. In Quake, the saving is not sufficient to justify the loss.
The primary goal of this technique was to implement occlusion; the secondary goal was to do so without any appreciable performance loss. Achieving the primary goal is relatively easy: any first-year student could write the code, and even a non-programmer could sketch out the basics. Pythagoras knew how to do it. Doing it in an acceptable, feasible real-time system, however, is not easy.
As soon as I knew that I wasn't going to get a fully-in-software implementation working, and as soon as I made the decision to walk away from even attempting it, I knew that any solution would have to be optimized like crazy. At that point I had fully intended to abandon it completely, but something about glReadPixels and GL_DEPTH_COMPONENT kept nagging at the back of my brain. The performance loss from glReadPixels comes from two places: the pipeline commit and the actual read back. If the impact of these could be minimized, I would have a viable solution.
From then on it was a case of 2 + 2 = 4. Since a highly optimized solution was part of the basic requirement, it followed that any initial prototype would have to be optimized from the outset. Otherwise there was no point in even continuing beyond the depth buffer capture stage. So in order to meet the basic requirement, I broke the rules.
I suppose that the moral of this story is that the old rule of "premature optimization == bad" still stands as a good general rule, but it's important to realize that it's not universally applicable, and that when cases arise where optimization is required even at proof-of-concept stage, you need to sit down and consider whether or not it actually is premature.
Posted by mhquake at 7:16 PM | 0 comments
Saturday, March 15, 2008
More on Linux
I've been following some discussions over on The Daily WTF with interest. It started out as a simple question on why Windows didn't include a command-line "sleep" tool in a default installation (which was quickly answered), but fairly quickly degenerated into a mud-slinging contest. It's only a matter of time before Godwin's Law is invoked.
One interesting thing about these discussions is that it's normally the Linux devotees who do most of the mud-slinging (although in this particular case, there are some reasonable people who seem to have thought things through on that side of the fence). I couldn't even begin to list the number of anti-Windows arguments I've seen in the past that have been based on things that may have been true in 1992 (but are no longer), or that are outright falsehoods.
This is sad, because Linux has lots of strengths in lots of important areas. Yet its adherents seem either unable or unwilling to sell it on those strengths, and instead resort to highlighting weaknesses (or perceived weaknesses) of the competition. It gives the impression to an outsider looking in that they really don't have confidence in their favoured platform; that they view it as "the best of a bad lot" rather than "the best, period".
Why is this, I wonder?
There has been a colossal push to get Linux established as a viable alternative desktop platform, but even its most loyal devotees (I'm excluding the rabid/fanatical types here) would admit that it's still not ready. Ubuntu is probably the nearest, but it is still riddled with quirks and difficulties that would be deal-breakers for the typical desktop user. This is all deeply rooted in Unix culture, where there is an implicit assumption that the person using an OS is intimately familiar with its inner workings. This is no longer the case, and hasn't been for well over a decade.
For both platforms there is a Wall that the Hypothetical Typical User will eventually hit, beyond which they cannot progress without making an effort to improve their skills (this may come as a surprise to some, but most HTUs have no inclination whatsoever to improve their skills). Linux places that Wall far far nearer to the user than Windows does. A lot of energy has been wasted in the Linux camp on hot air, hyperbole, FUD and scare tactics. Maybe it's about time that this energy was redirected into something positive and productive, of benefit to everyone, and directed at achieving the goal of putting Linux on the desktop. Like pushing that Wall back.
Or maybe, underneath it all, the Linux camp (who seem to be more motivated by ideology than by practicalities) are really just plain old fashioned not interested in getting there?
Posted by mhquake at 1:36 PM | 0 comments
Thursday, March 13, 2008
Z Buffer Capture
Got it working :)
The screenshot on the right shows the capture for the start hall. I've mangled the intensities (and inverted the range) so that you can see things a bit more clearly. The image is also somewhat larger than I will use in production; again, this is just for demonstration purposes.
Further performance optimizations: I now only capture the Z buffer under the following conditions:
- The viewleaf has changed (always capture, no matter what).
- We're in the first few frames of the map.
- The view origin or angles have changed significantly.
- 0.1 seconds have passed (i.e. capture at 10 FPS).
It's good to be back on track with this.
Posted by mhquake at 9:34 PM | 0 comments
Wednesday, March 12, 2008
glReadPixels benchmarks
I've been benchmarking various implementations of glReadPixels, to get a feel for what kind of performance hit I'm going to take by using it for getting the depth buffer. It's not that bad at all. Here's some findings and observations:
- glReadPixels performance is comparable to using occlusion queries. The performance hit is almost identical. As I'm going to be coding this path anyway, I now have no reason to code a second path that uses occlusion queries. I'll be going glReadPixels all the way.
- There is no difference between placing the glReadPixels call before the main render or after it. Since I'm not using glFinish, I deduce that a full pipeline flush is happening in both cases.
- Placing a few glFlush calls throughout the main render (e.g. after each texture chain and alias model that is rendered) can dramatically reduce the performance impact, as there is less of a pipeline stall when the time comes to do the glReadPixels. This is the single most effective thing that helps performance - without glFlush I lose 23% FPS, with it I only lose 5%. It's well worth investigating this further to find the optimal number of calls (and the best places to put them).
- I'm using Jay Dolan's recursion avoidance technique, so I'm only doing a glReadPixels on each frame that I do a full recursion on; otherwise I assume that the previous frame's depth buffer is good to work with. This doesn't help performance as much as the glFlush technique (about 1% to 2% gain).
- There is no difference between reading a 10 x 10 chunk and a 64 x 64 chunk; the performance impact comes more from the pipeline flush than the size of the buffer that is read back.
- I only do a glReadPixels every other frame, the rationale being that there's not going to be much in the way of difference between the depth buffers for 2 consecutive frames (at least for the purposes of this exercise). This virtually eliminates the performance hit.
- Benchmarks were all done using a timedemo, meaning that the recursion avoidance and usefulness of glFlush will be somewhat less than in real gameplay.
The next part will involve actually putting something into the depth buffer that is captured!
Posted by mhquake at 11:50 PM | 0 comments
Tuesday, March 11, 2008
No software occlusion
It's still not working out for me, so I'm going to bite the bullet and accept a performance penalty. This has just delayed further work for far too long, and I have spent too much time on reading and research, with very little to show for it.
The current plan is to do it in hardware. I have two ideas, one of which is traditional hardware occlusion queries, the other of which is very non-traditional. I'm interested in comparing the performance penalty for each, as I think that with the non-traditional approach I can do a lot of useful stuff to colossally minimize the impact, and potentially outperform occlusion queries.
I'm going to outline my non-traditional approach here, as it's basically using the hardware to accomplish what I had originally intended doing in software.
- Render the entire scene into a small viewport. I think I can get away with something extremely small here, on the order of 128 x 128.
- This render is done with texturing, colour and anything else that can be, switched off. All we're interested in getting is the depth information.
- Use glReadPixels to get the depth buffer back into software. This will implicitly flush the pipeline and stall during the read operation, but I'm fairly confident that at the point I do the readback, the flush will have minimal impact. The smaller size of the scene will help to minimize the impact of the readback too. I'm interested here in comparing multiple glReadPixels operations (e.g. of only the portions of the viewport that we actually need for each entity) versus a single glReadPixels operation. The pipeline flush is going to happen anyway, but the former method may minimize the stalls.
- Everything else happens in software; computing of entity bounding boxes in view space and comparing them with the read back depth buffer information. I already have this code written.
Posted by mhquake at 11:01 PM | 0 comments
Saturday, March 8, 2008
D'oh!
Homer Simpson would be proud of me. All of the work I had done on software occlusion testing was completely down the wrong path. I had inverted the order of operations from what actually makes sense, had been doing a lot of unnecessary un-projection, and had completely missed the most obvious and simple way of doing a critical part of the process (which was sitting under my nose, jumping up and down yelling "here, here, look at me!" all along). Therein lies the value of taking a break, doing something completely different for a while, and letting Zen stuff work its magic.
It's back on track now. This baby is gonna work, and not only that, but it is gonna be screamingly fast.
Posted by mhquake at 10:04 PM | 0 comments
Sunday, March 2, 2008
Took a break...
The occlusion testing has me totally stumped at this stage, I'm afraid. The main problem is that I can certainly do it, but it's just too slow. While I could use the old traceline-based idea, that's unfortunately no good at all with brush models, which means it's no good for me, as I intend to use this for drawing torch coronas properly. It's also essential for sprites.
I could use hardware occlusion tests, but they stall the pipeline as they have to read data back - aaaaarrrgghhh!
Logic tells me that a fast software implementation is possible, but right now it's been holding things up for too long, so no coronas and rubbish-looking sprites for now, I'm afraid.
Posted by mhquake at 9:22 PM | 1 comment