Wednesday, January 27, 2010

I think the maps menu is going to go

I'm not sure if anybody uses it, and the fact that I have autocomplete on the map command makes it slightly (but not totally) redundant. Overall it seems to be a big chunk of complexity, and recent (mis)adventures with level names have rammed that home quite well.

Of course I'm interested in hearing if anyone would be put out by this, as the decision is not fully made yet. Worst case at the moment is that I just comment it out for 1.8.2, stand well back, and wait for the screams.

An alternative is that "map" without any args pops up something kinda similar; this may be more acceptable (but it would mean a certain amount of new code...)

If this works out reasonably well I might do the same with demos, and then start clearing out the entire advanced menus structure. Menus are something that hasn't been worked on for a while, and I'd like to strip some things back in the interests of simplicity, better maintainability and less room for weird bugs to creep in.

Like I said, if you have any opinion either way now is the time to shout before I become committed to a decision.

Update

So it looks like it's going to stay. That doesn't put me out too much; the important thing is to establish clarity on what people want and whether or not any continued investment on my part in maintaining it would be worthwhile.

This brings up the next question which is how to present it. Ever since I implemented the simple/advanced menu split I haven't been happy that things like the maps menu are now hidden in the advanced menus. The long term objective here is to put DirectQ in a position where the advanced menus won't be needed and can be removed.

Putting it in Single Player seems logical. Multiplayer already has its own map selector (which only applies if you're using DirectQ as a server - which you shouldn't). Connecting to a server will pop you into the map running on the server, so map selection doesn't apply.

Now... here's where it gets messy. The SP menu is a single image, and I'm not willing to provide a replacement image containing an extra option as it would break mod compatibility (if the mod supplies its own SP menu) and also break external menu replacement textures. I'm going to need to see if there's something I can use as an addition that's visually similar enough so as not to jar, that comes with ID Quake, and that makes sense in the context of it being a maps menu.

I would also hope to add a skill selection screen here, by the way...

To work!

Update 2

We're out of options here. I've gone through all of the available graphics and there's nothing suitable. I really don't want to mess things up on mods that provide their own menu graphics (enough engines do that already, let's not make things any worse), and I really do want to retain the original SP menu graphic. I might try adding some small-text options below it and see if we get much of a visual discontinuity.

Bug Alert!

If you're playing a map with a level name longer than 127 characters you can expect DirectQ to crash if you die, bring up the scoreboard, exit the level or finish the deathmatch. Depending on your OS and other factors it might even bring Windows down with it (not confirmed).

This will be fixed in 1.8.2; right now it doesn't warrant an emergency release as (1) these maps are quite rare, (2) they will likely crash other engines too, and (3) the max level name length supported by ID Quake is 40. In other words I currently view it as a map bug rather than a DirectQ bug, but still one that needs fixing for the next release.

The absolute maximum level name for 1.8.2 will be 1023 characters; this is enforced by the map specification, by COM_Parse, and by most compiling tools. If a map somehow manages to sneak past this limit the behaviour will be "undefined".

1.8.2 will also force-truncate level names for display in the scoreboard at whichever of 3 conditions comes first: (1) end of string, (2) a '\n' character, or (3) reaching a limit of 39 characters. This is to ensure that the scoreboard display doesn't screw up visually. The initial server message will still use the full "cute" name.
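As a rough sketch of that truncation rule (Python purely for illustration; the function and constant names are hypothetical and this is not DirectQ's actual code):

```python
SCOREBOARD_NAME_LIMIT = 39  # character limit quoted in the post


def scoreboard_level_name(full_name: str, limit: int = SCOREBOARD_NAME_LIMIT) -> str:
    """Cut a level name at the first newline or at `limit` characters."""
    # Condition (2): stop at the first '\n', if there is one.
    first_line = full_name.split("\n", 1)[0]
    # Conditions (1) and (3): slicing stops at the end of the string or at
    # the limit, whichever comes first.
    return first_line[:limit]
```

The full name is left untouched, so the initial server message can still print it in all its glory.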

Level names in saves are restricted to 20/22-odd characters - this is enforced by the Q1 save format. I'm not going to develop a custom (and incompatible) save format.

So far I haven't fully decided on how to deal with "cute" level names in the maps menu, if you're using it. My instinct is to force-truncate to 20-odd, so that will probably be the approach.

Tuesday, January 26, 2010

Random Stuff

  • I've done some work on making occlusion queries behave themselves better in certain scenarios. Now, if an entity was outside the view frustum or not updated from the server in the last frame, it will be automatically considered "not occluded", even if the previous occlusion query result indicated that it was occluded. This code change is currently out for testing, but hopefully it should resolve some edge cases where an entity that shouldn't have been occluded got a false positive. It means dropping a handful of frames, but I've upped the cutoff point at which queries are run from 24 triangles to 96, which gains that back and more.
  • The new water shader is quite nice; moving calculations from the pixel shader back to the vertex shader where possible gives quite significant speedups, and - in cases where the op is addition or multiplication - linear texture interpolation means no appreciable visual deterioration. I also reduced it from two sin instructions per pixel to one, but at a tradeoff of more instructions per vertex, which I figure is fair. I might convert this sin to a texture lookup at some point, but right now there doesn't seem to be a need to.
  • DirectQ currently doesn't use 32-byte-aligned vertexes; I have tested this a number of times in the past, and most recently today, but it's always turned out to be the case that the extra vertex submission load makes things an even balance.
  • I've just put in quite a cheesy hack to use a Shader Model 3 profile on hardware that supports it. There's no difference in the shaders used or in the visual output, but performance improves slightly.
  • While I remain fond of Intel graphics hardware (cheap, cheerful, predictable, good enough, common) the wildcard in this pack is still the 965. On paper it should be better than the 910 and the 945; in practice it falls quite far behind. The 945 is king of the Intel hill for DirectQ, and is capable of turning in some quite astonishing framerates (300+, despite no hardware T&L). The 910 is a little fighter, slower of course but still very capable. The 965 seems to be little more than a Vista desktop accelerator, having the bare minimum feature set required to render Aero but being borderline rubbish for anything beyond that. More recent models (4 series) are catching up to 945 perf, but still not quite there. DirectQ will still perform and behave better on any of these than any comparable OpenGL engine would.
  • It should go without saying, but if you encounter a DirectQ crash or rendering glitch your first action should be to update your drivers and your DirectX. Always. Likewise if you're overclocking or using any form of "tuned" or "optimized" driver - stop doing it and see if the problem goes away.
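The occlusion rule in the first item above boils down to "never trust a stale result". A minimal sketch, assuming three per-entity flags that are made up for illustration (the real engine's bookkeeping will differ):

```python
def treat_as_occluded(was_in_frustum: bool,
                      was_updated: bool,
                      last_query_occluded: bool) -> bool:
    """Decide whether to skip drawing an entity this frame.

    All three parameters are hypothetical names:
      was_in_frustum      - entity was inside the view frustum last frame
      was_updated         - entity was updated from the server last frame
      last_query_occluded - what the most recent occlusion query reported
    """
    # If the entity was out of view or not updated last frame, the stored
    # query result is stale, so the entity is considered "not occluded".
    if not was_in_frustum or not was_updated:
        return False
    # Otherwise the previous query result can be used.
    return last_query_occluded
```

Erring towards "not occluded" costs a few wasted draws at worst, which is a better failure mode than a model that should be visible going missing.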

Monday, January 25, 2010

1.8.1 is Out

Owing to the fact that we also now have a crash bug in e2m3 I'm getting this out now. I had originally hoped to include a little more, but fixing this takes absolute priority.

Here's the full list of changes:

  • Forced a skybox unload on game change and revalidation of the current skybox every map change.
  • Removed "Loaded n skybox components" message from worldspawn skyboxes.
  • Fixed bad texture crash on e2m3 (and possibly others) (again).
  • Added missing call to SV_PushRotate.
  • Fixed crash with > 32MB of external textures in use.
  • Reworked game loading to load pak0 to pak9 before any other pak, pk3 or zip files.
  • Removed cheesy rapid fire dlight effect and fixed screen-shaking with lightning gun.
  • Increased temp memory buffer to 128 MB.
  • Added ability to delete a save from the save/load menus.
  • Fixed several crash bugs in the save/load menus when no items were present.
Get it here.
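For the reworked pak loading in the list above, the ordering could be sketched like this. The post only says that pak0 to pak9 come before everything else; sorting the remaining archives alphabetically is purely an assumption here, and the function name is hypothetical:

```python
import re

# Matches pak0.pak .. pak9.pak only (single digit, case-insensitive).
_NUMBERED_PAK = re.compile(r"^pak([0-9])\.pak$", re.IGNORECASE)


def load_order(filenames):
    """Return archive names with pak0..pak9 first, then everything else."""
    def key(name):
        match = _NUMBERED_PAK.match(name)
        if match:
            # Group 0: the numbered paks, in numeric order.
            return (0, int(match.group(1)), "")
        # Group 1: any other pak, pk3 or zip (alphabetical is an assumption).
        return (1, 0, name.lower())
    return sorted(filenames, key=key)
```

For example, `load_order(["zzz.pk3", "pak1.pak", "extra.pak", "pak0.pak"])` puts `pak0.pak` and `pak1.pak` ahead of the other two.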

I must be mad. I've only just released this and now I've gone and done a very significant performance enhancement for 1.8.2; optimization of the water warp shader giving something between 10 and 40 extra frames in scenes with heavy liquid content. Sorry, but you'll have to wait for this one.

Release 1.8.1 Coming Soon

1.8.1 is going to happen sooner rather than later (timeframe in days) owing to the discovery of a fatal bug in my external texture loader (credit to Anonymous Adam for this). There are a few other things I want to tidy up for inclusion in it too.

Stay tuned.

Sunday, January 24, 2010

First 1.8.1 change is in

After the mad gallop to get 1.8.0 released I'm taking a slightly slower approach to 1.8.1, which at the moment I'm seeing as more of a "cleaning things up" release than anything else.

First change is with skybox handling; it's been noticed that skybox settings persist between maps and between games, leading to a situation where if game B had a skybox that wasn't present in game A, after switching to game A the sky textures wouldn't show.

I was aware of this prior to the 1.8.0 release (in fact it dates back to the first implementation of skyboxes in DirectQ) but hadn't fixed it for various reasons. It's now fixed properly; any currently loaded skyboxes are always forcibly unloaded on a game change, and the current skybox is always "revalidated" on a map change.

Thursday, January 21, 2010

1.8.0 Final is Out

Some hiccups getting there, but I see that other people have managed to download it by now so all is good. But for a while I was almost thinking there was some weird cosmic conspiracy at work to prevent this from ever being released...

So the full and final 1.8.0 has made it out at long last. Enjoy, and keep the feedback, bug reports and all the rest coming in.

I forgot to mention in my readme that there are some people who deserve special credit for this one. If you've read the comments to the alpha and beta release posts you'll have a good idea who most of them are, but one who slips through even there is the mighty JohnFitz, AKA metlslime, AKA he-of-FitzQuake-fame.

Throughout the development of 1.8.0 FitzQuake was the engine I kept going back to for comparison; it was my gold standard and also the one I felt I needed to beat (you'll be the judge of whether or not I succeeded; I think I did in some areas but I have a way to go in others). I even poached some code and ideas from it for one particular part of the work.

Without such a high standard as FQ, and without the inspiration it provided, I can't even think about how 1.8.0 would have turned out, so a massive shout-out and ass-grab for Mr Fitz is due and well-deserved.

Release temporarily delayed

Sorry about this, but I seem to have picked up some bug with occlusion queries in my Release build. Somebody somewhere needs a good slap in the face for deciding that certain code behaves differently in Debug builds than it does in Release builds, and I make a habit of always testing Release builds on account of this.

I had toyed with the option of just releasing the Debug build instead, but that's not gonna do. This needs to be done right.

So while I'm tracking down and fixing this one, things will be postponed. If I can get it done today I'll release, if I can't - well, you'll know as soon as I do.

Update: Problem solved and all is well. The wrong protocol version had been sneaking into release builds causing static entities with a modelindex > 255 to be dropped. This is something that I'm going to need to go back and fix properly for 1.8.1 as it shouldn't have happened for the protocol version that did get used, but for 1.8.0 I've just corrected the version used in release builds.

I'm running the final full release build right now, one more run of the engine, then it will be up.

Expect 1.8.0 Full Release within 24 hours

One more rough corner knocked off; I can now load external 2D menu/HUD pics from /gfx as well as from /textures/gfx. The texture loader is marked for revisiting/revamping in 1.8.1, so I'm going to let this out into the world once I've completed some final in-house tests.

Meanwhile here's what LordHavoc's CaveTest2 map looks like running at playable framerates. Just to re-emphasise the point.



Yes, I god-moded. Wouldn't you?

Tuesday, January 19, 2010

DirectQ FPS - A Scandalous Secret Uncover'd (and some new features and fixes)

If you've paid attention to DirectQ's FPS counter you would have noticed that it gave a rate of 71 FPS instead of the full default 72 in most scenes. I had noted a while back that this was due to a rounding error, and that I would fix it when I could be bothered to.

Today I decided to bother, and after some digging the bad news is that I'm not going to fix it, but the good news is that I can now explain what's going on (which with hindsight should have been obvious).

Quake by default runs at a maximum of 72 FPS, which means that every 13.888888 milliseconds a frame will run. However, Quake also used a CPU performance-based floating-point timer which was quite prone to drift and unevenness on modern machines. Some time ago (one of the 1.7s I think) I replaced this with a millisecond-precise integer (DWORD actually) timer; the same as that used by QuakeWorld and QuakeII (I haven't looked at this part of the Q3A code but I would guess it's the same there too).

Now the thing about integers is that they don't accept values in-between. So rather than ticking over at the 13.88888 millisecond mark, DirectQ ticks over at 14 milliseconds (it can't at 13 because 13 is lower than 13.888888). This tick over at 14 milliseconds equates to a framerate of 71.428571etc frames per second, and as DirectQ displayed its FPS as an integer, it got lopped down to 71. So there you go; the basic laws of numerics worked against fixing this, and DirectQ actually updates ever so slightly slower than Quake (but at the same speed as QuakeWorld/QuakeII) - what's 0.111111111 of a millisecond between friends?
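The arithmetic is easy to check. This is just a worked example in Python, not engine code, and the function names are made up:

```python
import math


def integer_frame_interval_ms(max_fps: float) -> int:
    """Smallest whole number of milliseconds that is not shorter than 1000/max_fps."""
    return math.ceil(1000 / max_fps)


def effective_fps(max_fps: float) -> float:
    """Framerate actually achieved when frames tick on whole milliseconds."""
    return 1000 / integer_frame_interval_ms(max_fps)


interval = integer_frame_interval_ms(72)   # 14 ms, up from 13.888... ms
fps = effective_fps(72)                    # 71.428571...
shown = int(fps)                           # 71 on an integer FPS counter
```

The same rounding applies at any cap that doesn't divide evenly into 1000 milliseconds.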

For 1.8.0 I went millisecond-precise as far as I could, which unfortunately wasn't too far as QC and the protocol both require floating point time input in seconds. However any floating point drift will only last for 1 frame, and DirectQ is immune to drift and instability from your CPU performance counters. I have also now added a decimal place to the FPS counter so that you can see in more detail what's going on.


Two new features were just added. I know I declared a feature freeze, but if the guy building the engine can't make his own decisions on what goes in, who can? Anyway, they were fun and very quick to code.

The first is allowing for alpha in weapon models during regular gameplay by using values of r_drawviewmodel between 0 and 1; if this is considered "cheating" because it shows more of the screen, then so should r_drawviewmodel 0 be. A server that forces r_drawviewmodel 1 will always override it anyway.

The second is adding some extra range to looking up and down; you can now go the full 90 degrees in either direction by setting cl_fullpitch 1 (it's 0 by default for consistency with ID Quake).

The last remaining rendering glitch has just been fixed: I needed to force a texture change if a VBO fills up, otherwise we would have invalid textures and possible white-flash and/or binary garbage when running large maps with r_novis 1. The full and final 1.8.0 build is therefore pretty much ready to roll, but I'm gonna sit on it for a few more days in case anything last-minute comes in. Plus it will mean that I get to play Quake with it before you can - HA-HA!

Beta Blues

Some bugs in, for those of you who missed them in the comments.

On some 3D cards it seems that I messed up a texture blend mode for alias models, so anything with a fullbright appears all grey instead. I think I have a solution for this and will update. (Update: it worked.)

r_novis 1 is kinda also messed up at the moment; I'm guessing that it's overflowing a VBO but will need to do further testing.

I've dropped the 1.8.0 release from being my "recommended" release as a consequence of these - just knew I was being a little premature with that!

Monday, January 18, 2010

1.8.0 Beta is Out

The Slipgate Complex
Castle of the Damned
The Necropolis
The Grisly Grotto
Ziggurat Vertigo
Gloom Keep
The Door to Chthon
The House of Chthon

Unless your Quake engine can run those the way they are meant to be run, no matter how fast or pretty it might be, it ain't worth a damn.

1.8.0 Beta made it through this evening and so it is now unleashed.

I now consider the 1.8.0 codebase to be feature-complete, so start yelling about what you'd like to see in 1.8.1; there will however be one full and final release of 1.8.0 before that, which will incorporate any bugfixes required between now and then.

DirectQ Gamma Creep

Today I finally tracked down and fixed what was probably one of the longest-standing DirectQ bugs: the dreaded "gamma creep" where your screen would gradually get brighter the more you ran DirectQ. This was quite subtle and most people probably wouldn't notice it as the brightness increments were very small, but over time they did build up (especially noticeable if you're developing and running the engine 30+ times in a single logon session).

This was an artefact of the way DirectQ originally handled gamma, plus the way it continued to handle gamma changes. When I originally implemented brightness changing I used D3D gamma, which it turned out did not work as documented (the documentation was not even complete as - while it mentioned the formats used - it said nothing about ranges) and didn't even work at all for windowed modes.

From there (roundabout 1.2 or 1.3 I would guess) I implemented Windows GDI gamma, but instead of tearing down the D3D gamma system I only did half an implementation, so that D3D gamma was still used for reading your gamma, but GDI gamma was used for setting it.

This got messy, so for 1.7 (or was it 1.6?) I switched it to pure GDI all the way, but - not willing to get stuck in and clean out what had evolved into quite an elaborate setup - I rescaled the values to match the ranges used when reading D3D gamma (writing D3D gamma turns out to use a completely different range - sigh) and kept the D3D gamma structs used for storing the values (they were already in the correct layout so they were convenient).

To add to this, for applying gamma changes I had just plugged in the formula used by ID for the old -gamma command-line option without checking that the output was the same as the input for a value of 1. Of course it wasn't (it was slightly higher).

Now, none of this would have mattered most of the time aside from the fact that DirectQ always went through the gamma-change path on startup. Even with values of 1. And its shutdown path took the values that had been scaled down and scaled them back up. And wasn't even called if the video had failed to initialize.

All in all quite a bit of sloppy work, and a good lesson in the dangers of maintaining legacy code paths that seemed to work OK, even when you know they're ugly and need cleaning out.

This has now been completely ripped apart and simplified, with correct full native GDI gamma all the way (D3D gamma still sucks), using its own correct data formats, no scaling down and back up, the gamma change formula corrected and triple checked to ensure that output exactly matches input for values of 1, no matter how many passes through it are made, the initial pass through the change routine removed, correct restoration of the initial monitor gamma made if video fails to start up, and the previous 3 or so different codepaths for changing gamma all cleanly consolidated.
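The "output exactly matches input for values of 1" check can be sketched as follows. This assumes a plain power-law curve over a 256-entry ramp of 16-bit values; it is not the ID formula and not DirectQ's actual code, and all names are hypothetical:

```python
RAMP_SIZE = 256
RAMP_MAX = 65535  # GDI gamma ramps hold 16-bit values


def identity_ramp():
    """A linear ramp from 0 to 65535 (255 * 257 == 65535)."""
    return [i * 257 for i in range(RAMP_SIZE)]


def apply_gamma(ramp, gamma):
    """Apply an assumed power-law gamma to every ramp entry."""
    return [round(RAMP_MAX * (value / RAMP_MAX) ** gamma) for value in ramp]


def survives_repeated_passes(ramp, passes=30):
    """True if running the ramp through gamma 1 `passes` times changes nothing."""
    current = list(ramp)
    for _ in range(passes):
        current = apply_gamma(current, 1.0)
    return current == list(ramp)
```

A formula that comes out even slightly high at a value of 1 would fail `survives_repeated_passes`, which is exactly the creep described above.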

Removal of nasty old legacy code does give one a good feeling.

Sunday, January 17, 2010

Expect the Beta release REAL soon

I just put some of the final touches on it today and have been putting it through its paces with some test maps. The most recent items to go in were a migration of r_shadows 1 mode and the non-HLSL sky to my VBO setup and a fairly quick 'n' dirty fog implementation.

I've made my points about r_shadows before, but for now I'm leaving it in. I do remain quite unconvinced that it is something that's in any way essential for the Q1 experience, but those of you who like it seem to like it very much. I might at some stage - prior to the final 1.8.0 release - fix the sloped-plane issues; with the beta however if you see a bad guy walking on a sloped surface you can expect his shadow to be cut off. As I said before, nothing outside of shadow volumes or shadow maps will work on steps, so don't expect anything wonderful.

I don't know if I've mentioned this before but you can use intermediate values (between 0 and 1) for a more subtle effect.

Non-HLSL sky in the VBO was easy, nothing to say there.

Fog is largely based on the old 1.7.x implementation, with the usual cvars to control it. One feature of 1.7.x that I haven't ported is the interaction of fog with sky. Back then I faded off fog quite a bit when drawing sky, but this time I've just left it at full intensity. This is really something for the mapping community to call the shots on, and I have a feeling that with most maps the intention is for fog to be at full intensity on sky. I don't know for certain though.

Performance-wise there are still some scenes in some maps that stress this engine a little. By this I mean framerates might drop to 40-50. Dynamic light updating seems to be the sole remaining bottleneck I can deal with at the moment, but this is a bit of a balancing act. I'm fairly certain that it's my changed lightmap size, but reverting this would mean breaking my primitive batching, which would mean a return to framerates in the 20s.

As a compromise I included the r_fastlightmaps cvar in the alpha, but didn't say too much about it at the time. This can be used to help with dynamic lights by slowing down the rate at which they are checked for changes: instead of checking every frame it will check every second frame. My thinking is that at 72 FPS, checking for lightmap changes 36 times per second is adequate.

A value of 0 is the same as regular Quake, 1 will check true dynamic lights (explosions/etc) every second frame, 2 will check animating lightstyles every second frame, 3 will check both every second frame.
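Those four values read naturally as two bit flags. A small sketch of that interpretation (whether the engine really tests bits like this is an assumption, and the names are hypothetical):

```python
FAST_DYNAMIC_LIGHTS = 1  # true dynamic lights (explosions etc.)
FAST_LIGHTSTYLES = 2     # animating lightstyles


def should_check_lightmaps(kind_flag: int, r_fastlightmaps: int, frame_number: int) -> bool:
    """Return True if this kind of light should be checked on this frame."""
    if r_fastlightmaps & kind_flag:
        # Fast mode for this kind: only check every second frame.
        return frame_number % 2 == 0
    # Regular Quake behaviour: check every frame.
    return True
```

With r_fastlightmaps 3, both kinds get checked on 36 of every 72 frames; with 0, everything is checked on all 72.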

Right now the default is 0, but I'm thinking of making it 3. Any comments?

Saturday, January 16, 2010

1.8.0 Beta Update

I haven't done anything for a coupla days but I'm hoping to wrap up the last few outstanding jobs on the occlusions system shortly, following which the Beta release should emerge some time over the next week.

(Update: as a postscript to the occlusion queries saga, I now have the full system implemented on all alias models and tested with vid_restart/etc. It's amazing, but in some maps it manages to exclude over 100 models from the renderer - an indication of how much extreme overdraw - or should that be "underdraw"? - was going on.)

My original intention for the Beta was that it was going to be largely feature-complete and that the only work required for the Final would be some fine-tuning and bug-fixes. That's not going to happen now, as I need to get the latest batch of optimizations out public and being tested as soon as I have something pulled together that I know stands a reasonable chance of working on as many systems as possible (right now there's a layer of error checking and capability testing that I haven't even started yet).

What you're going to get with the Beta will therefore be in a rough and incomplete state - not so much as the Alpha for certain, but there will still be some things missing. It will however include the full set of recent upgrades to the renderer, and will be capable of utterly decimating the Alpha so far as performance is concerned.

One item I'm currently chewing over is where to set the bar for entry level. Historically DirectQ has always required a 3 TMU 3D card, and with the recent batch of changes I'm moving towards requiring Shader Model 2.0 again (aside from very early versions only the 1.7.x line and the recent 1.8.0 Alpha didn't have this requirement).

I think this is starting to become reasonable as SM 4.0 is now mainstream and anything from a GeForce FX5200/Intel 910/ATI equivalent upwards is capable of SM 2.0. It would also enable full deprecation of some legacy fallback modes.

It's also the case that with 1.8.0 DirectQ has moved far beyond its original goal of being a D3D port for people who couldn't use OpenGL engines (that slot is now filled by my D3D8 ports of other engines, and I do believe that a D3D8 QRack is going to happen without my involvement, which I am very happy to see). It's now a viable entity in its own right, and is capable of throwing some punches that other engines cannot (yet - it's only a matter of time though).

No final decision, but as well as incremental upgrades to 1.8 (you can probably expect a 1.8.1, 1.8.2, etc) there is also the question of what 1.9.0 is going to look like, and I think that the answer (or at least part of it) will involve ditching those legacy modes. We'll see (and above all it will be fun finding out).

Friday, January 15, 2010

The final word on Occlusions

I don't think I'm going to bother with occlusion testing on the rest of the entities (monsters/etc). Having thought about it, it occurs to me that most of the time these will be moving (not always but it's a good enough general rule) and actively hunting for you if you've been spotted, so the chances of a query result being invalid on a frame-to-frame basis are going to be quite a bit higher (especially as you're also moving).

With static entities (mostly torches) the worst case is that a torch that should be visible won't actually be for about 0.01 to 0.02 seconds, or a torch that should be hidden will be actually drawn behind a block or wall for a similar amount of time.

One thing I think I might do however is test for results every pass through the main loop, not just on passes where something is rendered (Quake renders new frames 72 times per second only). This should be a quick enough operation and might even let the results come in a more timely fashion.

Otherwise the only work that's left is the grunt work of checking if your hardware supports occlusion queries, bypassing the tests if it doesn't, and tying them into the dreaded D3DERR_DEVICELOST situation (although I think I have that covered by having them auto-destruct and re-create on-demand if an unknown error occurs, but I need to be certain).

UPDATE:

I may have found a way to run them on all entities all of the time without any concerns re: visibility. It'll be at least a day now before I can get stuck in again, but I'll report when I'm ready.

Thursday, January 14, 2010

More Occlusions/r_speeds

I took some time today to dig through the documentation on IDirect3DQuery9, and now have the naive version of occlusion queries successfully implemented in DirectQ. As I suspected I was doing things the wrong way, and also the information on what the right way was had been buried a little (although fortunately not as deeply as I had originally feared).

With the naive version we're basically issuing the query and testing for its result in the same frame (immediately after issue actually, so there is no delay to allow the query to flush). This requires completely flushing the command buffer and stalling the GPU while fetching the result, which lops a clean 100 FPS off ID1 timedemos and effectively wipes out the speed increases I had gained since the alpha release in complex scenes. Useless in other words.

The next step is to implement the non-naive version. IDirect3DQuery9 is essentially set up as a mini-state machine, but it does not provide a means of testing its state outside of directly querying the query (!) so I'm going to wrap it in either a struct or a class that will more or less manage that aspect of it for me.
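The kind of wrapper meant here might look something like the sketch below. It is Python rather than the engine's own language, it makes no real Direct3D calls, and every name in it is hypothetical; the `poll_result` callable stands in for a non-blocking result fetch that returns None while the GPU is still busy:

```python
from enum import Enum, auto


class QueryState(Enum):
    IDLE = auto()     # nothing in flight; safe to issue a new query
    WAITING = auto()  # issued, result not yet fetched


class TrackedQuery:
    """Remembers the query's state so the renderer never has to stall on it."""

    def __init__(self, poll_result):
        # poll_result() returns None while pending, else the visible pixel count.
        self._poll_result = poll_result
        self.state = QueryState.IDLE
        self.occluded = False  # last known answer; default to drawing

    def issue(self) -> bool:
        """Start a query unless one is already in flight."""
        if self.state is QueryState.IDLE:
            self.state = QueryState.WAITING
            return True
        return False

    def update(self) -> bool:
        """Poll without blocking; keep the old answer until a new one arrives."""
        if self.state is QueryState.WAITING:
            pixels = self._poll_result()
            if pixels is not None:
                self.occluded = (pixels == 0)
                self.state = QueryState.IDLE
        return self.occluded
```

The point of the design is that `update` never waits: until a fresh result lands, the renderer carries on with the previous answer, which is where the few frames of lag come from.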

If this doesn't bear fruit, I have found an article on software occlusion which treats each occluding object as a frustum and then uses frustum culling to manage the occlusion. Obviously a far more complex setup (not least because occluding objects can have more than 4 sides); at the very least I would need to spin it off in its own thread and run it concurrently with the main render.


If you've turned on r_speeds 1 with the alpha release you may have noticed that the counts are quite a lot lower than with GLQuake. What I did here was increment each count for each DrawPrimitive rather than for each polygon that is rendered. With hindsight this was the wrong thing to do, and DirectQ r_speeds should be consistent with GLQuake (if for no other reason than to show the performance differences for any given set of values). I'm going to restore this to the way it should be for the beta.

I'll just make the excuse that I was interested in seeing how many Draw* API calls I had managed to save. ;)

UPDATE:

Aaaaahhhhh, success. Sweet sweet success.

Here's the deal:
  • I currently only have occlusion queries on static entities. It would be slightly less trivial to put them on permanent entities, but I'm going to go for it anyway - it's not a deal-breaker if I fail though (they don't go on the viewmodel at all - for obvious reasons).
  • It's probably not worth the bother putting them on temp entities. Most of these will be knight, vore or wizard spikes, or nails or bolts, and will therefore pass the tests anyway.
  • Trivial entities don't get occlusion queries run against them at all. Running queries requires some state-changes (which I can batch up, but they're still there) plus drawing a bounding box for each entity tested. The bounding box requires 12 triangles, so I've set the "trivial mark" at 24 triangles in the model just to be sure.
  • By their nature the results will lag a few frames behind what you see on screen. In practice you don't notice at all; there are no models suddenly popping into view or anything like that. The alternative is worse (the naive implementation).
  • I don't run occlusion queries at all during timedemos, so don't go looking for timedemo speedups here because you won't get them. Reasons why are firstly because they actually slow down timedemos, and secondly because a timedemo runs so fast that by the time the results come in the scene will have changed so much that they are well out of date.
  • I currently don't have a cvar to disable them, but I'm inclining towards creating one.
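Pulling the rules from the list above together, the "does this entity get a query at all?" decision could be sketched as below. The names are hypothetical, and treating a model of exactly 24 triangles as trivial is an assumption:

```python
BBOX_TRIANGLES = 12  # cost of drawing the bounding box for one query
TRIVIAL_MARK = 24    # models at or below this are not worth testing


def wants_occlusion_query(entity_kind: str, triangle_count: int, timedemo: bool) -> bool:
    """Return True if an occlusion query should be run for this entity."""
    # Queries slow timedemos down and their results arrive too late anyway.
    if timedemo:
        return False
    # Only static entities are covered so far (no temp entities, no viewmodel).
    if entity_kind != "static":
        return False
    # Testing costs a 12-triangle box plus state changes, so skip cheap models.
    return triangle_count > TRIVIAL_MARK
```

A complex torch model passes all three tests, which is why scenes full of them benefit the most.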
And the really important part?

I now lock at 72 in those ne_tower scenes.

The most beautiful sight in the world...



Especially if the previous one you had seen was this:



What this means is that my VBO locking optimizations worked and I pulled another 10-15 frames from the "difficult" ne_tower scene. The previous Marcher shot that I got 60 on is now locking at 72 on the same hardware (DirectQ's FPS counter is one off owing to a rounding error; I'll fix it when I can be bothered).

This is almost motivating me to start looking at occlusion queries again, as I definitely feel that I can get ne_tower locking at 72 as well. I'm so close and I know that if I get a solution to the heavy use of complex torch models I will be there or thereabouts.

A coupla things I didn't bother putting VBOs on: particles (I tried it and they were much slower - obviously this rendering paradigm doesn't suit particles too well), brush model alpha surfaces (I probably will sometime but it would mean writing the same rendering loop a fifth time over, so I need to start integrating first) and the 2D stuff (the number of texture changes involved here, and lack of vertex reuse, would likely give the same end result as particles, although I haven't tested). I also had the non-HLSL sky in one with previous releases but I took it out for the alpha; I'll be putting it back for the beta.

It bears repeating: on trivial scenes (i.e. ID1 maps) none of this will give you much of an increase in timedemos, and no increase at all in regular play. This stuff really flexes its muscles when the going gets tough and Quake would normally be expected to start choking and dropping to framerates in the 20s (or below).

Wednesday, January 13, 2010

Even more on Vertex Buffers

I'm taking some time out to explain this as I came across some more "OpenGL vs D3D" posts while researching my occlusion queries woe.

DirectQ 1.8.0 beta will feature extensive use of vertex buffers (VBOs). OpenGL also features VBOs, but there are some subtle differences.

With OpenGL, you have 3 choices regarding VBOs:

  • Use VBOs and exclude everyone whose 3D card does not support them.
  • Don't use VBOs and lose your performance gains.
  • Maintain 2 code paths with all of the pain this entails.
Direct3D is not OpenGL.

With Direct3D you can completely write your code to use VBOs in a single code path. This code will work on any hardware that can run your engine. If your 3D card supports them in hardware, it will run them in hardware. If your 3D card doesn't, it will run them in software, and you will be no worse off than if you hadn't used them.

Direct3D VBO code would even run on a Voodoo 2.

The very same applies to vertex shaders (though unfortunately not to pixel shaders). One code path, runs on anything, with graceful degradation to software (and still at very high performance) for devices that don't support them.

Direct3D vertex shader code would also run on a Voodoo 2.

With full Shader Model 3 support.

Non-HLSL Water Warp is going to go

Or at least the "correctness" aspect of it will go. This is now definite.

Bottom-line: it's slowing down map loads, preventing me from reintegrating my renderers properly, and eating up a huge chunk of memory. At this point in time continuing to support the correctness element is going to cause a lot of ongoing grief for what is essentially a legacy mode.

I'll still be retaining a non-HLSL water mode however, and I'm going to ensure that the surfaces are at least not static (maybe something like what DarkPlaces does), but the time has come to jettison the difficult parts and get it working from the same data structures as the HLSL mode.

UPDATE

The deed is done.

Non-HLSL warps are now just a simple circling flat surface. It doesn't look too bad, so at least something is present for the non-HLSL folks.

This saves somewhere between 3MB and 6MB of memory per map, depending on the number of warp surfaces in them (and the sizes of those surfaces). As a really nice bonus it now goes through exactly the same render path as the HLSL warp, so we're finally starting to see some code integration here.

While experimenting with r_hlsl 0 mode here I noticed that I had managed to mess up the sky. I don't know if that made it into the alpha, but as I'm planning on keeping the non-HLSL sky for now (it doesn't mess things up nearly as bad as the warp did) I'm going to need to fix it.

Late-Breaking

It occurred to me after I posted this that anyone who may be annoyed by what I've just done deserves a bit more explanation. Everybody has their own pet features, and as there is only one of me I can't possibly do them all. A clean and stable code-base is my priority number 1, and implementing something involves a lot more than just writing the initial code. It needs to be maintained as time goes on, bugs that crop up in it need to be addressed, I need to ensure that it lives happily with the rest of the engine and that something else I do doesn't upset it (or that it doesn't upset something else I do).

All of this eats into my time. In the specific case of the non-HLSL warp it required a different vertex and surface format, so retaining full support for it effectively doubled my workload so far as water is concerned, and would have led to a situation in the future where it required its own ongoing handling completely separate from everything else.

It had gotten to the point where it was no longer sustainable, basically. In order to move forward I would have needed to go back and rewrite all the other surface code to conform to the non-HLSL setup (the other way around wasn't possible).

This is called "diminishing returns", and something had to go.

If you're still annoyed at this, just think that now I'll be better able to re-integrate the surface refreshes, meaning more time available for those other features you'd like to see - time which would have otherwise been spent on maintaining the non-HLSL warp.

Driving geometry submission even lower

I'm currently hatching together a plan to further reduce the overhead of vertex submission for the world (alias models are already optimal here). It occurs to me that in any given scene roughly 3-4 vertexes will contain the same x/y/z, and of those perhaps 75% will have the same texcoords (and textures). That's an awful lot of duplicate data going to the GPU. Now, I already index on a per-poly basis, and PIX counters show that I can submit over 2000 vertexes in a single batch in some real-world examples, but I think I can drive that down to about the 1000 mark.

The idea is to store a framecount with each vertex, and if the vertex was already used this frame we just need to submit the index with which it was used (this would also need to be stored). Otherwise we submit the full vertex and update its framecount.
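A minimal sketch of that framecount idea, written in Python purely for illustration (the engine itself is not Python, and the `Vertex` class and `submit_polys` function are hypothetical names, not DirectQ code). It ignores lightmaps entirely, which is exactly the part that has not been worked out yet.

```python
class Vertex:
    """A world vertex plus the bookkeeping the idea needs (hypothetical layout)."""
    def __init__(self, xyz, st):
        self.xyz = xyz
        self.st = st
        self.framecount = -1   # last frame in which this vertex was submitted
        self.index = -1        # the index it was given in that frame's batch


def submit_polys(polys, framecount):
    """Build one batch: each distinct vertex goes in once, polys refer to it by index."""
    batch_verts = []
    batch_indexes = []
    for poly in polys:
        for v in poly:
            if v.framecount != framecount:
                # first use this frame: submit the full vertex and remember its index
                v.framecount = framecount
                v.index = len(batch_verts)
                batch_verts.append((v.xyz, v.st))
            # whether new or already used this frame, the poly only needs the index
            batch_indexes.append(v.index)
    return batch_verts, batch_indexes
```

Two triangles sharing an edge would then submit four full vertexes and six indexes rather than six full vertexes.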

I haven't fully factored lightmaps into this yet however, and right now it seems as though they would completely scupper the idea, but I'm thinking that even if they do I could probably still obtain some useful benefit in the area of overall memory usage for the map (for example, I already have the dreaded warpc down to 29MB, and could probably get that down further to 20MB). Even so... I've already implemented a simpler version of the DirectQ renderer on stock ID GLQuake using the non-multitextured path, and the end result was almost on a performance par with DirectQ (both without VBOs, although bearing in mind that DirectQ has some heavier operations in some areas - particularly lightmaps - that drove it down here). It may well be the case that loss of multitexturing is a fair trade-off if the saving on submission turns out to be in excess of 50%. (In an ideal world APIs would let us index vertexes and texcoords independently, but neither D3D nor OpenGL supports that.)

This would also open DirectQ up to 3D cards with fewer than 3 texture units. I don't anticipate that there are too many of these around any more; even the original GeForce 256 from over 10 years ago has 4.

Before I do any of this I need to establish some better sanity in my VBO lock/unlock operations. Right now I do an unlock/draw/lock with each and every texture change (I still get good performance despite this) and I need to batch these up so that I only unlock when the VBO is full, then follow that with multiple draws.
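The batching described here can be sketched roughly as follows. This is a Python simulation only: `BatchedBuffer` is a hypothetical name, nothing here touches a real D3D buffer, and the "unlock" and "draw" steps are just counted and logged.

```python
class BatchedBuffer:
    """Simulates filling one VBO and deferring the unlock until it is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.pending = []   # (texture, first_vertex, num_vertexes) waiting for the unlock
        self.unlocks = 0    # how many times the buffer had to be unlocked
        self.draws = []     # every draw issued, in order

    def add(self, texture, num_vertexes):
        if num_vertexes > self.capacity:
            raise ValueError("surface is larger than the whole buffer")
        if self.used + num_vertexes > self.capacity:
            self.flush()    # buffer is full: unlock and draw what we have
        if self.pending and self.pending[-1][0] == texture:
            # same texture as the previous range: just grow that range
            tex, first, count = self.pending[-1]
            self.pending[-1] = (tex, first, count + num_vertexes)
        else:
            # texture change: start a new range, but do NOT unlock yet
            self.pending.append((texture, self.used, num_vertexes))
        self.used += num_vertexes

    def flush(self):
        if not self.pending:
            return
        self.unlocks += 1                 # one unlock for everything written so far
        self.draws.extend(self.pending)   # followed by one draw per texture range
        self.pending = []
        self.used = 0
```

With this shape a texture change only costs a new draw range; the unlock count depends on how often the buffer fills, not on how many textures are in the scene.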

Generally when I do things like this I pick on the alias models renderer first. Everything is nicely self-contained in a single function, and overall it's simpler to work with. It's also brought me good luck with advances in the past; every advance I've successfully made here has also been of great benefit elsewhere.

Let's see if it happens again this time.

After recent adventures with the renderer...

...I did something a little more - shall we say - mundane today, but just as important in the bigger scheme of things.

Some of you may have noticed that the kill counts in my load/save menus had a tendency to be slightly wrong in some cases. Some of you may have even been too polite to tell me about it.

The cause of this was that I was trying to be fancy and read the kills out of the savegame comment rather than the globalvars lump, and doing even that wrong. It's fixed now, and being done the right way (globalvars).

In truth that code dates back over 2 years now, and believe it or not I wasn't totally comfortable with using COM_Parse at the time. When I came to do the load/save menus in DirectQ I just copied and pasted it pretty much as it was.

There are some other areas where the DirectQ code is crufty and old. Its texture loader needs to hang its head in shame for example, and that's up for a good makeover shortly too. I don't yet know if it will make it to the beta, but I intend for it to be present in the final.

Tuesday, January 12, 2010

D3D Occlusion Queries are Eeeeeeeeevil

I haven't yet decided if they were born evil or if they're making a conscious effort to be evil, but evil is what they are nonetheless.

The trick with occlusion queries is to prevent them from stalling the pipeline while fetching the results. In order to do this I use a technique where I get the previous frame's results before issuing the current frame's queries (checking of course that there were results to get in the first place).

Now, there are certain things about D3D queries where things go bad here. In D3D, if a query has been issued its results MUST be fetched before it can be reissued. Zere iz nein otzer OPTION!

So how does D3D respond if you accidentally do this? Does it fail silently? Does it discard the pending results and just reissue? Is there anything even remotely graceful or elegant about its handling of the situation? What do you think?

No.

It locks your GPU.

Violently.

No further comment, m'lud.

FOLLOW-UP

In attempting to track this down I implemented the "naive" version of occlusion queries and just did an issue/getdata directly in the same frame. It still locks. Then I split the loop in 2, did a batch of issues, followed by a batch of getdatas. Still locks.

The current theory here is that there is some undocumented thing going on when you combine occlusion queries with VBOs, as this approach had worked before.

Or maybe it is documented but you need to join the dots yourself between 47 completely different and unconnected pages of the SDK, then follow a convoluted train of logic before you get to the info you need (just like so many other things, in other words).

Anyway - guess what D3D feature I won't be using? Maybe I'll find a way of doing this in software instead, as I really do want to be able to filter out those entities.

Sigh. It's times like this that I miss OpenGL.

60 FPS



And that's on my Intel.

This is actually a bit of a weird one, as almost exactly one year ago (I'm maybe an hour or so later) I released the first ever screenshot of DirectQ running Marcher. So here we are again.

The original post can be viewed here. The real irony in this is that I had finished it off with a discussion of DrawPrimitive vs DrawPrimitiveUP, and had come down in favour of the latter on grounds of flexibility and convenience. Oh how times change. Today's work has probably been one of the most productive in terms of performance ever (I think that deconstructing the alias model format, and handling brush models in huge indexed batches would beat it, especially as without those I would never have pulled the extra performance today) and it's all been VBO stuff.

Holy Bloody Jumping Jeepers WOW!

World surfaces just went across to my VBO. Utterly pain-free again. That infamous/notorious/despicable/insalubrious/etc ne_tower scene has just jumped to 60.

Marcher is still a quite painful 40-50, so I'm going to do elements of the sky next and see what happens.

Right now I'm quite amazed about 2 things. Firstly that it took me so long to uncover the correct way to do VBOs in D3D (why don't Microsoft have it written in HUGE letters on the first page of the SDK that THIS IS THE WAY TO DO IT?) Secondly that I got such an impressive gain from world surfaces. I had been working on the basis that vertex submission was not a bottleneck in my setup, that fillrate was where it was all at for the world, and that I had squeezed as much as I could out of the performance with big scenes. I was quite obviously wrong and there's a lot more still to come.

If you thought the Alpha was fast, just wait till you see the Beta.

Monday, January 11, 2010

Even more Vertex Buffer Joy/HLSL stuff

I've finalised the VBO setup; it turned out that a single huge VBO was the most optimal path for rendering in all cases, so that's the way it is. A pity as the rotating buffers API I had written was genuinely lovely.

I'm going to commence moving the world over to it shortly and we should start seeing some genuine fireworks. Even the pre-VBO alpha release was already quite adept at chewing up large amounts of scenery, but this should raise it to another level again.

D3D VBOs are really nice in that they will also work perfectly without any code changes on machines that don't support them in hardware (like D3D vertex shaders actually; in fact - and aside from matrixes - D3D has probably got a far nicer vertex pipeline than OpenGL could ever dream of).

...And speaking of shaders...

Straight-up fact: if you have hardware that is shader model 2.0 capable you should be using r_hlsl 1 mode, unless you have a very specific problem that updating your drivers and your DirectX doesn't fix. This includes even if you find r_hlsl 0 to be faster in timedemos. In terms of correctness it's the best mode, and it will be the mode that receives the most support, love and affection from me going forwards.

Don't be too surprised to see the correctness of r_hlsl 0 mode becoming deprecated as time goes by, especially as I start reintegrating the sky/water/world paths.

More on Vertex Buffers

I switched alias models over to my proposed new vertex buffer approach and - even without the cycling pool idea - got a nice performance boost (about 10 to 15 frames) from the infamous ne_tower scene (vertex and index buffers in D3D actually seem to be extremely intelligently designed for reuse multiple times per frame). Note that this is without the occlusion queries I was talking about earlier, so I think that on the appropriate hardware, and after I implement both the cycling pool and occlusion queries, I might be able to get this scene close to locking at the 72 FPS mark. Typical ID1 scenes won't see much benefit as they don't stress the hardware enough, however.

Thankfully the rendering design I had set up was able to make the changeover very painlessly. Just replace my memory buffers with the equivalent D3D buffer objects and it worked perfectly first time.

Phew!

The next big question is how large should the cycling pool be? Right now my buffers hit the 65536 objects (either vertex or index) per buffer mark, so with a pool of 8 of them we're looking at maybe 13MB of video RAM. That is nowhere near the last word on the subject though, as the buffer size does NOT need to be that large. I can easily go for maybe one 10th of the size and still retain full performance (my personal feeling is that one 100th might even be adequate), so increasing the pool size to something like 32 or even 64 would give less video RAM overhead (5MB and 10MB respectively at a buffer size of 6000) and enable more optimal throughput - just because D3D can reuse the same buffer multiple times per frame doesn't mean that you should (I think!!!)
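The arithmetic behind those figures can be checked with a few lines of Python. The 26 bytes per entry is an assumption inferred from the 13MB figure (for example a 24-byte vertex plus a 16-bit index); the post does not state the actual vertex size, and `pool_megabytes` is just a hypothetical helper.

```python
MB = 1024 * 1024
BYTES_PER_ENTRY = 26   # ASSUMPTION: 24-byte vertex + 2-byte index, inferred from the 13MB figure


def pool_megabytes(pool_size, entries_per_buffer, bytes_per_entry=BYTES_PER_ENTRY):
    """Video RAM used by a pool of paired vertex/index buffers, in megabytes."""
    return pool_size * entries_per_buffer * bytes_per_entry / MB


big_pool = pool_megabytes(8, 65536)     # the current setup
pool_32 = pool_megabytes(32, 6000)      # 32 small buffers
pool_64 = pool_megabytes(64, 6000)      # 64 small buffers
```

Under that assumption the current pool of 8 comes out at 13MB, and the pools of 32 and 64 at a buffer size of 6000 come out a little under 5MB and 10MB, which agrees with the numbers quoted.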

Currently I've got maybe 8 DrawPrimitive calls for alias models in a typical ID1 scene, topping out at perhaps 40-50 in the most OTT huge maps. I don't see those figures changing much with a buffer size of 6000, so 32 seems a nice pool size. Although I eventually intend adding the world in there too, with that kind of pool size there should be more than ample time for drawing from any given buffer to complete before its turn comes around next.

Of course these are all figures that I'll play around with and see what gives the best balance of performance vs storage efficiency. It may even turn out to be the case that I don't actually need the cycling pool at all and I'm already at my performance peak, but we'll see.

Thankfully the previous release was an alpha as it means I have more freedom to play around with the way things work and add and remove features as I see fit. There will be some minor consequences of this approach, such as losing the ability to switch hardware T&L on or off and to control the variable max submission batch size.

Anyway, overall what we have here is a neat and efficient setup that will boost performance in difficult scenes ever further, so it's quite a fair tradeoff, don't you think?

Vertex Buffer Strategy

Been thinking a little about the Vertex Buffer plan I wrote about recently. One idea that occurs to me - for the world and other brush models - is to fill a single static VBO with geometry data at load time, then use dynamic Index Buffers instead. So the world render path would just need to fill in indexes. I could cycle through a pool of maybe 6 index buffers, using a different one per frame (and keeping some spare in case an active one runs out of space) thus getting optimal use out of D3D's 3 frame thing. The only issue I can see is that I would definitely require vertex shaders here, but overall it should be extremely efficient.
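As a rough sketch of the cycling pool (Python used only as a simulation; `IndexBufferPool` and its methods are hypothetical names, and each "buffer" is just a list), the static vertex data would be filled once at load time and each frame would write nothing but indexes:

```python
class IndexBufferPool:
    """Round-robin pool of dynamic index buffers over one static vertex buffer."""
    def __init__(self, pool_size, capacity):
        self.buffers = [[] for _ in range(pool_size)]
        self.capacity = capacity
        self.current = -1          # call begin_frame() before adding anything

    def _advance(self):
        # step to the next buffer in the pool and start it empty
        self.current = (self.current + 1) % len(self.buffers)
        self.buffers[self.current].clear()

    def begin_frame(self):
        """Use a different buffer each frame."""
        self._advance()

    def add_indexes(self, indexes):
        """Append indexes; if the active buffer runs out of space, take a spare."""
        if len(indexes) > self.capacity:
            raise ValueError("more indexes than one buffer can hold")
        if len(self.buffers[self.current]) + len(indexes) > self.capacity:
            self._advance()
        self.buffers[self.current].extend(indexes)
        return self.current        # which buffer the draw call should use
```

The pool has to be large enough that a buffer is not reused while an earlier frame may still be drawing from it, which is why the spares matter.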

This is all off the top of my head at the moment, of course.

Upcoming changes for the beta release

It won't happen for a while yet as some of the changes required involve a massive tidy-up job on the renderer. Right now, if you've looked at the source code, you'll notice that I have multiple versions of what is broadly the same code handling HLSL water, non-HLSL water, sky, solid surfaces and maybe a few other things. These all need to be integrated into a single path. The core concept of the renderer is quite solid but I'm going to be maintaining this code for the foreseeable future and the implementation is not quite up to scratch in general tidiness.

The current plan, in addition to integrating the paths, is to maybe avoid the memory copy operation I have by writing to a write-only dynamic Vertex Buffer instead. Vertex Buffers are normally not quite suitable for situations where geometry (and vertex data) changes on a frame-by-frame basis, but I think that this time round I can obtain benefit from use of one. Unfortunately here's where D3D's somewhat arbitrary-seeming usage restrictions are going to bite as I need to create it in the Default memory pool, meaning that it needs handling conditions set up for device loss. Yuck. Still, it should mean something of a performance boost.

However, experience has taught me that locking, filling, and using the same resource multiple times per frame is a guaranteed recipe for quite severe performance penalties, so two approaches spring to mind, the first of which is to use a really really really big vertex buffer, and maintain offsets for each use of it; while the second is to have a pool of smaller vertex buffers that is cycled through. On balance I wonder if the complexity is worth the effort, especially for world surfaces where fillrate and overdraw are larger bottlenecks than vertex submission. It seems worth a try for alias models however.
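The first of those two approaches, one very big buffer with an offset maintained for each use, can be sketched like this. It is a Python simulation with hypothetical names (`BigVertexBuffer`, `reserve`), not engine code. In D3D9 terms the append case would usually pair with a `D3DLOCK_NOOVERWRITE` lock and the wrap case with `D3DLOCK_DISCARD`, but whether that is what ends up being used is not stated here.

```python
class BigVertexBuffer:
    """One large buffer; every use gets its own offset instead of relocking from 0."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0     # where the next batch of vertexes will be written
        self.wraps = 0      # how many times we ran out of room and started over

    def reserve(self, count):
        """Return the first vertex of a free range `count` vertexes long."""
        if count > self.capacity:
            raise ValueError("batch is larger than the whole buffer")
        if self.offset + count > self.capacity:
            # no room at the end: go back to the start (the "discard" case)
            self.offset = 0
            self.wraps += 1
        first = self.offset
        self.offset += count    # the next use appends after this one
        return first
```

The draw call then uses the returned offset as its starting vertex, so ranges written earlier in the frame are never touched again until the buffer wraps.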

Speaking of vertexes, a bug has surfaced where some drivers do not respect the vertex stride parameter they are given during a rendering call. What's the point of even having a stride parameter if respecting it is optional? Sigh. Anyway, in some cases I use a stride larger than the FVF because either (a) I was lazy and didn't feel like creating a new vertex type at the time, or (b) I needed to cache some extra data that wasn't used during rendering but that was needed for set up. End result is that for cases where I might have an FVF of xyz/diffuse but a vertex layout of xyz/diffuse/other-data, and I specify the correct stride for the vertex layout, the driver ignores it and works off the FVF instead, resulting in the values in the "other data" members being used for (at least part of) the xyz for the next vertex. This generally happens for draw operations that retained the old 1.7.x and earlier code, like the 2D stuff (including the underwater warp), primarily. Not too difficult to fix, but annoying all the same.
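To make the failure concrete, here is a small Python illustration using the `struct` module. The layout (three floats, a 32-bit diffuse colour, then two floats of "other data") is an assumed example, not the engine's actual vertex type, and `read_xyz` simply stands in for whatever the driver does when it walks the buffer.

```python
import struct

LAYOUT = struct.Struct("<3fI2f")   # xyz / diffuse / other-data: 24 bytes, the correct stride
FVF = struct.Struct("<3fI")        # xyz / diffuse only: 16 bytes, what the FVF implies


def read_xyz(buffer, stride, count):
    """Read the xyz of each vertex, stepping through the buffer by `stride` bytes."""
    return [FVF.unpack_from(buffer, i * stride)[:3] for i in range(count)]


# two vertexes, each followed by two floats of cached set-up data
data = LAYOUT.pack(1.0, 2.0, 3.0, 0xFFFFFFFF, 9.0, 8.0) + \
       LAYOUT.pack(4.0, 5.0, 6.0, 0xFFFFFFFF, 7.0, 6.0)

good = read_xyz(data, LAYOUT.size, 2)   # driver respects the stride it was given
bad = read_xyz(data, FVF.size, 2)       # driver works off the FVF size instead
```

With the correct stride the second vertex reads back as (4, 5, 6); with the FVF-derived stride it reads as (9, 8, 4), i.e. the first vertex's other-data followed by part of the real position.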

I'm thinking of dropping support for the fixed pipeline path entirely. Complexities came in from requiring to support it for water surfaces where a totally different and non-compatible vertex/index layout was required (owing to surface subdivision) and it's just going to make things fiddly and awkward going forward. That doesn't mean that a full-blown HLSL-on-everything path is going to come back, but it does mean that the two surface types where there is currently a HLSL or fixed option (sky and water) might drop the fixed option.

I wonder if there is anything still in reasonably common use that doesn't support at least Shader Model 2.0? Even Intel GMA900 stuff supports it.

MAX_BEAMS has gone up to 29,125. Now if you need almost thirty thousand lightning bolts flying around in a map you will be able to have them.

Update/addendum: did I say 29,125? I meant 29,127 of course. Those extra 2 bolts might be crucial...

Sunday, January 10, 2010

1.8.0 Alpha is Out

OK, I've released it.

Coupla points.

Firstly, you should be aware that this release is functionally incomplete compared to 1.7.666c - don't go looking for the same level of effects in it. There are a few things I've left unfinished, and they're all explained in the readme that comes with the engine (you do actually read readmes, don't you?)

Secondly, now is not the time for feature requests along the lines of "your engine would be cooler if it supported modding/multiplayer features x/y/z". That's not the purpose of this release. By all means save them up and bring them out later, but please hold them back for now.

The purpose of this release is as a test of the new rendering capabilities on a wide range of 3D hardware and computers. I'd appreciate it if all feedback could be confined to that aspect of it. There have been some very significant enhancements and changes throughout almost the entire rendering path, and I need to hear of any problems people might have sooner rather than later.

Don't go looking for flashy new graphical effects; you won't find them. DirectQ remains true to its ID Quake heritage.

Do go looking for performance improvements in large and complex maps. You know those maps that drag other engines down to framerates in the 10s or 20s? This engine should gobble them up and come back looking for more.

That's it for now until the first announcements about the Beta appear.

Shadows

I finally knuckled down and tackled the disgusting piece of hackery that is Quake's shadow code. For now my objective was "just get them working" and after a few issues relating to integration of the old shadow code with my new setup and the order of matrix transforms they came out OK. They're quite fast but can never be as fast as rendering without them.

At the moment I haven't bothered with properly projecting them on to the plane or stretching them off to one side of the model.

I'm in two minds about this shadows thing. Carmack wrote in the original GLQuake readme that they weren't meant to be very robust, but just a neat little trick that he pulled off quickly enough and was rather fun to look at. Software Quake doesn't have them. Original GLQuake shadows didn't work right on sloped surfaces, and nothing outside of shadow volumes or shadow maps will work right on steps. If a model is just outside of your FOV its shadow will abruptly disappear, even if it had been stretching half way across your screen in the previous frame. They hurt performance. They have nasty looking hard edges.

What this is leading up to is that I'm thinking of removing r_shadows at some point in the future. They're a novelty trick that makes you go "gee whizz" for about two seconds. If shadows are to be done at all they should be done right (DarkPlaces), and right now that's beyond the scope of DirectQ. Shadows (GLQuake style) really belong in the same big bag of discarded nastiness that mirrors, lack of fullbrights and lack of overbrights have long since been confined to.

I'll leave them in as they are for now, but don't be too surprised if you see them go sometime.

Illegible Server Message and Quoth

If you get an "illegible server message" with DirectQ, and you are using Quoth with a mod that doesn't require it, the solution is to not use Quoth.

I know it's a pack that a lot of people like, but it does provide a whole steaming heap of extra content that sits on top of your basic Quake content, and when you add a mod that Quoth hasn't been tested with into the mix (and it can't possibly be tested with all of them) you are increasing the possibility of a conflict somewhere.

That sucks I know but I can't possibly legislate for any conflicts that may happen between two different mods that weren't designed to run together.

More on Occlusion Queries

I set up a basic test of Occlusion Queries on the current framework, just to satisfy myself that the ideas I had here were viable. Result is that I can pull maybe an extra 20 FPS out of that really bad scene, and get minor gains everywhere else, so it's definitely something worth pursuing further.

I was able to completely bypass the normal pipeline stalls associated with Occlusion Queries by assuming that on a really small timescale things don't actually change that much from frame to frame, and that over 90% of the time the results for any one given frame are going to be valid for the next frame. My tests have confirmed this assumption to be valid, so it's the way forward. The sequence is therefore: read back the results for the previous frame; if they're not ready yet just use the last valid result, then issue a new set of queries (this last can be deferred until such a time as the previous results become ready, with a timeout of something like 2 or 3 frames where if they're not ready by then we discard them).
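The sequence can be sketched as a small simulation. This is Python for illustration only: `LatentQuery` is a hypothetical class, the `latency` argument stands in for however many frames the GPU takes to produce a result, and no real query object is involved.

```python
class LatentQuery:
    """One entity's occlusion query, read back a frame or more after it was issued."""
    MAX_WAIT = 3   # frames to wait for a result before discarding the query

    def __init__(self):
        self.pending = None        # (frame_issued, result) while a query is in flight
        self.last_visible = True   # draw the entity until we know otherwise

    def update(self, frame, visible_now, latency):
        # 1. read back the previous query, if its result has arrived
        if self.pending is not None:
            issued, result = self.pending
            if frame - issued >= latency:
                self.last_visible = result
                self.pending = None
            elif frame - issued >= self.MAX_WAIT:
                # timed out: throw it away and keep using the last valid result
                self.pending = None
        # 2. only issue a new query once the previous one is out of the way
        if self.pending is None:
            self.pending = (frame, visible_now)
        # 3. this frame is drawn using the last valid result
        return self.last_visible
```

An entity that becomes hidden on frame 1 with a one-frame latency is still drawn on frame 1 and culled from frame 2 onwards, which is the small lag the approach accepts in exchange for never waiting on the GPU.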

It should also be possible to further optimize by just not running queries against certain entities. An entity in the same leaf as the player is one candidate, another would be any entity whose model is sufficiently simple that it's not worth bothering. The overhead from the query far outweighs the cost of just drawing it regardless.

I'll be doing the first release without it though. The reason is the way Quake handles reuse of entities; I need to be extra careful here as by the time the results come in the entity that the original query was run against may have completely changed! Any fool can get maximum performance if they're not worried about correctness (at the most extreme end of the scale it can be done by drawing absolutely nothing), and getting correctness without worrying about performance isn't that difficult either. Getting both together involves more work, and I'm not ready to include this feature without nailing it down better.

More 1.8.0 Updates

Shadows, fog and the Half Life texture palette are the only real items that remain outstanding in terms of the 1.7.x feature set. I'm working pretty flat out to get this ready for release and hope to have the first two of those items completed shortly.

To be honest I couldn't really care too much about the Half Life palette at the moment. If I heard that a mod team were working on a release that used the Half Life BSP format I could turn it around quickly enough, but until then it remains a novelty feature that's good for tech demos but not much else.

I'm of the opinion that if one was to look for an extended map format to either replace or complement Q1 BSP, Q2 or Q3A BSP are much better choices. Areaportals, light grid, lots of niggles with Q1 BSP resolved, GPL loading and rendering code - all nice. HL BSP just inherits far too much baggage and doesn't seem enough of a step forward to warrant much investment - until such a time as it's used in the Real World, of course.

Back to DirectQ. I've also started moving certain allocations I had put into my permanent memory buffer over to flexible dynamic per-map buffers. Temporary entities were the first to move across, and the max number has now increased from 512 to 21,845. Others will follow, with similar increases (I have over 1GB of address space to play with here), but I don't know if I'll get any more done for the first release. The really nice thing here is that despite the stupidly large maximum, DirectQ will only use as much memory as it actually needs, on a per-map basis.

The ability to switch hardware T&L on and off has also been added. If you have a GPU that has barely sufficient capabilities to run the Aero desktop but not much else you might get a performance boost from doing your T&L in software. This can be done at any time and doesn't need a video restart. vid_hardwaretandl is the cvar, and it's also in the Video menu.

The "directq.cfg" file thing I talked about earlier has been added and the startup sequence is now default/config/directq/autoexec, with changes written to directq.cfg on shutdown. This should help DirectQ play nicer with other engines living in the same folder as it; now it won't stomp over their config settings and they won't stomp over its.
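A minimal sketch of that startup order, in Python for illustration. The `.cfg` names other than directq.cfg are assumed from the usual Quake conventions, the cvar values are made up, and `load_settings` and `save_on_shutdown` are hypothetical helpers working on plain dictionaries rather than real files.

```python
EXEC_ORDER = ["default.cfg", "config.cfg", "directq.cfg", "autoexec.cfg"]


def load_settings(files):
    """`files` maps a filename to {cvar: value}; later files override earlier ones."""
    settings = {}
    for name in EXEC_ORDER:
        settings.update(files.get(name, {}))
    return settings


def save_on_shutdown(files, settings):
    """Write changes to directq.cfg only, leaving config.cfg to the other engines."""
    files["directq.cfg"] = dict(settings)
```

Because directq.cfg is executed after config.cfg, anything saved there wins on the next startup, while config.cfg itself is never rewritten.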

I've been struggling to get better performance out of the ne_tower map, but those torches are killing me every time. The ones used are exceedingly complex models that don't optimize to an indexed mesh very well, and are scattered all over the place - over 20 of them in many scenes. There's one particularly brutal scene at the very bottom of the tower where if you look up and around your framerates will plummet, and it's entirely down to those torches - removing static entities from the render prevents it.

I'm going to console myself that I can't at present optimize for extreme cases, and the fact that I still manage a decent 35-45 FPS in this scene (other engines get 10ish, one engine drops to 1!) is enough of an accomplishment for now. I may well come back to it and hit it with occlusion queries at some point in time, but it would be a shame to have what I think is a "cop out" solution that won't work on all hardware.

Saturday, January 9, 2010

List of Outstanding Items

Just so you know (and in case I lose my own list...)

  • Fog!
  • Extra bmodels in automap
  • Hipnotic glass???
  • Half life palette loading
  • Non-HLSL sky alpha and transforms
  • Move texture loading to client???
  • Hack up nodes/leafs max to 64k???
  • Non-HLSL water subdivision in realtime???
  • Shadows.

The Final Countdown

The list of remaining outstanding items now occupies one A5 jotter page, and is being steadily marked off. Some of them are quite minor requiring a small code change here or there, some will need a bit more work to integrate properly, and some are crazy ideas that I might try out sometime but are not absolutely required.

All going well this should hit public release within a week at the most.

Friday, January 8, 2010

Tweaks and Changes

  • Particles are back up to speed. The main issue here is that I had been doing a Mod_PointInLeaf test for each particle in order to determine if the particle was underwater or not. This made sense before I had them properly sorted, but now it's just a slowdown. If you're curious, the way I depth-sort particles is based on the spawn point of each batch, so it's a little less accurate than individually sorting them but much faster.
  • I've tweaked the menus a little bit more. The option to toggle between simple and advanced menus has been removed (you'll need to do it via the console instead) and I've added a "Special Effects Options" under "Video Options" on the simple menu which contains a selection (about 10) of the most useful options from the full advanced menus.
  • I've removed the corrected lighting of ammo-boxes/etc. This one might go back in later at some point, but right now it wasn't playing nice with my new surface refresh (I had a workaround for that though), and was playing even less nice with maps that used this type of model for things other than ammo boxes. It's also not the way Quake is meant to look.
  • I've removed the anti-wallhack from the server. I've said it before, but DirectQ is not meant to be used as a server. It's a client that's capable of running as a server, but you shouldn't go looking for much beyond the most basic of server features in it. If you do use it as a server, make sure that you play with your friends.
  • Timers are now millisecond accurate with no drift on the host, and have been adjusted for rounding errors on the client and server side. It's quite a bit of disruption to introduce millisecond accuracy on the client and server, and right now I don't want to risk breaking things. The FPS counter is still off by one though.
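
The per-batch particle sort from the first item can be sketched roughly as follows. This is only an illustration of the idea, not DirectQ's actual code: ParticleBatch, DistSquared and SortBatchesBackToFront are made-up names, and std::sort stands in for whatever sort the engine really uses.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical batch record: one spawn point shared by every particle in it.
struct ParticleBatch {
    Vec3 spawn;
    int  numParticles;
};

static float DistSquared(const Vec3 &a, const Vec3 &b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Sorts whole batches back to front (farthest first) by spawn point only.
// Individual particles are never compared, which is where the speed comes from.
void SortBatchesBackToFront(std::vector<ParticleBatch> &batches, const Vec3 &view) {
    std::sort(batches.begin(), batches.end(),
        [&view](const ParticleBatch &a, const ParticleBatch &b) {
            return DistSquared(a.spawn, view) > DistSquared(b.spawn, view);
        });
    // Postcondition: the farthest batch ends up first in the draw order.
    assert(batches.empty() ||
           DistSquared(batches.front().spawn, view) >= DistSquared(batches.back().spawn, view));
}
```

The cost is one comparison per batch rather than per particle, at the price of particles that have drifted far from their spawn point occasionally drawing in the wrong order.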

Non-HLSL Water Warp is Done

It turned out to be quite easy after all. It's been possible to substantially accelerate the subdivided rendering, so you should be able to start with a gl_subdivide_size of 24 and still get reasonably decent performance. One key factor about this engine is that heavy surface tessellation does NOT have as huge a performance impact as before. It takes a short while to drop one's paranoia about this, but it's seriously liberating to realise that you can now run just about anything far more efficiently (sometimes up to 7 times as fast) than before.

I'm taking a short break from the renderer to work on DirectQ's timers. Standard Win/GL Quake used QueryPerformanceFrequency and QueryPerformanceCounter which - while incredibly accurate - were somewhat responsible for uneven timings here and there. ID moved to timeBeginPeriod and timeGetTime for QuakeWorld and retained them in QuakeII, and I had moved to the same for DirectQ 1.7, but only in the Sys_FloatTime function. This gave a timer with integer millisecond accuracy and was rock-solid at that point, but was slightly subject to FPU drift as times went along the chain and rounding errors accumulated.

My motivation for this at the time was to enable me to drop the D3DCREATE_FPUPRESERVE flag from my video initialization, which was otherwise required to preserve the accuracy of Quake's timers (otherwise things would go really screwy and we'd slow down to half speed over the course of a few minutes).

For 1.8 I'll be using DWORD millisecond values all the way through the chain, up to the final parts where unfortunately they will need to be converted to floating point. This will preserve the timer accuracy as far as possible and should give a nice end result.

One side-effect of this is that DirectQ will now be subject to the same old timer wraparound bug as Windows 95 and 98 were, where we'll go into "undefined behaviour" territory if DirectQ is running at precisely the moment when the system it's on ticks over from 49.7 days uptime (the point at which a 32-bit millisecond counter wraps back to zero). It will only happen at precisely that moment, and afterwards everything will settle back down.

I might do an experimental setup to reproduce this situation and see what I can do to work around it, but I wouldn't anticipate that too many people will be affected by it.
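
One possible workaround, sketched under assumptions: if the engine only ever works with the difference between two readings of the 32-bit counter, unsigned arithmetic absorbs the wrap on its own. ElapsedMs and MsClock below are hypothetical names, not anything in DirectQ, and the counter values are passed in by hand so the wrap can be reproduced without waiting 49.7 days.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed milliseconds between two readings of a 32-bit millisecond counter.
// Unsigned subtraction is modulo 2^32, so the result is still correct across
// one wraparound, provided the real interval is shorter than 2^32 ms.
inline std::uint32_t ElapsedMs(std::uint32_t earlier, std::uint32_t later) {
    return later - earlier;
}

// Hypothetical host clock: accumulates deltas into a 64-bit total and only
// converts to floating point at the very end of the chain.
struct MsClock {
    std::uint32_t last;
    std::uint64_t totalMs;

    explicit MsClock(std::uint32_t now) : last(now), totalMs(0) {}

    void Advance(std::uint32_t now) {
        totalMs += ElapsedMs(last, now);
        last = now;
    }

    double Seconds() const { return static_cast<double>(totalMs) / 1000.0; }
};

// Reproduces the tick-over: the counter goes from 0xFFFFFFF0 to 0x00000010.
inline void WrapDemo() {
    assert(ElapsedMs(0xFFFFFFF0u, 0x00000010u) == 32u);
}
```

Feeding MsClock a start value just below the wrap point is a cheap way to do the "experimental setup" without touching the system clock.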

Thursday, January 7, 2010

HLSL Alpha Water Warp is Done

That was fun. The structures I had set up for the world make this kind of thing really easy and fast now.

Water surfaces are included in the correct back-to-front alpha sorted list along with every other alpha object, and there's a mini state change manager at work. It's quite a nice looking piece of code too.
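
As a rough sketch of what a sorted alpha list with a small state change manager might look like (the names AlphaKind, AlphaObject and DrawAlphaList are invented for illustration, and the real setup and takedown calls are reduced to a counter):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical object kinds that can appear in the alpha list.
enum class AlphaKind { Water, Particles, Sprite };

struct AlphaObject {
    AlphaKind kind;
    float     dist;   // distance from the view origin
};

// Sorts back to front, then walks the list. State is only changed when the
// kind of object differs from the previous one; the return value is the
// number of such changes. 'drawOrder' receives the kinds in draw order.
int DrawAlphaList(std::vector<AlphaObject> &list, std::vector<AlphaKind> &drawOrder) {
    std::stable_sort(list.begin(), list.end(),
        [](const AlphaObject &a, const AlphaObject &b) { return a.dist > b.dist; });

    int       stateChanges = 0;
    bool      haveState = false;
    AlphaKind current = AlphaKind::Water;

    for (const AlphaObject &obj : list) {
        if (!haveState || obj.kind != current) {
            // Real code would run the takedown for 'current' here,
            // followed by the setup for 'obj.kind'.
            current = obj.kind;
            haveState = true;
            ++stateChanges;
        }
        drawOrder.push_back(obj.kind);
    }
    assert(drawOrder.size() >= list.size());
    return stateChanges;
}
```

Two water surfaces that end up adjacent in the sorted list share one state setup; anything sorted in between them forces a takedown and a fresh setup.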

I also set up the state changes needed for the non-HLSL warp, but there's a bit of tidying up to be done first. For starters I've realised that my water alpha render path can be totally reused for regular non-alpha water, so I want to amalgamate the two. Secondly I have some unnecessary state changes and function params I need to remove. Thirdly, it doesn't integrate too well with particles, as the regular non-HLSL state change manager intercepts the setup of a particle batch following a HLSL water batch and discards the changes. That bit's easy, I just need to write a particle state takedown function that ensures the correct states get triggered (or add it to the water state takedown so it would apply to other object types, which seems more correct).

Overall I'm quite pleased with the extent to which I've brought things on. I always knew that as soon as I managed to get the old legacy code out completely things would start moving fairly fast.

Progress!

HLSL opaque water warps are now done in a far more optimal fashion than before, with correct vertex batching in place. Sprites have come back - my mistake, I had commented out the line of code that added them to the render. One less obscure bug to track down.

The non-HLSL water warp is semi-done. I haven't written surface subdivision yet so right now it just uses the unsubdivided verts.

I think I'm going to do the HLSL alpha water warp path next; I want to get the structure for handling this in place before I go tackling the non-HLSL path properly.

Once I get these out of the way the sole major outstanding items will be alpha alias and brush models. Both of these are going to be utterly trivial to set up: a different state for the alpha op in stage 0 and spit out the verts. In fact alpha alias models will require maybe 5 lines of extra code as the basic structure for alias models has been set up with making them easy in mind. Brush models will require a different codepath to that used for opaque, but that's OK as I had fully expected it.

The new render should be functionally complete then. I haven't even touched shadows yet though, but they're going to be easy enough; particles are in some need of optimization but I'm not certain of the best approach yet. A few tweaks, make sure it doesn't crash, and I'll hopefully be releasing the Alpha version at that time.

I'm declaring the main surface refresh "done"

I could easily spend between now and the time when Romero is asked to rejoin ID polishing it further, but this thing would never get released and I'm sick to death of that particular code right now. There are still 3 intermediate buffers in use between the time when R_RecursiveWorldNode spits out the surf and the time when the surf is drawn, but this is all required to make the eventual render light years more efficient.

The first two of these are fairly lightweight, just building up lists of pointers and letting things fall out into the correct order for rendering; the third is something of a monster and takes copies of all of the verts in all of the visible surfaces for actually rendering from. I could probably get rid of it, but doing so would require reworking the surface loader and maybe a few other things. In terms of speed and space efficiency it's not actually that bad (and the gains from rendering outweigh the losses from buffering) but I'd still like to get rid of the copy operation. If Direct3D let me do interleaved indexed vertex arrays without requiring a hardware VBO I would do it right now, but the cost of locking/filling/unlocking a hardware VBO each frame is far too high, even for just an index buffer.

It'll do for now.

Currently on the "done" or "almost done but just needs a light coat of polish" list are:

  • Sky.
  • World Surfaces.
  • Brush Model Surfaces.
  • Alias Models.
  • Particles.
  • Underwater Warp.

On the "not done" list are:

  • Water/Slime/Lava/Teleport Surfaces.
  • Alpha Models and Surfaces.
  • Sprites.

These are in various states of semi-completion. Water/etc surfaces were partially implemented on an older setup, but need complete reworking. Correct back-to-front alpha sorting for any object is actually done, and particles currently go through that path (using some clever tricks to avoid sort overhead from large numbers of particles). I just need to add the code for other object types here. Sprites are also effectively done but right now have a bug where they just don't show up in the final render. They did yesterday, so no doubt it's something simple (ha ha ha).

There's also the ongoing process of building back up the DirectQ 1.7 feature set, bugfixing, tweaks to other parts of the code, and so on.

I can't quite say that the end is in sight just yet, but I'm definitely sensing it over the horizon. It is however safe to say that this has been the most intense release I've ever worked on, and when I get it out I think I'm going to collapse for a month.

1.8 Latest Update

I've spent the past few days struggling to gracefully remove the legacy refresh code without disrupting things too much, but without success I'm afraid. In the end I had to fairly brutally gut the thing, and I'm now in the process of bringing it back up.

It's looking as though the multi-threaded path is going to have to go. I haven't dug too far into this yet, but it seems as though there is a wild pointer bug being thrown up by it, and I'm getting a 50/50 chance of a hard crash every time I start DirectQ. I'll probably put it down for future investigation in more detail; as I had noted in my original post it was a bit unexpected that it happened as soon as it did.

Things have collapsed to quite a rough and ready state as a result of all this, but I'm finally satisfied that at least the basic structure is moving in the right direction now and that I'll be able to pull it together more easily with the legacy stuff gone. I do need to sort out my data storage though as right now I'm pushing surfaces through a lot of intermediate buffers. Something like doing a batch allocation of all vertexes in a brushmodel pre-sorted by texture and lightmap, then storing indexes in the surf itself sounds about right.
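
A minimal sketch of that last idea, assuming a cut-down surface struct (Surface, Vertex and BuildModelVertexes are illustrative names, not the engine's own, and std::stable_sort is used here for brevity):

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

struct Vertex { float xyz[3]; float st[2]; float lm[2]; };

// Hypothetical surface holding only the fields this sketch needs.
struct Surface {
    int texture;                 // texture id
    int lightmap;                // lightmap id
    std::vector<Vertex> verts;   // as produced by the loader
    int firstVertex = 0;         // filled in below: offset into the shared array
    int numVertexes = 0;
};

// Copies every surface's verts into one contiguous array ordered by
// (texture, lightmap) and records each surface's range in the surface itself.
std::vector<Vertex> BuildModelVertexes(std::vector<Surface> &surfs) {
    std::vector<Surface *> order;
    for (Surface &s : surfs) order.push_back(&s);

    std::stable_sort(order.begin(), order.end(), [](const Surface *a, const Surface *b) {
        return std::tie(a->texture, a->lightmap) < std::tie(b->texture, b->lightmap);
    });

    std::vector<Vertex> all;
    for (Surface *s : order) {
        s->firstVertex = static_cast<int>(all.size());
        s->numVertexes = static_cast<int>(s->verts.size());
        all.insert(all.end(), s->verts.begin(), s->verts.end());
    }
    assert(order.size() == surfs.size());
    return all;
}
```

With this layout, surfaces sharing a texture and lightmap sit next to each other in the array, so a run of them can be drawn from one range without any per-frame copying.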

One good thing did happen which was that I successfully reintegrated two separate alias model renderers I had built up - one for batches and the other for individual models. It's nice to have a single path for both types here.

I've also ripped out a lot of my legacy matrix code, and hope to be able to completely drop my matrix classes before too long. They were fine when things were set up the old way, but getting them to integrate with the new way was proving troublesome.

Whoever designed the D3D matrix API must have done so in the same crack den that Carmack came up with the idea for instanced brush models in (although he at least backed away from it afterwards), and probably deserves a good slap in the face for it. The fact that there are no matrix handling API calls in basic D3D at all is quite shocking. D3DX has them, but it uses its own variant of the matrix struct and you can't mix the two seamlessly (although the data does line up). I'm just going to put them into a union and see what happens.

Wednesday, January 6, 2010

Memory system

The memory system has been reworked now; it still uses almost the exact same code as before, just tidied up a lot.

I've moved the various functions it used over to a C++ class, so that everything can be managed internally in a safer manner. The really cool thing about this is that now anytime I need an expandable memory pool I can just declare it and start using it. It will self-destruct when it goes out of scope, and I don't need to go near the heap.cpp or heap.h files to set it up (this was a major nuisance before).

I've changed the output from the "heap_report" command to actually say what the numbers signify, instead of using "highmark" and "lowmark". The old highmark figure was the amount of memory actually committed by DirectQ; what you would have used for your -heapsize. The old lowmark was how much of that is actually currently in use. The old "peak" figure remains and reflects the highest amount of memory that has ever been used for each pool.

Of course there's a lot more than that reserved for use - about 1 GB - but that doesn't impact on either physical memory use or other processes.
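
To make the reserved / committed / in use / peak distinction concrete, here is a portable sketch of a scope-bound pool. It is not the DirectQ class: MemoryPool and its members are invented names, and where a real version would presumably reserve address space and commit pages on demand, this one fakes it with a single up-front block and a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

class MemoryPool {
public:
    explicit MemoryPool(std::size_t reserveBytes, std::size_t pageBytes = 4096)
        : base(new unsigned char[reserveBytes]), reserved(reserveBytes), page(pageBytes) {
        assert(pageBytes > 0);
    }

    // Bump allocation: every block directly follows the previous one.
    void *Alloc(std::size_t bytes) {
        bytes = (bytes + 15) & ~std::size_t(15);        // sizes rounded up to 16
        if (inUse + bytes > reserved) return nullptr;   // reservation exhausted
        while (committed < inUse + bytes) committed += page;   // "commit" whole pages
        if (committed > reserved) committed = reserved;
        void *p = base.get() + inUse;
        inUse += bytes;
        if (inUse > peak) peak = inUse;
        return p;
    }

    // Releases everything at once; committed memory is kept for reuse.
    void Reset() { inUse = 0; }

    std::size_t Committed() const { return committed; }  // old "highmark"
    std::size_t InUse() const { return inUse; }           // old "lowmark"
    std::size_t Peak() const { return peak; }

private:
    std::unique_ptr<unsigned char[]> base;   // freed automatically at scope exit
    std::size_t reserved, page;
    std::size_t committed = 0, inUse = 0, peak = 0;
};
```

Declaring a MemoryPool as a local variable gives a working buffer that cleans itself up when the function returns, which is the "declare it and start using it" convenience described above.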

The cache and zone systems remain untouched, and I'm happy with the way I have those set up so I don't need to do anything there.

Updates

I've largely cleared out the work-in-progress gunk from both sky and alias models now. A few loose ends left hanging, but otherwise they're fine.

Before I do that I'm going to rewrite my memory system. I've been getting something of a pain in the face from setting up little temporary and working buffers all over the place, so the plan right now is to rework it in a more flexible manner. The limited number of fixed pools has become a limitation, so I'm going to be giving myself the ability to create and release new memory pools on demand and on the fly.

The difference between this and malloc and free - before you ask - is that the pools will be dynamically expandable and all allocations from a given pool will be guaranteed to be in consecutive memory.

They're still going to work largely as they did before, and I'm going to be keeping a few of the new pools around permanently, but overall it's just going to be such a relief to have this freedom.

Tuesday, January 5, 2010

Multithreaded DirectQ

I hadn't intended to do this so soon, but looking through the code in Host_Frame earlier on it struck me that sound was obviously candidate number 1 for multithreading: the S_Update function call from there runs in complete isolation and can be run in parallel with the renderer.

A quick read up on critical sections and some experimental code later and DirectQ now has its own sound thread. The really nice thing about sound is that even the calls from the menus are executed while the main host process has the critical section, so proper isolation was easy to achieve. I can now get rid of S_ExtraUpdate and all the baggage that goes with it, and I have a situation where if the renderer is running slowly it won't cause sound to stutter or break up.
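
For anyone curious what the shape of this looks like, here is a portable sketch using std::thread and std::mutex in place of a Win32 thread and critical section. Everything in it is an assumption for illustration: SoundSystem, SoundUpdate and StartSound are stand-ins, and the locking is done per call rather than per host frame as described above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// soundLock plays the role of the critical section and SoundUpdate the
// role of S_Update; the "mixing" is reduced to moving a counter.
struct SoundSystem {
    std::mutex        soundLock;
    std::atomic<bool> quit{false};
    int               updates = 0;        // guarded by soundLock
    int               queuedSounds = 0;   // guarded by soundLock
    int               mixedSounds = 0;    // guarded by soundLock

    void SoundUpdate() {                   // runs on the sound thread
        std::lock_guard<std::mutex> hold(soundLock);
        mixedSounds += queuedSounds;
        queuedSounds = 0;
        ++updates;
    }

    void StartSound() {                    // called from host or menu code
        std::lock_guard<std::mutex> hold(soundLock);
        ++queuedSounds;
    }

    void ThreadProc() {
        while (!quit.load()) {
            SoundUpdate();
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        SoundUpdate();                     // final flush before exiting
        assert(quit.load());
    }
};
```

The thread would be started with something like std::thread t(&SoundSystem::ThreadProc, &snd) and joined at shutdown after setting quit. Because the sound thread runs on its own schedule, a slow frame in the renderer no longer delays the next update.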

Alias Models are Done

I bit the bullet and did the batch submission of multiple alias models in a single call using vertex indexes thing. Unfortunately it didn't have the full desired effect in ne_tower; there is still one scene where framerates fall a good bit, which is when you're right at the bottom of the tower (after the start passage) looking up and around. Even so, I only drop from 70-odd to 35-40, which is not that bad considering the huge overdraw in this scene, and also considering that I was getting 23 or so a few days back. I'd still like 50 though. (For the record, I've seen FitzQuake drop below 10 with this scene, on the same hardware. Not to knock FitzQuake, but to give an idea of the scale of performance increase I'm getting here).

What I'm really proud of with this map is the work done to get the huge cogs drawing efficiently. These were a bitch, and just looking at them murdered other engines I tested in. I lock at 72. (Huge cheesy grin.)

Of course the firefight at the end of that map is going to be totally silly, but I haven't really gotten round to stress-testing the engine with that yet. What I have done is flown around the general area using noclip/god/notarget, and confirmed that framerates also lock at 72 here.

What's really outstanding now is to go back and begin with sky, then work up through the draw order cleaning out a lot of the experimental/temp code I had written while building this up. There will also be bugs to be fixed and possibly more efficient ways of doing some things to be implemented (especially sky and water). I need to implement r_hlsl 0 water too, then take a good look at particles and see can I do anything better with them. I'll probably be writing about my woes with particles by early next week.

Bringing on the rest of the 1.7 feature set also still needs to be done, as well as some other items not specifically related to the renderer that I've been putting off while working on this. I'll write about those when I come to do them as I have some plans for a few things.

However, right now I'm thinking that I'll probably release the first alpha version incomplete, without those last items. Exactly how incomplete it's going to be remains to be seen, but I'll reveal all when the time comes (translation: I don't know yet).

I've been dangling this in front of your faces for long enough now, so as soon as I'm happy that it's not going to crash on you, and that it can at least do most of what GLQuake does, I'll put it out. Expect news on how soon that's going to be over the next week or two.

As I've said before, this will be followed by the beta, then 1.8.0 final. I have specific intentions for each of these testing stages that I'll also reveal when the time comes.

Phew! Surfaces!

After a few hours of fairly painstaking work I've successfully managed to get all surfaces chained together ordered by texture from front to back and including all bmodels, both inline and instanced, at the end of each texture chain. I've also been able to remove one of those qsorts.

This represents a pretty damn major integration of the surface renderer as I now have everything going through mostly the same path and organised in what is probably the most optimal fashion. The only real performance penalty here would be if a complex bmodel is drawn before a simple world surface (that uses a different texture occurring later in the list of textures) that would occlude it fully, but that requires greater knowledge of the entire structure, including the relationship of every object to every other one.
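
The chaining itself can be sketched like this, assuming the world surfaces already arrive in front-to-back order from the BSP walk (Surf and BuildTextureChains are illustrative names, and std::vector stands in for the real linked chains):

```cpp
#include <cassert>
#include <vector>

struct Surf {
    int  texture;    // index into the texture list
    bool isBmodel;   // true for inline or instanced brush model surfaces
    int  id;         // only used to show the resulting order
};

// Builds one chain per texture. World surfaces are appended first, which
// keeps their front-to-back order; brush model surfaces are then appended
// to the end of the chain for their texture. No sort is needed.
std::vector<std::vector<Surf>> BuildTextureChains(
    int numTextures,
    const std::vector<Surf> &worldFrontToBack,
    const std::vector<Surf> &bmodelSurfs)
{
    std::vector<std::vector<Surf>> chains(numTextures);

    for (const Surf &s : worldFrontToBack) {
        assert(s.texture >= 0 && s.texture < numTextures);
        chains[s.texture].push_back(s);
    }
    for (const Surf &s : bmodelSurfs) {
        assert(s.texture >= 0 && s.texture < numTextures);
        chains[s.texture].push_back(s);
    }
    return chains;
}
```

Drawing then walks the chains texture by texture, so each texture is bound once for the world and all brush models together.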

Enough is enough I say.

To celebrate here's the official one-and-only first ever DirectQ 1.8.0 screenshot.



The only thing that really looks a big deal here is the game name and map name in the title bar (a nice touch I recently added), and even that is nothing overly special. But this is a significant screenshot as it's showing the engine rendering everything normally through the new paths.

The next step is to start removing that huge C++ class I talked about earlier. I'm probably going to end up keeping some fairly small parts of it, just the parts that manage buffer memory, but the state manager and command interpreter are going to die.

This required surfaces to be completed as full removal of it was heavily dependent on being able to chain them properly.

I've kept a copy of it for posterity, and - as threatened - I might even release it sometime. But right now the sooner I can get away from having to look at it ever again, the happier I'll be.

Monday, January 4, 2010

More fun with Alias Models/Speculative stuff

Indexing the alias models has been a substantial improvement and I'm now getting framerates in the 40s in scenes in ne_tower that used to drop into the 20s. Right now I've dropped back to drawing each model individually (just to keep the indexing step simpler), so the next step is to batch them all up together again. This should see more improvements, and I think at that point I'll have taken alias models as far as I'm going to for this release.

I'm going to need to abandon chained linked lists here as they're just not working out for me; crashes still happen with fundamental stuff like running screenshots. Instead I'll either pop them into a flat array and qsort that, or maybe even just qsort the visedicts list. (I'm getting a bit concerned about how many times I'm using qsort per frame though, but there are other areas I use it where it can be removed: alpha sorting can be done via the BSP tree, for example, and instanced model surfs can be linked in clever ways into the texture chains; I'll wait and see if it becomes a problem).
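
The flat array approach might look something like the sketch below. It uses std::sort rather than qsort, and Entity, Model, ModelBatch and BatchByModel are simplified stand-ins rather than the engine's real structures.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

struct Model  { int numTris; };
struct Entity { const Model *model; int id; };

struct ModelBatch {
    const Model *model;
    int firstEntity;   // index into the sorted entity array
    int numEntities;
};

// Sorts a flat array of visible entities by model pointer, then returns one
// batch per run of entities that share a model. No per-model chain pointers
// are kept, so there is nothing to go stale between frames.
std::vector<ModelBatch> BatchByModel(std::vector<Entity> &visible) {
    std::sort(visible.begin(), visible.end(),
        [](const Entity &a, const Entity &b) {
            return std::less<const Model *>()(a.model, b.model);
        });

    std::vector<ModelBatch> batches;
    for (int i = 0; i < static_cast<int>(visible.size()); ++i) {
        if (batches.empty() || batches.back().model != visible[i].model)
            batches.push_back(ModelBatch{visible[i].model, i, 0});
        ++batches.back().numEntities;
    }
    assert(batches.size() <= visible.size());
    return batches;
}
```

Each batch can then be submitted in one call, and the array is simply rebuilt from scratch every frame.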

Before I do any of this, and especially before I press ahead with the remaining items in the render, I need to go back and clean out my handling of sky and solid surfaces. There's quite a huge C++ class managing them at the moment (I might release it just for laughs some day; say in about 2064...) which I had written to simplify adding arbitrary surface types at arbitrary stages of the render setup. Now that I've gotten that out of my system (like I said I would a few days ago) I'm going to simplify the whole setup and tune it for the specific requirements of these surface types. This should see some more improvements, and potentially speeds getting back to close to 1.7 levels for ID1 timedemos.

While tracking down the SCR_UpdateScreen crashes I've made some other small changes. Instead of confronting people with the console when starting up, DirectQ now has a loading screen with a progress meter showing how far into startup it is. The console is still there, and you can still use it exactly the same way as before, but I think it's nicer not to dump new users directly into a command-line interface (yes, there are new Quake users), especially in 2010.

I think I'm going to remove "exec quake.rc" and replace it with default/config/autoexec, then bring us to the main menu at startup. I think this is probably preferable to the Necropolis demo, which everyone is probably sick of by now. Having done that I'll then replace writing config.cfg with writing a "directq.cfg" instead, so that DirectQ will play nicer when living in the same directory as other engines. The directq.cfg will also then be included in the startup order, and config.cfg will only be used if directq.cfg is not found.
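
The proposed startup order could be expressed roughly as below. This is a sketch of the plan as described, nothing more: StartupConfigs is a hypothetical helper, and the file-exists check is passed in by the caller so the example needs no file I/O.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Returns the config files to execute at startup, in order.
std::vector<std::string> StartupConfigs(
    const std::function<bool(const std::string &)> &exists)
{
    std::vector<std::string> order;
    order.push_back("default.cfg");

    // Prefer the engine-specific config; fall back to the shared one.
    if (exists("directq.cfg"))
        order.push_back("directq.cfg");
    else if (exists("config.cfg"))
        order.push_back("config.cfg");

    if (exists("autoexec.cfg"))
        order.push_back("autoexec.cfg");

    assert(!order.empty());
    return order;
}
```

On a fresh install that has only ever run other engines, this picks up the existing config.cfg once; after DirectQ writes its own directq.cfg, the shared file is left alone.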

This is fairly speculative stuff right now, so I'm interested in opinions and feedback. I'll note however that other popular engines also do the "no demo loop" thing, and use of custom cfg files is standard for QuakeII ports, so precedent is set and my conscience will be clear.

Fun with Alias Models

I can now draw multiple entities sharing the same alias model in a single API call; up to several hundred depending on the complexity of the model. One weird side effect here is that certain calls to SCR_UpdateScreen seem a little aberrant in that they cause an infinite chain of entities linked to each model to be generated. I say "weird" because I'm definitely clearing down pointers both immediately before and immediately after building and drawing the chain. I think what's happening here is that either an entity slot or a model slot that had previously been used is being reused on the client and is getting a stray pointer in the chain pointer. I'm going to tighten up that aspect of it.

An advantage of discovering this bug however is that it's also forcing me to tackle some of the uses of SCR_UpdateScreen in the engine. They were understandable enough for a software renderer, but with a hardware renderer it's sufficient to handle something like scr_drawdialog by just outputting the relevant stuff to the screen and swapping buffers without clearing and without doing a full update.

I'm going to need to index the alias models. Right now the geometry submission is just too high and I'm not seeing the gains I had expected from batching multiple models. Ultimately I'm thinking I might need to put them into a hardware VBO and do vertex blending, but I'm reluctant to go down that route just yet owing to potential lack of hardware support (although D3D software vertex shaders might be an option here).

I have exploratory code written to generate the indexes, and seem to be able to reduce the submission to somewhere between a quarter and a half of what it was, so it should be worthwhile.
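
In outline, generating the indexes means collapsing duplicate vertexes in the unindexed triangle list. The sketch below shows one straightforward way to do that; it is not the exploratory code mentioned above, and AliasVert, IndexedMesh and BuildIndexes are assumed names with a deliberately simplified vertex.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical render vertex: a pose vertex number plus the texture
// coordinates it is drawn with.
struct AliasVert {
    int   poseVertex;
    float s, t;
    bool operator<(const AliasVert &o) const {
        return std::tie(poseVertex, s, t) < std::tie(o.poseVertex, o.s, o.t);
    }
};

struct IndexedMesh {
    std::vector<AliasVert>     verts;     // unique vertexes only
    std::vector<std::uint16_t> indexes;   // three per triangle
};

// Collapses an unindexed triangle list (three verts per triangle) into a
// unique vertex array plus an index list.
IndexedMesh BuildIndexes(const std::vector<AliasVert> &triangleVerts) {
    assert(triangleVerts.size() % 3 == 0);
    IndexedMesh mesh;
    std::map<AliasVert, std::uint16_t> seen;

    for (const AliasVert &v : triangleVerts) {
        auto found = seen.find(v);
        if (found == seen.end()) {
            assert(mesh.verts.size() < 65536);   // 16-bit index limit
            found = seen.emplace(v, static_cast<std::uint16_t>(mesh.verts.size())).first;
            mesh.verts.push_back(v);
        }
        mesh.indexes.push_back(found->second);
    }
    return mesh;
}
```

Two triangles forming a quad go from six submitted vertexes to four plus six small indexes; on a closed mesh, where most vertexes are shared by several triangles, the saving is larger.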

More optimizations in

I've addressed one problem with the ne_tower map, which is its use of instanced brushmodels with huge polycounts for cogs. These were something of an engine choker, so I'm happy to be able to say that I've managed to merge them all together and handle them quite efficiently. Eventually I'm going to merge them into the world render and spit them out from there, which should give another slight boost.

The primary issue with this map however is alias models, which drag it down pretty bad. The next step is to accumulate all entities which share the same model and draw them all in one pass; this should be quite an effective solution which will have benefit for everything else.

On the ctf1rq I'm now locking at a steady 72 FPS (DirectQ reports it as 71 owing to a rounding error in its counter) while running around the central arena. That one's a wrap.

The big problem with marcher seems to be water; in terms of polycount and complexity it doesn't seem to be as extreme as ne_tower, although I have yet to confirm this. I've already noted that my current implementation of water is slow, so when I start working on that I should see some benefit.

Sunday, January 3, 2010

Basic GLQuake functionality is DONE!

This is a bit of a milestone; the basic GLQuake functionality is now in some ramshackle form of completion and I can play games and run most maps with everything visible on screen and correctly setup.

There is still a way to go. Water/Lava/Slime/Teleports are currently only supported via the HLSL path, so I need to implement surface subdivision and the non-HLSL path. Quite a lot of DirectQ 1.7 features are still missing: underwater warps, shadows and variable lightmap intensities (for different values of r_overbright) spring to mind most immediately.

Before I tackle those there are some performance issues I need to address. I've come across scenes in some maps where framerates drop into the 20s, and this is unacceptable. Switching r_drawentities to 0 resolves it all. Water is also quite slow (by comparison) right now (I just shoved in surfaces without being bothered about ordering), and state changes in the alpha sort path can be a little overwhelming. There are definite fixes for these which I am aware of, but which I had been hoping to put off until at least the following release.

Unfortunately they have become a problem right now, so hey ho, here we go.

DirectQ 1.8 Speeds Explained

OK, this might cause some controversy so here's the deal upfront.

If you are using timedemos of ID1 demos as benchmarks, 1.8 might report that it's running slower than 1.7 was.

I'm not going to sugar the pill: the most likely explanation is going to be because it is. The dropoff won't be too dramatic, on the order of 5% to 10% in an ID1 timedemo. In regular gameplay when you're locked to a maximum of 72 FPS you won't notice a thing.

The reason for this is that - in order to achieve better geometry batching - I need to build up a database of all geometry in the map and other models which is sorted and otherwise organised in a certain fashion. Doing this requires overhead, mainly on the CPU and memory, and in more trivial cases (like ID1 maps) the loss from the overhead may outweigh the gains from the rendering optimizations.

So if you drop from 260 FPS to 235 FPS on "timedemo demo1" don't come running to me hollering about it, OK? ;)

The places where you are going to see huge gains are maps like ne_tower, marcher, masque and ctfrq1. These are all cases where the older style "classic Quake" renderers get choked, and 1.8 will give them a good liberal dose of WD40 and get them moving smoothly.

You are still going to be able to find some strange places in these maps where 1.8 gets choked.

Much of the optimization comes from being able to efficiently batch together surfaces that share the same texture and lightmap. It's perfectly possible to construct a scene that breaks this, and lots of brushmodels each of which uses many different textures will be one such scenario. These scenes will choke any other engine too, and the worst case is a fallback to a similar level of efficiency (or lack thereof) as 1.7 had.

So likewise if you find such a scene don't come hollering to me about it.

This engine won't be able to make a slow 3D card into a fast one.

If, taking these factors into account, you still find that you're getting sub-20 framerates most of the time, it might just be the case that your 3D card is plain-old-fashioned slow. No amount of clever coding techniques can make a slow card into a fast one, that's hardware baby; all that I can do is help it make better use of what resources it has (and even then I might not succeed all of the time).

For the record, the primary limiting factor in 1.8 is probably going to be bandwidth. Because I'm submitting large batches of triangles at a time, the more bandwidth you have the better you will like it. However, note that even a 4 year old Integrated Intel has sufficient bandwidth to cope with most scenes here so the majority of people should be fine.

Even when I release 1.8 I won't be "finished".

Yes, there will still be more room for optimization. There are 2 or 3 things I am aware of right now that I could do to improve performance, but I'm probably not going to do them for at least the first release of 1.8 as it would mean delays in getting it out.

Saturday, January 2, 2010

DirectQ 1.8 Update

Right now I'm mucking around with the structure of the renderer. Actual drawing, batching and sorting is already a solved problem, although I would probably prefer to have my alias models indexed to take advantage of vertex reuse, but I don't know yet if that's going to make it.

I think I'm on my 6th incarnation at the moment; ideally where I want to be is to have everything stored in a single list that I can read and draw from at any time. There have been some weird implementations, including chained lists (this one just would not die and bits of it persisted across 2 or 3 subsequent incarnations) and a fairly strange interpreted bytecode dialect.

I'd say that a good chunk of the difficulties here involve properly handling sky and water surfaces in instanced brushmodels. I don't even know if many mods actually even use them, but all the same it's something that GLQuake supported so I feel that I should too.

I need to resist the temptation to construct it as a generic scenegraph program, which is all too strong, and which I know is at least some of my motivation for what I'm doing, but it's something I feel that I need to get out of my system all the same. The final product looks like it's going to meet somewhere in between that and GLQuake's naive "just draw everything as it passes" approach.


Update:

Just been building what I think is the 7th incarnation of the whole rendering structure, and I think I'm pretty happy with what I have now. It's reasonably generic (so far as Quake is concerned) and flexible enough to handle newer map and model features with relatively little pain.

So far I just have brush models (including the world) and alias models plugged into it, but as soon as I include sky (which is next, and needs HLSL support) things should progress quickly enough.

Thinking over it again, another motivation here was the sheer frustration I had implementing alpha brushmodel support on the 1.7 line, and the desire to avoid that again. This time I think I'm gonna be there nicely, so all looks well. I've said before that I loathed the setup of that old renderer, so it's nice to wave it goodbye.

I think that when release time draws near I might do 3 releases of 1.8: alpha, beta and final. There is just such a dramatic upheaval of the renderer happening that I wouldn't be happy giving people the impression that the first release was in any way guaranteed to be utterly stable.