Inside3D!
     

glTexSubImage2D Performance

 
Post new topic   Reply to topic    Inside3d Forums Forum Index -> OpenGL Programming
View previous topic :: View next topic  
Author Message
mh



Joined: 12 Jan 2008
Posts: 909

PostPosted: Fri Jul 23, 2010 3:24 pm    Post subject: glTexSubImage2D Performance Reply with quote

I've spent days beating on this one in the RMQ engine, and now you get to share the results. Very Happy

Several of the maps have a very heavy dynamic/animated light content, with almost every surface in a scene having some kind of dynamic lighting on it, so the performance of dynamic light updates is utterly critical to overall performance of the engine. Most of the time it was perfectly fine, but there was one machine I tested on where things ground down to 10 FPS.

After trying various workarounds (PBOs, updating per-surface vs bulk updating), I finally did what I should have done in the first place and ground out some code to measure actual timings inside glTexSubImage2D. The results were quite surprising.

GLQuake by default uses internal format GL_RGBA, format GL_RGBA and type GL_UNSIGNED_BYTE. However, the fastest mode in all cases (even on machines that didn't slow down) turned out to actually be GL_RGBA/GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV. This needs OpenGL 1.2 support so I guess 3DFX owners are excluded, but everyone else can join the party.

Measurements:
Code:
mode: 0  320ms [GL_RGBA/GL_UNSIGNED_BYTE] (OK)
mode: 1  317ms [GL_BGRA/GL_UNSIGNED_BYTE] (OK)
mode: 2  377ms [GL_RGBA/GL_UNSIGNED_INT_8_8_8_8] (OK)
mode: 3  375ms [GL_BGRA/GL_UNSIGNED_INT_8_8_8_8] (OK)
mode: 4  376ms [GL_RGBA/GL_UNSIGNED_INT_8_8_8_8_REV] (OK)
mode: 5   12ms [GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV] (OK)

Sample test app (SDL, mostly portable):
Code:

#define WINDOW_WIDTH   800
#define WINDOW_HEIGHT   600

#include "SDL.h"
#include "SDL_opengl.h"

#pragma comment (lib, "SDL.lib")
#pragma comment (lib, "SDLmain.lib")
#pragma comment (lib, "opengl32.lib")

#define TEX_WIDTH 512
#define TEX_HEIGHT 512

unsigned int sibuffer[TEX_WIDTH * TEX_HEIGHT];
unsigned int teximage = 0;
unsigned int framecount = 0;

// find the fastest modes to use for glTexSubImage2D
typedef struct tsitest_s
{
   char formatstr[64];
   char typestr[64];
   GLenum format;
   GLenum type;
   int modespeed;
   bool failed;
} tsitest_t;

tsitest_t tsimodes[] =
{
   {"GL_RGBA", "GL_UNSIGNED_BYTE", GL_RGBA, GL_UNSIGNED_BYTE, 666, true},
   {"GL_BGRA", "GL_UNSIGNED_BYTE", GL_BGRA, GL_UNSIGNED_BYTE, 666, true},
   {"GL_RGBA", "GL_UNSIGNED_INT_8_8_8_8", GL_RGBA, GL_UNSIGNED_INT_8_8_8_8, 666, true},
   {"GL_BGRA", "GL_UNSIGNED_INT_8_8_8_8", GL_BGRA, GL_UNSIGNED_INT_8_8_8_8, 666, true},
   {"GL_RGBA", "GL_UNSIGNED_INT_8_8_8_8_REV", GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, 666, true},
   {"GL_BGRA", "GL_UNSIGNED_INT_8_8_8_8_REV", GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, 666, true}
};


int fastest = 666;
int bestspeed = 32768;

GLuint R_MakeMeATexture (int width, int height, GLenum format, GLenum type)
{
   GLuint texnum = 0;

   glEnable (GL_TEXTURE_2D);
   glGenTextures (1, &texnum);
   glBindTexture (GL_TEXTURE_2D, texnum);
   glTexParameteri (GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
   glTexParameteri (GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
   glTexImage2D (GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, format, type, NULL);

   // commit the buffer so that timings are valid
   glFinish ();

   return texnum;
}


void R_SetTSIMode (void)
{
   int numtsimodes = sizeof (tsimodes) / sizeof (tsitest_t);

   for (int i = 0; i < numtsimodes; i++)
   {
      // clear last the error (if any)
      glGetError ();

      // create a new texture object
      GLuint texnum = R_MakeMeATexture (TEX_WIDTH, TEX_HEIGHT, tsimodes[i].format, tsimodes[i].type);

      Uint32 start = SDL_GetTicks ();

      // SDL_GetTicks has insufficient resolution to measure one call so we need to run a few of them
      for (int t = 0; t < 16; t++)
         glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, TEX_WIDTH, TEX_HEIGHT, tsimodes[i].format, tsimodes[i].type, sibuffer);

      Uint32 end = SDL_GetTicks ();

      // commit the buffer so that timings are valid
      glFinish ();

      glDeleteTextures (1, &texnum);

      if (glGetError () != GL_NO_ERROR)
         tsimodes[i].failed = true;
      else tsimodes[i].failed = false;

      tsimodes[i].modespeed = (end - start);
   }

   for (int i = 0; i < numtsimodes; i++)
   {
      printf ("mode: %i %4ims [%s/%s] (%s)\n", i, tsimodes[i].modespeed, tsimodes[i].formatstr,
         tsimodes[i].typestr, tsimodes[i].failed ? "FAILED" : "OK");

      if (tsimodes[i].modespeed <= bestspeed && !tsimodes[i].failed)
      {
         bestspeed = tsimodes[i].modespeed;
         fastest = i;
      }
   }

   if (fastest == 666)
   {
      MessageBox (NULL, "Failed to find a format!", "Error", MB_OK | MB_ICONSTOP);
      exit (0);
   }
}


void RenderOpenGL (void)
{
   framecount++;

   for (int i = 0, w = 0; w < TEX_WIDTH; w++)
   {
      for (int h = 0; h < TEX_HEIGHT; h++, i++)
      {
         unsigned char *rgba = (unsigned char *) &sibuffer[i];

         // 2 == red, 1 == green, 0 == blue
         rgba[2] = ((h * i) + framecount) & 255;
         rgba[1] = ((w * h) + framecount) & 255;
         rgba[0] = ((w * i) + framecount) & 255;
         rgba[3] = 255;
      }
   }

   glClear (GL_COLOR_BUFFER_BIT);

   glViewport (0, 0, WINDOW_WIDTH, WINDOW_HEIGHT);

   glMatrixMode (GL_MODELVIEW);
   glLoadIdentity ();

   glMatrixMode (GL_PROJECTION);
   glLoadIdentity ();
   glOrtho (0, WINDOW_WIDTH, WINDOW_HEIGHT, 0, -99999, 99999);

   glBindTexture (GL_TEXTURE_2D, teximage);
   glTexEnvi (GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

   glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, TEX_WIDTH, TEX_HEIGHT, tsimodes[fastest].format, tsimodes[fastest].type, sibuffer);

   glBegin (GL_QUADS);

   glTexCoord2f (0, 0);
   glVertex2f (0, 0);

   glTexCoord2f (1, 0);
   glVertex2f (TEX_WIDTH, 0);

   glTexCoord2f (1, 1);
   glVertex2f (TEX_WIDTH, TEX_HEIGHT);

   glTexCoord2f (0, 1);
   glVertex2f (0, TEX_HEIGHT);

   glEnd ();
}


int main (int argc, char *argv[])
{
   if (SDL_Init (SDL_INIT_VIDEO | SDL_INIT_NOPARACHUTE) != 0)
   {
      printf ("Unable to initialize SDL: %s\n", SDL_GetError ());
      return 1;
   }

   SDL_GL_SetAttribute (SDL_GL_DOUBLEBUFFER, 1);
   SDL_Surface *screen = SDL_SetVideoMode (WINDOW_WIDTH, WINDOW_HEIGHT, 32, SDL_OPENGL);

   R_SetTSIMode ();
   teximage = R_MakeMeATexture (TEX_WIDTH, TEX_HEIGHT, tsimodes[fastest].format, tsimodes[fastest].type);
   glClearColor (0, 0, 0, 1);

   int done = 0;
   SDL_Event evt;

   while (!done)
   {
      while (!done && SDL_PollEvent (&evt))
      {
         if (evt.type == SDL_QUIT)
         {
            done = 1;
            break;
         }
      }

      // run the screen update here
      RenderOpenGL ();
      SDL_GL_SwapBuffers ();
   }

   return 0;
}

Compile and run, then when you get the timings out entertain yourself with the pretty psychedelic picture.

It's definitely worthwhile including something like this in your video startup code and adjusting your formats and types.
_________________
DirectQ Engine - New release 1.8.666a, 9th August 2010
MHQuake Blog (General)
Direct3D 8 Quake Engines
Back to top
View user's profile Send private message Visit poster's website
mh



Joined: 12 Jan 2008
Posts: 909

PostPosted: Sat Jul 24, 2010 12:37 am    Post subject: Reply with quote

I should add that the timing figures above are for my aberrant driver. Don't expect 30x performance on all hardware, although 2x/3x (lightmap update speed, not overall) is not unreasonable.

Also note that this was time explicitly spent inside glTexSubImage2D, and totally isolated to that function call. I suspect that the driver was pulling the teximage data back to system memory in order to do the update, but we'll probably never know for certain.
_________________
DirectQ Engine - New release 1.8.666a, 9th August 2010
MHQuake Blog (General)
Direct3D 8 Quake Engines
Back to top
View user's profile Send private message Visit poster's website
Spike



Joined: 05 Nov 2004
Posts: 944
Location: UK

PostPosted: Sun Jul 25, 2010 12:14 pm    Post subject: Reply with quote

Quote:

mode: 0 91ms [GL_RGBA/GL_UNSIGNED_BYTE] (OK)
mode: 1 32ms [GL_BGRA/GL_UNSIGNED_BYTE] (OK)
mode: 2 38ms [GL_RGBA/GL_UNSIGNED_INT_8_8_8_8] (OK)
mode: 3 46ms [GL_BGRA/GL_UNSIGNED_INT_8_8_8_8] (OK)
mode: 4 22ms [GL_RGBA/GL_UNSIGNED_INT_8_8_8_8_REV] (OK)
mode: 5 19ms [GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV] (OK)

I guess that's the advantage of not using intel. :P

But yeah, its fairly well known that GPUs favour BGR byte ordering.

I don't understand the big difference between unsigned_byte and int_8_8_8_8_* though.
hmm

Quote:

mode: 0 441ms [GL_RGBA/GL_UNSIGNED_BYTE] (OK)
mode: 1 289ms [GL_RGBA/GL_UNSIGNED_BYTE] (OK)
mode: 2 286ms [GL_BGRA/GL_UNSIGNED_BYTE] (OK)
mode: 3 287ms [GL_RGBA/GL_UNSIGNED_INT_8_8_8_8] (OK)
mode: 4 313ms [GL_BGRA/GL_UNSIGNED_INT_8_8_8_8] (OK)
mode: 5 277ms [GL_RGBA/GL_UNSIGNED_INT_8_8_8_8_REV] (OK)
mode: 6 277ms [GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV] (OK)

new mode :P
also increased the loop count.

Strange that the first upload is so much slower, and even more so when uploaded 256 times.


But yeah, GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV is always the fastest on any pc/ppc hardware, or at least equal to the fastest.
Running it multiple times, there's a fair amount of variation.
_________________
What's a signature?
Back to top
View user's profile Send private message Visit poster's website
mh



Joined: 12 Jan 2008
Posts: 909

PostPosted: Sun Jul 25, 2010 2:04 pm    Post subject: Reply with quote

I'm quite stumped at that one too. I'd never really even heard of the type before and the OpenGL docs aren't exactly shouting about it from the rooftops, but yet it dates back to OpenGL 1.2 so support for it should be pretty much ubiquitous.

A good bit faster than GL_RGBA/GL_UNSIGNED_BYTE even on non-Intel hardware and I guess that the 3ms difference with GL_RGBA/GL_UNSIGNED_INT_8_8_8_8_REV is the overhead of the swizzling.

The first upload being slower would likely be explained by the driver doing a lazy texture bind, so it would be interesting to adjust the code a little to more accurately reflect Quake lightmap uploads, say by using 16 different textures. I'm guessing that the lazy bind would turn out to be most accurately reflective of the Real World under those circumstances.

The only reasonable explanation I can come up with for GL_UNSIGNED_INT_8_8_8_8_REV performance is that it's transferring 32 bits at a time, whereas GL_UNSIGNED_BYTE maybe transfers 8 bits at a time? I would have thought that drivers should show some more intelligence there, recognise an RGBA or BGRA format, and adjust accordingly, but it seems not so.
_________________
DirectQ Engine - New release 1.8.666a, 9th August 2010
MHQuake Blog (General)
Direct3D 8 Quake Engines
Back to top
View user's profile Send private message Visit poster's website
r00k



Joined: 13 Nov 2004
Posts: 483

PostPosted: Sun Jul 25, 2010 10:24 pm    Post subject: Reply with quote

Using BGRA instead I obviously get all blue tinted textures, without manipulating more code, would using RGBA and GL_UNSIGNED_INT_8_8_8_8_REV really be that much slower?
Oddly if I use lit files ON i get > 2000fps in a timerefresh VS. ~ 1300 without. I know timerefresh is just a lazy test, but hey.
Back to top
View user's profile Send private message
Spike



Joined: 05 Nov 2004
Posts: 944
Location: UK

PostPosted: Sun Jul 25, 2010 10:58 pm    Post subject: Reply with quote

for me, RGBA + GL_UNSIGNED_INT_8_8_8_8_REV was nearly the same speed as BGRA (task schedules meant that it could sometimes be faster).
For mh, his drivers suck on everything but those that are made super-easy.
As always, it depends upon your drivers. Its just that BGRA always gives the driver less work. Whether it can get the card to do it all automatically is a different matter.
Depends on your drivers.
_________________
What's a signature?
Back to top
View user's profile Send private message Visit poster's website
mh



Joined: 12 Jan 2008
Posts: 909

PostPosted: Sun Jul 25, 2010 11:50 pm    Post subject: Reply with quote

Just swap R and B in your lightmap builder. Very Happy

Yeah, it depends on drivers. The swizzing operation might be effectively free on some hardware, so you may not have to worry about it, other drivers might be intelligent enough to recognize a 32 bit format and transfer it through a fast path, the particular driver I was testing on, I reckon it was pulling everything back to system memory in order to do the update with most combinations.

The native GPU format will always be the best, all other factors being equal. Testing different formats and picking the fastest at startup time seems quite sensible to me; adjusting the palette and lightmap builder to match isn't really that much overhead.

I wouldn't trust timerefresh as a benchmark, it writes to the frontbuffer (which is eeeeeeeeevil on modern hardware) so there are a whole other set of variables at work there. A good timedemo with lots of flashing lights but not much else in the way of stress on the renderer would be more representative and would isolate the performance difference better. Run it with -nomtex and r_lightmap 1 for the full effect. Try ID1 demo1 or demo3 with entities and particles disabled, for example.

It would be interesting to get timings for this on ATI, or has that already been done? Being designed more around the D3D spec I would expect that they would behave more similarly to Intel (although obviously not in the same league as my shitty driver).
_________________
DirectQ Engine - New release 1.8.666a, 9th August 2010
MHQuake Blog (General)
Direct3D 8 Quake Engines
Back to top
View user's profile Send private message Visit poster's website
Spike



Joined: 05 Nov 2004
Posts: 944
Location: UK

PostPosted: Mon Jul 26, 2010 12:16 am    Post subject: Reply with quote

GL_VENDOR: ATI Technologies Inc.
GL_RENDERER: ATI Radeon HD 3200 Graphics
GL_VERSION: 3.3.9836 Compatibility Profile Context
GL_EXTENSIONS: 164 extensions

tis a laptop. that's my excuse for low FPS, and I'm sticking to it.
As you saw earlier, the swizzling is sufficient that it doesn't really matter.

timerefresh is effective enough, but you do have to bear in mind that its static, non-moving, and has a fixed time (read: no pvs changes so no cache flushes, no lightstyle animations, so at most one lightmap update).

I seriously doubt that uploading as RGBA will ever be faster than BGRA, unless other things on the system interrupt and skew your app's perspective of time. I'd also be surprised if your drivers didn't support it.
_________________
What's a signature?
Back to top
View user's profile Send private message Visit poster's website
mh



Joined: 12 Jan 2008
Posts: 909

PostPosted: Mon Jul 26, 2010 12:35 am    Post subject: Reply with quote

Spike wrote:
I'd also be surprised if your drivers didn't support it.
Agreed. BGRA dates back to OpenGL 1.2 which is 1998 technology. It's also native to D3D (although D3D calls it ARGB, but it's in BGRA order - go figure (hmmmm - 8_8_8_8_REV... Idea )) which doesn't even have a 32-bit RGBA format, so concerns that it's in any way "new", "edgy" or "dangerous" should be out the window. It would be even more reasonable to be susprised that drivers still supported RGBA these days. (Of course OpenGL lets them emulate it in software, as I found out. Yes, I'm sore about it, and no, I don't care how often I say so. Twisted Evil )

Unless one is massively concerned about maintaining 3DFX compatibility there's no reason not to use it.
_________________
DirectQ Engine - New release 1.8.666a, 9th August 2010
MHQuake Blog (General)
Direct3D 8 Quake Engines
Back to top
View user's profile Send private message Visit poster's website
Spike



Joined: 05 Nov 2004
Posts: 944
Location: UK

PostPosted: Mon Jul 26, 2010 12:56 am    Post subject: Reply with quote

I'd be surprised if 3dfx didn't support it. even microsoft support it and they don't support anything!

okay, so the 8_8_8_8_rev thing might not be, but GL_BGRA_EXT... its gonna be supported.
_________________
What's a signature?
Back to top
View user's profile Send private message Visit poster's website
mh



Joined: 12 Jan 2008
Posts: 909

PostPosted: Mon Jul 26, 2010 12:59 am    Post subject: Reply with quote

Spike wrote:
I'd be surprised if 3dfx didn't support it. even microsoft support it and they don't support anything!

okay, so the 8_8_8_8_rev thing might not be, but GL_BGRA_EXT... its gonna be supported.

That's in the 1.2 #defines as well. Maybe not a candidate on 3DFX though, but otherwise it's supported.
_________________
DirectQ Engine - New release 1.8.666a, 9th August 2010
MHQuake Blog (General)
Direct3D 8 Quake Engines
Back to top
View user's profile Send private message Visit poster's website
Display posts from previous:   
Post new topic   Reply to topic    Inside3d Forums Forum Index -> OpenGL Programming All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2004 phpBB Group