Texture uploads on Android
RAD Game Tools, my employer, recently started shipping Bink 2.2 on Android (as of version 2.2e), and we decided it was time to go over the example texture upload code in our SDK and see if there were any changes we should make – the original code was written years ago. So far, the answer seems to be “no”: what we’re doing seems to be about as good as we can expect when sticking to “mainstream” GL ES, reasonably widely-deployed extensions and the parts of Android officially exposed in the NDK. That said, I have done a bunch of performance measurements over the past few days, and they paint an interesting (if somewhat depressing) picture. A lot of people on Twitter seemed to be interested in my initial findings, so I asked my boss if it was okay if I published the “proper” results here and he said yes – hence this post.
Okay, here’s what we’re measuring: we’re playing back a 1280×720 29.97fps Bink 2 video – namely, an earlier cut of this trailer that we have a very high-quality master version of (it’s one of our standard test videos); I’m sure the exact version of the video we use is on the Internet somewhere too, but I didn’t find it within two minutes of googling, so I gave up. We’re only using the first 700 frames of the video to speed up testing (700 frames is enough to get a decent sample).
Like most popular video codecs, Bink 2 produces output data in planar YUV, with the U/V color planes sub-sampled 2x both horizontally and vertically. These three planes get uploaded as 3 separate textures (which together form a “texture set”): one 1280×720 texture for luminance (Y) and two 640×360 textures for chrominance (Cb/Cr). (Bink and Bink 2 also support encoding an alpha channel, which adds another 1280×720 texture to the set.) All three textures use the GL_LUMINANCE pixel format by default, with GL_UNSIGNED_BYTE data; that is, one byte per texel. This data is converted to RGB using a simple fragment shader.
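The per-texel math such a shader performs is the standard YCbCr-to-RGB transform. As a rough sketch – assuming BT.601 limited-range coefficients, which may not be exactly what the shader in the SDK uses – it looks like this on the CPU side:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the per-texel math a YUV->RGB fragment shader performs.
// Assumes BT.601 "limited range" coefficients; the actual shader in the
// Bink SDK may use different constants.
struct RGB { uint8_t r, g, b; };

static uint8_t clamp8(float x) {
    // round to nearest and clamp to [0,255]
    return (uint8_t)std::min(255.0f, std::max(0.0f, x + 0.5f));
}

RGB yuv_to_rgb(uint8_t y, uint8_t cb, uint8_t cr) {
    float yf  = 1.164f * ((int)y - 16);  // scale Y from [16,235] to [0,255]
    float cbf = (float)((int)cb - 128);  // center chroma around 0
    float crf = (float)((int)cr - 128);
    return RGB{
        clamp8(yf + 1.596f * crf),                 // R
        clamp8(yf - 0.392f * cbf - 0.813f * crf),  // G
        clamp8(yf + 2.017f * cbf),                 // B
    };
}
```

The GPU version is the same three dot products per fragment, which is why doing the conversion in the shader is so cheap compared to doing it on the CPU.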
Every frame, we upload new data for all 3 textures in a set using glTexSubImage2D from the internal video frame buffers, uploading the entire image (we could track dirty regions fairly easily, but with slow uploads this increases frame rate variance, which is a bad thing). We then draw a single quad using the 3 textures and our fragment shader. All very straightforward stuff.
Furthermore, we actually keep two texture sets around – everything is double-buffered. You will see why this is a good idea despite the increased memory consumption in a second.
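The bookkeeping for this is just index flipping; a minimal sketch of the idea (all names here are made up for illustration, not the actual SDK API):

```cpp
#include <cstdint>

// Sketch of a double-buffered "texture set" for video playback.
// Plane layout follows the post: one full-size Y plane plus half-size
// Cb/Cr planes. GL texture names are stood in for by uint32_t.
struct TextureSet {
    uint32_t y_tex, cb_tex, cr_tex;  // would be GL texture names
};

struct VideoTextures {
    TextureSet sets[2];  // upload into one set while the GPU may still
    int cur = 0;         // be reading from the other
    TextureSet& current() { return sets[cur]; }
    void flip() { cur ^= 1; }  // swap after each frame's upload+draw
};

// Plane dimensions for a given video size (U/V subsampled 2x each way).
struct PlaneDims { int yw, yh, cw, ch; };
PlaneDims plane_dims(int w, int h) {
    return PlaneDims{ w, h, w / 2, h / 2 };
}
```

The point of the second set is that if the GPU is still reading last frame’s textures when we start uploading this frame’s data, we don’t force the driver to either stall or secretly copy (“ghost”) the texture – we’re writing to a texture the GPU isn’t currently using.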
One problem with GL ES targets is that the original GL ES went a bit overboard in removing core GL features. One important feature they removed was GL_UNPACK_ROW_LENGTH – this parameter sets the distance between adjacent rows in a client-specified image, counted in pixels. Why would you care about this? Simple: say you have a 256×256 texture that you want to update from a system memory copy, but you know that you only changed the lower-left 128×128 pixels. By default, a glTexSubImage2D call with width = height = 128 will assume that the rows of the source image are 128 pixels wide and densely packed. Thus, to update just a 128×128 pixel region, you would have to either copy the lower-left 128×128 pixels of your system memory texture into a smaller array that is densely packed, or call glTexSubImage2D 128 times, uploading a row at a time. Neither of these is very appealing from a performance perspective. But if you have GL_UNPACK_ROW_LENGTH, you can just set it to 256 and upload everything with a single call. Much nicer.
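In terms of what the driver actually reads, GL_UNPACK_ROW_LENGTH just changes the source row stride. Expressed as a plain CPU copy (not GL code – just the addressing, for one-byte-per-pixel data):

```cpp
#include <cstdint>
#include <cstring>

// What GL_UNPACK_ROW_LENGTH changes, expressed as a plain CPU copy:
// without it, glTexSubImage2D assumes source rows are 'width' pixels
// apart; with it, rows are 'row_length' pixels apart, so you can pass
// a sub-rectangle of a larger image directly.
void copy_subimage(uint8_t* dst, const uint8_t* src,
                   int width, int height, int row_length) {
    for (int y = 0; y < height; y++)
        memcpy(dst + y * width,       // dest: densely packed rows
               src + y * row_length,  // src: rows strided by row_length
               width);                // one byte per pixel (GL_LUMINANCE)
}
```

With row_length = 256 and width = height = 128, this pulls a 128×128 quadrant out of a 256-pixel-wide image in one pass – exactly the repack you’d otherwise have to do by hand before calling glTexSubImage2D on a GL ES target without the extension.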
The reason Bink 2 needs this is because we support arbitrary-width videos, but like most video codecs, the actual coding is done in terms of larger units. For example, MPEG 1 through H.264 all use 16×16-pixel “macroblocks”, and any video whose dimensions are not a multiple of 16 pixels will get padded out to a multiple-of-16 size internally. Even if you didn’t need the extra data in the codec, you would still want adjacent rows in the plane buffers to be multiples of at least 16 pixels apart, simply so that every row is 16-byte aligned (an important magic number for a lot of SIMD instruction sets). Bink 2’s equivalent of macroblocks is 32×32 pixels in size, so we internally want rows to be a multiple of 32 pixels wide.
What this all means is that if you decide you really want a 65×65 pixel video, that’s fine, but we’re going to allocate our internal buffers as if it was 96 pixels wide (and 80 pixels tall – we can omit storage for the last 16 rows in the last macroblock). Which is where the unpack row length comes into play – if we have it, we can support “odd-sized” videos efficiently; if we don’t, we have to use the slower fallback, i.e. call glTexSubImage2D for every scan line individually. Luckily, there is the GL_EXT_unpack_subimage GL ES extension that adds this feature back in and is available on most recent devices; but for “odd” sizes on older devices, we’re stuck with uploading a row at a time.
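The rounding itself is simple; a sketch, assuming the rules work exactly as the 65×65 example above suggests (rows rounded up to a multiple of 32 pixels, height rounded up to a multiple of 16 – the helper names are made up):

```cpp
// Internal plane dimensions as described above: row pitch rounded up
// to a multiple of 32 pixels (Bink 2's block size), height rounded up
// to a multiple of 16 (the last 16 rows of a partial 32-pixel block
// can be omitted). Assumes this matches the 65x65 -> 96x80 example.
static int round_up(int x, int multiple) {
    return (x + multiple - 1) / multiple * multiple;
}

struct InternalDims { int width, height; };

InternalDims internal_plane_dims(int w, int h) {
    return InternalDims{ round_up(w, 32), round_up(h, 16) };
}
```

For a width that’s already a multiple of 32 (like 1280), the internal pitch equals the video width and no row-length trickery is needed – which is why the test video below doesn’t hit this path.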
That said, none of this affects our test video, since 1280 pixels width is a multiple of 32; I just thought I’d mention it anyway, since it’s one of the random, non-obvious API compatibility issues you run into. Anyway, back to the subject.
Measuring texture updates
Okay, so here’s what I did: Bink 2 decodes the video on another thread (or multiple other threads). Periodically – ideally, 30 times a second – we upload the current frame and draw it to the screen. My test program will never drop any frames; in other words, we may run slower than 30fps, but we will always upload and render all 700 frames, and we will never run faster than 30fps (well, 29.97fps, but close enough).
Around the texture upload, my test program does this:
    // update the GL textures
    clock_t start = clock();
    Update_Bink_textures( &TextureSet, Bink );
    clock_t total = clock() - start;
    upload_stats.record( (float) ( 1000.0 * total / CLOCKS_PER_SEC ) );
upload_stats is an instance of the RunStatistics class I used in the Optimizing Software Occlusion Culling series. This gives me order statistics, mean and standard deviation for the texture update times, in milliseconds.
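RunStatistics itself comes from that series; a minimal stand-in that produces the same kind of summary (this is an illustration, not the actual class) looks like:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal stand-in for the RunStatistics class mentioned above:
// records samples and reports order statistics, mean and standard
// deviation. Not the actual class from the series.
class RunStatistics {
    std::vector<float> samples;
public:
    void record(float v) { samples.push_back(v); }

    // q in [0,1]; e.g. 0.5 for the median. Nearest-rank style.
    float percentile(float q) const {
        std::vector<float> s = samples;
        std::sort(s.begin(), s.end());
        size_t i = (size_t)(q * (s.size() - 1) + 0.5f);
        return s[i];
    }

    float mean() const {
        double sum = 0;
        for (float v : samples) sum += v;
        return (float)(sum / samples.size());
    }

    float stddev() const {
        double m = mean(), acc = 0;
        for (float v : samples) acc += (v - m) * (v - m);
        return (float)std::sqrt(acc / samples.size());
    }
};
```

The min/25th/med/75th/max columns in the table below are just percentile(0), percentile(0.25), and so on over the 700 recorded upload times.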
I also have several different test variants that I run:
- GL_LUMINANCE tests upload the texture data as GL_LUMINANCE, as explained above. This is the “normal” path.
- GL_RGBA tests upload the same bytes as a GL_RGBA texture, with all X coordinates (and the texture width) divided by 4. In other words, they transfer the same amount of data (and in fact the same data), just interpreted differently. This was done to check whether RGBA textures enjoy special optimizations in the drivers (spoiler: it seems like they do).
- use1x1 tests force all glTexSubImage2D calls to upload just 1×1 pixels – in other words, this gives us the cost of API overhead, possible synchronization and texture ghosting while virtually removing any per-pixel costs (such as CPU color space conversion, swizzling, DMA transfers or memory bandwidth).
- nodraw tests do all of the texture uploading, but then don’t actually draw the quad. This still measures processing time for the texture upload, but since the texture isn’t actually used, no synchronization or ghosting is ever necessary.
- uploadall tests use a single glTexSubImage2D call to upload the whole texture. In theory, this will guarantee to the driver that all existing texture data is overwritten – so while texture ghosting might still have to allocate memory for a new texture, it won’t have to copy the old contents at least. In practice, it’s not clear if the drivers actually make use of that fact. For obvious reasons, this and use1x1 are mutually exclusive, and I only ran this test on the PowerVR device.
So, without further ado, here are the results on the 4 devices I tested, with all times in milliseconds (apologies for the tiny font size, but that was the only way to squeeze it into the blog layout):
| Device / GPU | Format | min | 25th | med | 75th | max | avg | sdev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2010 Droid X (PowerVR SGX 530) | GL_LUMINANCE | 14.190 | 15.472 | 17.700 | 20.233 | 70.893 | 19.704 | 5.955 |
| | GL_LUMINANCE use1x1 nodraw | 0.030 | 0.061 | 0.061 | 0.092 | 0.733 | 0.086 | 0.058 |
| | GL_RGBA use1x1 nodraw | 0.030 | 0.061 | 0.061 | 0.091 | 0.916 | 0.082 | 0.081 |
| | GL_LUMINANCE uploadall nodraw | 9.491 | 12.756 | 13.702 | 14.862 | 33.966 | 13.994 | 2.176 |
| | GL_RGBA uploadall nodraw | 5.158 | 9.796 | 10.956 | 11.718 | 22.735 | 10.465 | 2.135 |
| 2012 Nexus 7 (Nvidia Tegra 3) | GL_LUMINANCE | 6.659 | 7.706 | 8.710 | 10.627 | 18.842 | 9.597 | 2.745 |
| | GL_LUMINANCE use1x1 nodraw | 0.295 | 0.360 | 0.413 | 0.676 | 1.569 | 0.520 | 0.204 |
| | GL_RGBA use1x1 nodraw | 0.270 | 0.327 | 0.404 | 0.663 | 1.946 | 0.506 | 0.234 |
| 2013 Nexus 7 (Qualcomm Adreno 320) | GL_LUMINANCE | 0.732 | 0.976 | 1.190 | 3.907 | 22.249 | 2.383 | 1.879 |
| | GL_LUMINANCE use1x1 nodraw | 0.030 | 0.061 | 0.091 | 0.092 | 4.181 | 0.090 | 0.204 |
| | GL_RGBA use1x1 nodraw | 0.030 | 0.061 | 0.091 | 0.122 | 4.272 | 0.114 | 0.292 |
| 2012 Nexus 10 (ARM Mali T604) | GL_LUMINANCE | 1.292 | 2.782 | 3.590 | 4.439 | 16.893 | 3.656 | 1.256 |
| | GL_LUMINANCE use1x1 nodraw | 0.190 | 0.294 | 0.365 | 0.601 | 2.113 | 0.456 | 0.228 |
| | GL_RGBA use1x1 nodraw | 0.094 | 0.119 | 0.162 | 0.288 | 2.771 | 0.217 | 0.162 |
Yes, a bunch of raw data and no fancy graphs – not this time. Here are my observations:
- GL_RGBA textures are indeed a good deal faster than luminance ones on most devices. However, the ratio is not big enough to make CPU-side color space conversion to RGB (or even just interleaving the planes into an RGBA layout on the CPU side) a win, so there’s not much to do about it.
- Variability between devices is huge. Hooray for fragmentation.
- Newer devices tend to have fairly reasonable texture upload times, but there’s still lots of variation.
- Holy crap, does the Droid X ever do badly in this test – it has both really slow upload times and horrible texture ghosting costs, and that despite us already alternating between a pair of texture sets! I hope that’s a problem that has been fixed in the meantime, but since I don’t have any newer PowerVR devices here to test with, I can’t be sure.
So, to summarize it in one word: Ugh.