
Texture uploads on Android

September 25, 2013

RAD Game Tools, my employer, recently (version 2.2e) started shipping Bink 2.2 on Android, and we decided it was time to go over the example texture upload code in our SDK and see if there were any changes we should make – the original code was written years ago. So far, the answer seems to be “no”: what we’re doing seems to be about as good as we can expect when sticking to “mainstream” GL ES, reasonably widely-deployed extensions and the parts of Android officially exposed in the NDK. That said, I have done a bunch of performance measurements over the past few days, and they paint an interesting (if somewhat depressing) picture. A lot of people on Twitter seemed to be interested in my initial findings, so I asked my boss if it was okay if I published the “proper” results here, and he said yes – hence this post.

Setting

Okay, here’s what we’re measuring: we’re playing back a 1280×720 29.97fps Bink 2 video – namely, an earlier cut of this trailer that we have a very high-quality master version of (it’s one of our standard test videos); I’m sure the exact version of the video we use is on the Internet somewhere too, but I didn’t find it within the 2 minutes of googling, so here goes. We’re only using the first 700 frames of the video to speed up testing (700 frames is enough to get a decent sample).

Like most popular video codecs, Bink 2 produces output data in planar YUV, with the U/V color planes sub-sampled 2x both horizontally and vertically. These three planes get uploaded as 3 separate textures (which together form a “texture set”): one 1280×720 texture for luminance (Y) and two 640×360 textures for chrominance (Cb/Cr). (Bink and Bink 2 also support encoding an alpha channel, which adds another 1280×720 texture to the set). All three textures use the GL_LUMINANCE pixel format by default, with GL_UNSIGNED_BYTE data; that is, one byte per texel. This data is converted to RGB using a simple fragment shader.
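For reference, here is roughly the math such a fragment shader performs per pixel, written out in C++ – this is a generic full-range BT.601 conversion in fixed point; the exact coefficients Bink’s shader uses may differ:

```cpp
#include <cstdint>

static uint8_t clamp8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

// Generic full-range BT.601 YCbCr -> RGB in 16.16 fixed point; a fragment
// shader does the equivalent math per fragment. Coefficients here are the
// textbook ones, not necessarily Bink's. Assumes arithmetic right shift.
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                  uint8_t* r, uint8_t* g, uint8_t* b)
{
    int c = y, d = cb - 128, e = cr - 128;
    *r = clamp8(c + ((91881 * e) >> 16));              // + 1.40200*Cr
    *g = clamp8(c - ((22554 * d + 46802 * e) >> 16));  // - 0.34414*Cb - 0.71414*Cr
    *b = clamp8(c + ((116130 * d) >> 16));             // + 1.77200*Cb
}
```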

Every frame, we upload new data for all 3 textures in a set using glTexSubImage2D from the internal video frame buffers, uploading the entire image (we could track dirty regions fairly easily, but with slow uploads this increases frame rate variance, which is a bad thing). We then draw a single quad using the 3 textures and our fragment shader. All very straightforward stuff.

Furthermore, we actually keep two texture sets around – everything is double-buffered. You will see why this is a good idea despite the increased memory consumption in a second.

A wrinkle

One problem with GL ES targets is that the original GL ES went a bit overboard in removing core GL features. One important feature they removed was GL_UNPACK_ROW_LENGTH – this parameter sets the distance between adjacent rows in a client-specified image, counted in pixels. Why would you care about this? Simple: Say you have a 256×256 texture that you want to update from a system memory copy, but you know that you only changed the lower-left 128×128 pixels. By default, glTexSubImage2D with width = height = 128 will assume that the rows of the source image are 128 pixels wide and densely packed. Thus, to update just a 128×128 pixel region, you would have to either copy the lower left 128×128 pixels of your system memory texture into a smaller array that is densely packed, or call glTexSubImage2D 128 times, uploading a row at a time. Neither of these is very appealing from a performance perspective. But if you have GL_UNPACK_ROW_LENGTH, you can just set it to 256 and upload everything with a single call. Much nicer.
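Without the extension, the “copy into a smaller, densely packed array” option from above looks roughly like this – an illustrative helper, not actual Bink code; strides are in bytes, one byte per texel as with GL_LUMINANCE:

```cpp
#include <cstdint>
#include <cstring>

// Densely repack the w x h subrectangle at (x, y) of a larger image so it
// can then be uploaded with a single glTexSubImage2D call. src_stride is
// the distance between source rows in bytes; dst ends up packed with rows
// of exactly w bytes, which is what GL assumes by default.
void repack_subrect(const uint8_t* src, int src_stride,
                    int x, int y, int w, int h, uint8_t* dst)
{
    for (int row = 0; row < h; ++row)
        memcpy(dst + row * w, src + (y + row) * src_stride + x, w);
}
```

With GL_UNPACK_ROW_LENGTH available, none of this copying is necessary: you set the row length to the source image’s stride and upload straight from the original buffer.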

The reason Bink 2 needs this is because we support arbitrary-width videos, but like most video codecs, the actual coding is done in terms of larger units. For example, MPEG 1 through to H.264 all use 16×16-pixel “macroblocks”, and any video that is not a multiple of 16 pixels will get padded out to a multiple-of-16-size internally. Even if you didn’t need the extra data in the codec, you would still want adjacent rows in the plane buffers to be multiples of at least 16 pixels, simply so that every row is 16-byte aligned (an important magic number for a lot of SIMD instruction sets). Bink 2’s equivalent of macroblocks is 32×32 pixels in size, so we internally want rows to be a multiple of 32 pixels wide.

What this all means is that if you decide you really want a 65×65 pixel video, that’s fine, but we’re going to allocate our internal buffers as if it was 96 pixels wide (and 80 pixels tall – we can omit storage for the last 16 rows in the last macroblock). Which is where the unpack row length comes into play – if we have it, we can support “odd-sized” videos efficiently; if we don’t, we have to use the slower fallback, i.e. call glTexSubImage2D for every scan line individually. Luckily, there is the GL_EXT_unpack_subimage GL ES extension that adds this feature back in and is available on most recent devices; but for “odd” sizes on older devices, we’re stuck with uploading a row at a time.
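As a sketch, the sizing rule described above might look like this – my reconstruction from the numbers in the text (pad widths to a multiple of 32, pad heights to a multiple of 16, which is the same as padding to a full 32-pixel macroblock and then dropping the last 16 rows when they aren’t needed); the actual Bink allocation code isn’t shown in this post:

```cpp
// Round x up to a multiple of a (a must be a power of two).
static int align_up(int x, int a) { return (x + a - 1) & ~(a - 1); }

// Reconstruction of the plane sizing rule described above, e.g. a 65x65
// video gets a 96-pixel-wide, 80-row luma plane. Hypothetical names.
static int plane_stride(int width)  { return align_up(width, 32); }
static int plane_height(int height) { return align_up(height, 16); }
```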

That said, none of this affects our test video, since 1280 pixels width is a multiple of 32; I just thought I’d mention it anyway since it’s one of the random, non-obvious API compatibility issues you run into. Anyway, back to the subject.

Measuring texture updates

Okay, so here’s what I did: Bink 2 decodes the video on another (or multiple other) threads. Periodically – ideally, 30 times a second – we upload the current frame and draw it to the screen. My test program will never drop any frames; in other words, we may run slower than 30fps, but we will always upload and render all 700 frames, and we will never run faster than 30fps (well, 29.997fps, but close enough).

Around the texture upload, my test program does this:

    // update the GL textures
    clock_t start = clock();
    Update_Bink_textures( &TextureSet, Bink );
    clock_t total = clock() - start;

    upload_stats.record( (float) ( 1000.0 * total / CLOCKS_PER_SEC ) );

where upload_stats is an instance of the RunStatistics class I used in the Optimizing Software Occlusion Culling series. This gives me order statistics, mean and standard deviation for the texture update times, in milliseconds.
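The RunStatistics class itself isn’t reproduced in the post; as a rough idea of what it computes, a minimal stand-in might look like this (a sketch with made-up names, not the actual class from that series):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal stand-in for a stats recorder: collects samples and reports
// order statistics (nearest-rank percentiles), mean and standard deviation.
class RunStats {
    std::vector<float> samples;
public:
    void record(float v) { samples.push_back(v); }
    float percentile(float p) {  // p in [0,1]; assumes at least one sample
        std::vector<float> s(samples);
        std::sort(s.begin(), s.end());
        std::size_t i = (std::size_t)(p * (s.size() - 1) + 0.5f);
        return s[i];
    }
    float mean() const {
        double sum = 0.0;
        for (float v : samples) sum += v;
        return (float)(sum / samples.size());
    }
    float stddev() const {  // population standard deviation
        double m = mean(), sum = 0.0;
        for (float v : samples) sum += (v - m) * (v - m);
        return (float)std::sqrt(sum / samples.size());
    }
};
```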

I also have several different test variants that I run:

  • GL_LUMINANCE tests upload the texture data as GL_LUMINANCE as explained above. This is the “normal” path.
  • GL_RGBA tests upload the same bytes as a GL_RGBA texture, with all X coordinates (and the texture width) divided by 4. In other words, they transfer the same amount of data (and in fact the same data), just interpreted differently. This was done to check whether RGBA textures enjoy special optimizations in the drivers (spoiler: it seems like they do).
  • use1x1 tests force all glTexSubImage2D calls to upload just 1×1 pixels – in other words, this gives us the cost of API overhead, possible synchronization and texture ghosting while virtually removing any per-pixel costs (such as CPU color space conversion, swizzling, DMA transfers or memory bandwidth).
  • nodraw tests do all of the texture uploading, but then don’t actually draw the quad. This still measures processing time for the texture upload, but since the texture isn’t actually used, no synchronization or ghosting is ever necessary.
  • uploadall uses glTexImage2D instead of glTexSubImage2D to upload the whole texture. In theory, this will guarantee to the driver that all existing texture data is overwritten – so while texture ghosting might still have to allocate memory for a new texture, it won’t have to copy the old contents at least. In practice, it’s not clear if the drivers actually make use of that fact. For obvious reasons, this and use1x1 are mutually exclusive, and I only ran this test on the PowerVR device.

Results

So, without further ado, here are the results on the 4 devices I tested (apologies for the tiny font size, but that was the only way to squeeze it into the blog layout):

All times in milliseconds.

2010 Droid X (PowerVR SGX 530)

Format                             min    25th     med    75th     max     avg    sdev
GL_LUMINANCE                    14.190  15.472  17.700  20.233  70.893  19.704   5.955
GL_RGBA                         11.139  13.245  14.221  14.832  28.412  14.382   1.830
GL_LUMINANCE use1x1              0.061  38.269  39.398  41.077  93.750  41.905   6.517
GL_RGBA use1x1                   0.061  30.761  32.348  32.837  59.906  33.165   4.305
GL_LUMINANCE nodraw              9.979  12.726  13.427  14.985  29.632  13.854   1.788
GL_RGBA nodraw                   5.188  10.376  11.291  12.024  26.215  10.864   2.013
GL_LUMINANCE use1x1 nodraw       0.030   0.061   0.061   0.092   0.733   0.086   0.058
GL_RGBA use1x1 nodraw            0.030   0.061   0.061   0.091   0.916   0.082   0.081
GL_LUMINANCE uploadall          13.611  15.106  17.822  19.653  73.944  19.312   6.145
GL_RGBA uploadall                7.171  12.543  13.489  14.282  34.119  13.751   1.854
GL_LUMINANCE uploadall nodraw    9.491  12.756  13.702  14.862  33.966  13.994   2.176
GL_RGBA uploadall nodraw         5.158   9.796  10.956  11.718  22.735  10.465   2.135

2012 Nexus 7 (Nvidia Tegra 3)

Format                             min    25th     med    75th     max     avg    sdev
GL_LUMINANCE                     6.659   7.706   8.710  10.627  18.842   9.597   2.745
GL_RGBA                          3.278   3.600   4.128   4.906   9.244   4.395   1.011
GL_LUMINANCE use1x1              0.298   0.361   0.421   0.567   1.843   0.468   0.151
GL_RGBA use1x1                   0.297   0.354   0.422   0.561   1.687   0.468   0.152
GL_LUMINANCE nodraw              6.690   7.674   8.669   9.815  24.035   9.495   2.929
GL_RGBA nodraw                   3.208   3.501   3.973   5.974  12.059   4.737   1.589
GL_LUMINANCE use1x1 nodraw       0.295   0.360   0.413   0.676   1.569   0.520   0.204
GL_RGBA use1x1 nodraw            0.270   0.327   0.404   0.663   1.946   0.506   0.234

2013 Nexus 7 (Qualcomm Adreno 320)

Format                             min    25th     med    75th     max     avg    sdev
GL_LUMINANCE                     0.732   0.976   1.190   3.907  22.249   2.383   1.879
GL_RGBA                          0.610   0.824   0.977   3.510  13.368   2.163   1.695
GL_LUMINANCE use1x1              0.030   0.061   0.061   0.091   3.143   0.080   0.187
GL_RGBA use1x1                   0.030   0.061   0.091   0.092   4.303   0.104   0.248
GL_LUMINANCE nodraw              0.793   1.098   3.570   4.425  25.760   3.001   2.076
GL_RGBA nodraw                   0.732   0.916   1.038   3.937  26.370   2.416   2.190
GL_LUMINANCE use1x1 nodraw       0.030   0.061   0.091   0.092   4.181   0.090   0.204
GL_RGBA use1x1 nodraw            0.030   0.061   0.091   0.122   4.272   0.114   0.292

2012 Nexus 10 (ARM Mali T604)

Format                             min    25th     med    75th     max     avg    sdev
GL_LUMINANCE                     1.292   2.782   3.590   4.439  16.893   3.656   1.256
GL_RGBA                          1.451   2.782   3.432   4.358   8.517   3.551   0.982
GL_LUMINANCE use1x1              0.193   0.284   0.369   0.670  17.598   0.862   2.230
GL_RGBA use1x1                   0.100   0.147   0.199   0.313  20.896   0.656   2.349
GL_LUMINANCE nodraw              1.314   2.179   2.320   2.823  10.677   2.548   0.700
GL_RGBA nodraw                   1.209   2.101   2.196   2.539   5.008   2.414   0.553
GL_LUMINANCE use1x1 nodraw       0.190   0.294   0.365   0.601   2.113   0.456   0.228
GL_RGBA use1x1 nodraw            0.094   0.119   0.162   0.288   2.771   0.217   0.162

Yes, a bunch of raw data and no fancy graphs – not this time. Here are my observations:

  • GL_RGBA textures are indeed a good deal faster than luminance ones on most devices. However, the ratio is not big enough to make CPU-side color space conversion to RGB (or even just interleaving the planes into an RGBA layout on the CPU side) a win, so there’s not much to do about it.
  • Variability between devices is huge. Hooray for fragmentation.
  • Newer devices tend to have fairly reasonable texture upload times, but there’s still lots of variation.
  • Holy crap does the Droid X show badly in this test – it has both really slow upload times and horrible texture ghosting costs, and that despite us already alternating between a pair of texture sets! I hope that’s a problem that’s been fixed in the meantime, but since I don’t have any newer PowerVR devices here to test with, I can’t be sure.

So, to summarize it in one word: Ugh.

From → Coding, Multimedia

4 Comments
  1. Philip permalink

    If you’re displaying the video in a sufficiently constrained way (overlaying everything else, and with scaling but no rotation), it should be hugely faster to convince SurfaceFlinger to display your images in a new surface and avoid touching OpenGL at all – you can pass the surface a YCbCr_420_SP buffer, and typically the composition hardware will spit it out directly to the display essentially for free.

    If you really need to render it with GL, I’d expect creating a YCbCr surface and passing its buffers to eglCreateImageKHR(… EGL_NATIVE_BUFFER_ANDROID, …) etc should also be reasonably fast, since people building Android devices care about the power usage of camera preview which is basically doing that.

    I’m not certain how much of this is exposed through the NDK or Java API though. And there will be lots of device-specific performance quirks and bugs. The weakness of the hardware and the lack of quality in the drivers is quite a pain when you try to do anything the device wasn’t explicitly optimised for :-(

  2. In one place you say 29.997 fps when you mean 29.97 fps. It’s probably worth correcting that just to avoid confusing readers.

    • IIRC the video was actually tagged as 29.997fps not 29.97fps in the header. People mistyping the frame rate as they render out videos happens more often than you’d think. :)

  3. thx for this benchmark!
    about extensions, sometimes even GL_EXT_unpack_subimage is not defined, but the device supports it.
    example: GT-I9505 Samsung S4 – Adreno (TM) 320
