This post is part of the series “Debris: Opening the box“.
Now that the code is released, I’m going to change things up a bit. Originally, I was going to describe Werkkzeug 3’s operator execution engine in isolation; but now there’s enough material to give you the story in the proper order, with all the false starts and backtracking that went into it, which should explain some of the design quirks of the final result. And it all starts with GenThree, the tool we wrote for Candytron. (Note: I originally wrote this text for the fr_public repository.)
Introduction
GenThree was the next tool Chaos started after Werkkzeug1. See diary.txt for the early history (if you can read German). I joined the project some time after the last entry in diary.txt, in early January 2003. This marks the first time that Chaos and I collaborated directly; before then, Chaos had used my packers, but we hadn’t shared much code.
After joining, I initially took over the Mesh generator and started work on the “FX Chain” (postprocessing effects), while Chaos focused on the GUI and Scripting parts of the system. After the first two months or so, it rapidly became very hard to say exactly who did what, because there was no clear top-level division of labor; we worked quite well together, to the point where we sometimes weren’t certain who had written what. As an aside, if you can read German, concept.txt contains a bunch of notes by Chaos, again mostly from the early phases of the project. Some of these ideas ended up in GenThree (and later Werkkzeug3), some never got implemented, and some we tried but rejected.
Scripting
Anyway; even when I started (this was in the lead-up to Candytron), we were very clearly aiming at producing kkrieger; Candytron was intended to be a kind of milestone. The thing that differentiated GenThree from our previous systems was that it was intended to replace the Operator Graph that had been at the center of our previous tools with a scripting language. The Operator Stacking UI still exists, but outputs script code instead of a blob describing the graph. The intention was to use this UI to build up components, but do all of the high-level glue (and, ideally, a lot of the game logic for kkrieger) as a script.
We were considering several kinds of approaches, and different levels of abstraction – e.g. the top of concept.txt has a detailed sketch of a quite low-level scripting system that would compile to regular x86 code, plus the associated back-end. This was never implemented; the idea evolved into another scripting system that was intended to run-time compile to regular x86 code. Then we’d store the (compressed) byte code plus a small runtime engine instead of regular x86 code, with the hope that the byte code would be smaller. The language was designed to facilitate easy compilation; it was very FORTH-like, so an initial interpreter could just use explicit stacks and threaded code. This one never happened either, primarily because we dreaded going all-in on something this experimental with a code base that was intended to last till kkrieger at the very least. However, the underlying idea (store all our code in a form that’s more amenable to compression) never went away, and some years later metamorphosed into dispack/disfilter, the x86 code transform used by kkrunchy.
But back to GenThree. Scripting system approach number three, and what you see here, was CSL – Chaos’ Scripting Language (he did all the work on that one). The initial idea is described in concept.txt under “Scripted Demo System”. The runtime system is still very FORTH-like, but it uses three stacks instead of the customary two:
- The I-Stack or integer stack, which contains 16.16 fixed point numbers (there’s no support for floating-point types in script code)
- The R-Stack or return stack, used to implement calls, loops etc.
- The O-stack or object stack, which holds references to complex composite objects like Bitmaps, Meshes and so forth. These objects are typed.
The operation of the I- and R-stacks is only visible to the bytecode interpreter; the language itself is C-like and exposes familiar control structures to the user. The O-stack is different; it’s explicitly manipulated by the code.
Rather than explaining this to you in boring detail, I just recommend you check out system.txt. This defines the set of operators available to every project, and displays some of the unconventional features of the system:
At the beginning, a bunch of classes are defined. These correspond to classes of values on the O-stack. The fields listed define instance variables; I don’t remember how exactly that particular binding worked. Each class also has a magic ID, written as a hexadecimal number. Since we use 16.16 fixed point, they’re written in a somewhat funky way here; they’re regular 32-bit ints on the C++ side.
After that, there are some global functions and variables. Note that functions can be assigned IDs, just like classes. This is part of a light-weight binding scheme. By convention, positive IDs are used for script functions that are called by C++ code (so the OnInit, OnFrame and OnSound functions are forward declarations for code that’s supposed to be written in CSL), while negative IDs are script declarations for functions implemented in native C++ code. There’s a table in the C++ code mapping those magic numbers to function pointers – and nothing else. Note that each function has type information on the parameters; this is detailed enough to manually build a stack frame and call into the given C++ function, without needing any layer of glue code or marshalling. This is one of the more unconventional design decisions in CSL; it was done both for convenience and to reduce code size, since mechanical wrapper/glue code is very repetitive and adds zero value. Also note that each function comes with a description of what types of objects are expected to be at the top of the O-stack before execution, and what ends up on the O-stack afterwards. To pick a declaration at random:
void MeshMaterial( int id = 1 [ 1 .. 255 ], int mask = 0 [ "mask8:f" ], ) ( mesh material -- mesh ) = -0x0b,"mod link2",'m';
The ( mesh material -- mesh ) line here denotes that this operator expects a mesh and a material object on the O-stack, and returns a new mesh object. It also means that the op has binding number -0xb (-11); the "mod link2" is a list of annotation tags that can influence GUI, code generation and memory allocation (“mod” here means that the operator can modify an object in-place, for example); and the ‘m’ assigns the hotkey ‘m’ to this operator in the GUI. You get the idea.
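To illustrate the binding scheme, the C++ side really is just a table from magic numbers to function pointers – something like the following sketch, in which everything except the -0x0b binding number from the declaration above is made up (the real table is in genplayer.cpp):

```cpp
#include <cassert>
#include <map>

struct OStack; // the typed object stack, elided here

// A native operator receives the object stack plus its (already converted)
// integer parameters -- the stack frame is built directly from the declared
// parameter types, with no per-function marshalling code.
typedef void (*NativeOp)(OStack& os, const int* params, int paramCount);

void MeshMaterialOp(OStack&, const int*, int) { /* native implementation */ }
void MeshCubeOp(OStack&, const int*, int)     { /* native implementation */ }

// Negative script IDs map to native functions -- and this table is the
// entire C++ side of the binding.
static const std::map<int, NativeOp> NativeOps = {
  { -0x0b, MeshMaterialOp },
  { -0x01, MeshCubeOp },     // hypothetical second entry
};
```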
There are even some operators written completely in CSL – cf. the Crashzoom and FXWideBlur filters, both of which use the language and call into a single underlying C++ function, Blend4x, which implements various render-to-texture effects.
Going back up a level, I mentioned that there’s still a GUI behind all this. Well, the op-stacking UI is alive and kicking in GenThree, so how does that part work internally? Just look at data/candytron_final_064.csl, the final generated source code for Candytron. The first part of this is just system.txt, which I have just described. After that comes a bunch of generated code corresponding to whatever the user built in the Op-Stacking GUI (this is OnGenerate and friends) and the animation/timeline code (timeline_OnInit and timeline_OnFrame). All of this is code with a very regular structure that’s intended to compress well. Finally, at the very end there’s the bit starting with // new text – this is actually code entered in the tool itself by the user. The idea was that you could code in there too, but it never saw serious use.
So all this looks pretty nice on paper, right? C++ code providing low-level functions and runtime services, a script engine to tie it all together, and a UI that’s only loosely coupled to the rest: anything that can compile to CSL goes.
The only problem was that (language quirks like the fixed-point centric world view aside) it didn’t work well for what we needed it for. It had reasonable density for code, but as a data representation (and all the Ops are really more data than code) it sucked. Our previous systems had a very compact representation for ops, and lots of tricks to quantize and pack parameter values into a small space. In the byte code-based system, it all boiled down to general “push value” and “call function” op codes, and while the type information was there for execution purposes (to convert the 16.16 ints into floats when necessary, for example), none of that structure was evident or easily exploitable in the byte code. As a result, compression ratios of the byte code sucked – compared to what we were used to, anyway. Candytron had significantly less procedurally generated content than our average intro, yet still spent about 8k (after compression!) on describing it, whereas most of our intros around that time needed maybe 6k for the ops.
On the other side, the script runtime system was fairly big too – much bigger than the more specialized operator execution engines we’d used before (or after). And it just wasn’t pulling its weight. So the summer after we released Candytron, Chaos threw away the scripting parts and most of the existing GUI, but kept the Operators – Werkkzeug3 was born. Still, the base system and most of the content generation parts (and their interface) survived. As one of several side effects, that means that Wz3 used (and still uses) the same “explicit description of stack layout” method for parameters. This has far fewer benefits in a non-script environment (and we probably wouldn’t have done it if we had been shooting for ops from the beginning), but it’s really the only part of the original scripting engine that survived.
Code organization
Let’s start at the beginning: _start.cpp (and _startdx.cpp). These two files (plus associated header files) form the OS/rendering abstraction. Yes that’s right, no other file in the project includes any of the Windows or D3D header files, it’s all abstracted away. Fundamentally this is not hard because a demo or intro really doesn’t care that much about what OS it’s running on; what it wants is a window, a nice way to switch state and render triangles, a pipe to output sound to and a way to get current timing information (and maybe keypresses). The tools need a bit more on the input side (proper mouse and keyboard input at least) and some file IO, but it’s still a very limited set of functionality.
The tools use the aforementioned two files; intros use _startintro.cpp, which is a size-optimized mash-up of both _start and _startdx. There’s also _startgdi (for GDI-based GUI-only rendering) and _startgl (GL-based), both of which were never properly completed and don’t work, so there’s not much to talk about.
_start performs initialization and then calls sAppHandler. sAppHandler is implemented in the actual main program and is basically an event handler. To give an example of what constitutes events:
- sAPPCODE_CONFIG – display a configuration dialog (called before the 3D API is initialized).
- sAPPCODE_INIT – main initialization phase (after the 3D API is initialized).
- sAPPCODE_EXIT – similarly, shutdown phase.
- sAPPCODE_KEY – keyboard input.
- sAPPCODE_PAINT – paint a new frame.
There’s a few more, but you get the idea. There’s two main versions of this: the “tool” runtime in main.cpp and the “player” runtime in mainplayer.cpp. The tool version sets up a window and then runs the GUI event loop; the player version does whatever initialization is necessary and then plays back the demo.
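Here is a rough sketch of the shape such a handler takes – illustrative only; the actual event codes and signature live in the _start headers, and the handler bodies in main.cpp and mainplayer.cpp:

```cpp
// Illustrative event codes; names follow the post, everything else is made up.
enum { sAPPCODE_CONFIG, sAPPCODE_INIT, sAPPCODE_EXIT, sAPPCODE_KEY, sAPPCODE_PAINT };

static bool Initialized   = false;
static int  FramesPainted = 0;

// _start calls this with events; the return value signals "keep running".
bool sAppHandler(int code, int value) {
  switch (code) {
    case sAPPCODE_CONFIG: return true;            // show config dialog here
    case sAPPCODE_INIT:   Initialized = true;  return true;
    case sAPPCODE_EXIT:   Initialized = false; return true;
    case sAPPCODE_PAINT:  if (Initialized) FramesPainted++; return true;
    case sAPPCODE_KEY:    return value != 27;     // e.g. quit on Escape
  }
  return true;
}
```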
The GUI is a regular retained-mode event loop-based affair. Everything is rendered using the 3D API (D3D in our case) though, and re-rendered every frame, which greatly cuts down on updating bugs :). There’s a lot of UI and UI-related code in GenThree, Werkkzeug3 etc., but that kind of code doesn’t make for very exciting exposition, so I’ll just gloss over it here.
The scripting engine that I’ve talked about is split into two files: cslce.cpp, which implements the scanner, parser and bytecode generator, and cslrt.cpp, the bytecode interpreter and runtime system.
CSL is a language suitable for one-pass compilation, with semantic processing and code generation interleaved with parsing. It can be processed with a straightforward recursive descent parser. This class of languages has a long tradition and leads to simple, fast (but not very smart) compilers that don’t depend on any special parser generator tools. CSL in particular was greatly inspired by LCC and its code as described in the book “A Retargetable C Compiler: Design and Implementation” – a very worthwhile read if you would like to expand your horizons on a type of software engineering that’s underappreciated: simple, straightforward, very well thought-out no-nonsense C code. (As you might be able to tell, Chaos and I really like that book). Anyway.
The second part, cslrt, implements the runtime. Since the bytecode is stack-based with a small set of built-in operations, this is really quite straightforward.
On the player side, the rest of the source, including all the generators, binds loosely to the script runtime, which calls the shots. The table of script functions, together with some glue code, is in genplayer.cpp. The rest is just a bag of script-callable functions, which I’ll describe one functional group at a time.
Mesh generation
This is something we tried to push hard for this intro – much more so than in our previous releases. The mesh generator contains a bunch of ideas, some
of which worked out well, and lots that didn’t. The core mesh data structures are half-edge based, and the implementation is contained in genmesh.cpp and genmesh.hpp. I’ve written about this in detail before – just look at the mesh-processing articles linked from the parent post.
One part I haven’t talked about before, and that only appears in GenThree, is the Recorder – the Rec* group of functions in the GenMesh class. This was an experimental approach for mesh animation that we used heavily in Candytron. The basic idea was to separate the topological processing (which involves tricky and often slow manipulation of complex data structures) from the vertex processing (which, for a lot of operators, simply computes new vertices as linear combinations of existing ones – simple and relatively fast).
So Chaos had the idea to separate the two. The topological processing just runs once, at init time. All the topological modifications are done at that time, but vertex manipulations get recorded to a log. Most of the time, this log is played back immediately and then thrown away – but if the user wants to animate something, he can turn on “proper” recording for the mesh, which means that we actually keep the log for runtime evaluation. Then, every frame, a bunch of input parameters can be changed and the log is played back. Since we only modify vertices, not indices or connectivity, we just need to stream the new data into a vertex buffer. This system is far more flexible than regular skinning; several scenes in Candytron first skin the girl mesh (unsubdivided), then subdivide it once, extrude parts of it, and subdivide again, for example. The extrusion and subdivision operators are relatively heavyweight, since they try to deal correctly with hard edges, discontinuities and so forth, but the part that runs at runtime only does very simple operations on vectors, so it’s quite fast.
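A minimal sketch of the recorder idea (the names and representation here are mine, not the actual Rec* interface in GenMesh): record vertex operations as linear combinations at init time, then replay them against fresh vertex data every frame:

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// One recorded op: dst = wa*v[a] + wb*v[b]. A linear combination covers
// what most vertex-processing operators actually do per vertex.
struct RecOp { int dst, a, b; float wa, wb; };

struct Recorder {
  std::vector<RecOp> log;

  // Called at init time, while the topology ops run exactly once.
  void RecLerp(int dst, int a, int b, float wa, float wb) {
    log.push_back({dst, a, b, wa, wb});
  }

  // Played back every frame; touches only vertex data, never connectivity,
  // so the results can be streamed straight into a vertex buffer.
  void Replay(std::vector<Vec3>& v) const {
    for (const RecOp& op : log) {
      const Vec3 &pa = v[op.a], &pb = v[op.b];
      v[op.dst] = { op.wa * pa.x + op.wb * pb.x,
                    op.wa * pa.y + op.wb * pb.y,
                    op.wa * pa.z + op.wb * pb.z };
    }
  }
};
```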
While a neat idea, we ultimately killed this one off too. It worked just fine, but in practice just phrasing all our dynamic mesh animation in terms of skinning made things simpler and more orthogonal at the back end, allowed us to do more work in the mesh consolidation/vertex buffer generation phase, and simplified the mesh code a bit (since there was no longer a need to separate topology and geometry processing into two parts and explicitly record every parameter that went into vertex generation). And finally, a single static skinning setup can be baked into a vertex shader for performance; the more complex recorder system, with its variable input-output relationships and different operations done in different sequence, not so much. Note that we still ended up using SW skinning in kkrieger/debris for other reasons (shadow volumes), but the decision to remove the recorder was made long before we committed to shadow volumes.
Finally, there’s Mesh_Babe, which was used to get the girl mesh (her name is “Josie”, by the way) into the intro. On the Editor side, this just loads an exported mesh (.XSI file in this case, since giZMo – the artist on this project – was using XSI). However, the XSI file is much too big to use in a 64k, so we implemented some (then) cutting-edge mesh compression; the papers had only been published a few months prior! The algorithm we ended up using was described in “Near-Optimal Connectivity Encoding of 2-Manifold Polygon Meshes” by Khodakovsky et al.; there was a separate paper “Compressing Polygon Mesh Connectivity with Degree Duality Prediction” by Isenburg that described the same basic idea in a slightly different framework that probably would’ve been easier to implement in GenMesh, but I realized that too late. I’ll spare you the details – suffice to say this was my first (but by no means last) contact with the fact that being able to understand a paper and being able to implement it correctly are two different things, and one of them is much harder than the other :).
Anyway, the idea was that we’d generate rough basic geometry for kkrieger using a conventional modeler and export it. That was in fact the main reason to be implementing a polygon mesh compression scheme that was this general. But in the end we went down a very different road for level building, so this code too ended up being unused anywhere else. (Starting to see a pattern here?)
Texture generation
This code has the distinction of being the least experimental of all the things we tried in Candytron. It is also the only piece of the whole thing that survived into the Debris era without being substantially rewritten or outright replaced. Part of this was the Second System Effect (e.g. GenMesh was clearly overengineered for what we needed), but mostly it was just the result of us stepping outside the comfort zone established in our previous tools and trying to approach things differently. Most of the time it didn’t work out that well, but it wasn’t at all obvious from the outset that this would happen, and it was certainly a learning experience.
Anyway, on to the actual texture generator. Like the RG2 texture generator (but unlike the original fr-08 Generator or Werkkzeug1), GenThree uses an integer format with 16 bits per color channel, to make sure there’s enough precision headroom even after several stages of color adjustment and layer composition. Though we store 16 bits per channel, we actually only use 15 bits – a compromise to navigate the odd set of MMX instructions available at the time: note that PMULHUW was only added with the Pentium 3, and there’s no unsigned version of PMADDWD – dealing with unsigned 16-bit quantities was simply awkward. We also allow ourselves to be slightly sloppy with regards to rounding and such, since we have enough extra bits not to care too deeply about what goes on in the least-significant bits. This makes the code somewhat simpler, though in the grand scheme of things, it probably didn’t matter. Finally, this representation means that a single ARGB pixel fits inside a MMX register, and there’s no need to do any unpacking or packing to do multiplies (MMX only provides 16-bit multiplies).
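A scalar sketch of the channel convention (assuming 0x7fff means full intensity – the real code is MMX assembly, this just shows the arithmetic): keeping the top bit clear means channel values are always non-negative as *signed* 16-bit ints, so the signed multiplies MMX actually offers can be used without fixups:

```cpp
#include <cstdint>

// Assumed convention for this sketch: 0 = black, 0x7fff = full intensity.
static const int16_t kFull = 0x7fff;

// Modulate two channels; the >> 15 maps the 30-bit product back into
// the 15-bit range, without bothering to round.
inline int16_t MulChannel(int16_t a, int16_t b) {
  return (int16_t)(((int32_t)a * b) >> 15);
}
```

Note that full times full comes out as 0x7ffe rather than 0x7fff – exactly the kind of least-significant-bit sloppiness that the extra headroom lets the code not care about.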
As you might have noticed by now, all of these decisions were made with an eye towards reasonably fast and simple implementations of the basic operations using MMX. It was all designed long before using GPUs for texture generation was a serious option: when GenThree was written, we were already using D3D9 (brand new at the time!), but we were using the fixed-function pipe – cards with shader support were still quite new and not very wide-spread. And when RG2 (which made many of these original design decisions) was written, there wasn’t any programmable HW in the PC market, period.
So you would expect that there’s lots of MMX code in the texture generator. And you would be completely right. At the time, code generation for intrinsics was simply dreadful, so it’s mostly inline assembly too. The code itself is not terribly interesting – it does what you’d expect, and this was from before we did any tricky optimizations. It’s probably worth looking at the Blur filter, however; it uses the old (as in, OLD) but still awesome rendering trick of factoring triangle filters and Gaussian blurs as iterated box filters: 2x box gives you a triangle, and 3x box gives you a uniform B-spline that is within a few percent of a true Gaussian but much cheaper to compute. The nice bit is that box filters are really simple to do fast – for every pixel, you scroll the “window” by one pixel, meaning one pixel on the left “drops out” and one pixel on the right “comes in”, while the rest of the sum stays the same. This is very cheap to perform incrementally, and Bitmap_Blur has a straightforward implementation of the idea. Look closely – by the time you see it next in this repository, namely in Werkkzeug3, it will support non-integer blur kernel widths and be all MMX’ed up and quite hard to follow :) (It’s the subject of its own upcoming blog post; go figure).
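Here is a scalar 1D sketch of that incremental box filter (integer radius, clamped edges – the real Bitmap_Blur handles the separable 2D case and differs in the details):

```cpp
#include <algorithm>
#include <vector>

// One 1D box-filter pass, sliding-window style: per output pixel, one
// add and one subtract, independent of the kernel width.
std::vector<int> BoxBlur1D(const std::vector<int>& src, int radius) {
  const int n = (int)src.size();
  const int width = 2 * radius + 1;
  std::vector<int> dst(n);

  // Prime the window for pixel 0 (clamping at the left edge).
  int sum = 0;
  for (int i = -radius; i <= radius; i++)
    sum += src[std::max(0, std::min(n - 1, i))];

  for (int x = 0; x < n; x++) {
    dst[x] = sum / width;
    int in  = std::min(n - 1, x + radius + 1); // pixel entering on the right
    int out = std::max(0,     x - radius);     // pixel dropping out on the left
    sum += src[in] - src[out];
  }
  return dst;
}
```

Run three such passes in each direction and you’re within a few percent of a Gaussian, per the B-spline factoring described above.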
Lights, Camera, Action!
Lighting, material, camera and scene description all happen in the same file, genmaterial.cpp. All of these are objects, so they have a description as a data structure, but in GenThree these are quite basic. A scene is a list of meshes with associated transform matrices. A camera is just a matrix with a few extra parameters describing FOV and such. A light is also just a bag of values that gets passed along to D3D. A material contains both D3D “material” parameters (which influence lighting) and the textures and render states used at the time. States are generally collected once at initialization time and compiled into a list (for faster state caching), but other than that it’s pretty simple too. The most important point here is simply that an explicit material representation describing *all* the state sent to the API exists at all. In a heavily data-driven environment like our tools, that is just the natural way to handle things; that it also happens to be efficient is a nice bonus.
The most interesting part of genmaterial has in fact nothing to do with materials at all; for reasons that I don’t remember, part of the GenMesh implementation is in here, namely the part that actually converts GenMeshes to Vertex/Index buffers. Sometime in the middle of kkrieger development, all of the mesh preparation, lighting/material and multipass management stuff got pulled out into a separate module, engine.cpp, where it’s resided ever since. Of course, when we wrote Candytron, all of this code was much simpler, so there you go.
There are two interesting bits worth noting here. First, note how we handle vertex/index buffers: GenThree introduced GeoBuffers, which bundle a vertex buffer with an associated index buffer into a neat little package. This model (in one form or another) has been with us ever since – you really want to treat the two as a unit most of the time. The system-level code (in _startdx.cpp) handles all the memory management – pretty sweet. There’s also a special GeoBuffer with a static index buffer (describing a list of quads) that’s used to render, well, quads – for GUI, particles and such. It makes sense to have this as part of your rendering abstraction; quads are common, and having a simple way to render them just makes sense. Also, if you make them a thing at the system level, it’s trivial to adapt to targets that natively support non-indexed quad lists.
The second part is that the mesh preparation code supports several different modes (or “programs”, as they’re called in the code): There’s MPP_STATIC and MPP_DYNAMIC, which are fairly obvious (allocate to dynamic/static vertex buffer please); more interestingly, there are also the “sprites”, “trees”, “thick lines”, “outlines”, and “finns” modes, which are also extensively used in Candytron. All of these represent different ways to turn a given mesh into a vertex and index buffer. The same mechanism was used in kkrieger to generate input data for shadow volume extrusion (note that while there is a MPP_SHADOW there, that’s a different thing than what we did in kkrieger).
That’s all, folks!
There’s more stuff in there, but this covers what I think are the most interesting bits. If there’s questions or something is unclear, don’t hesitate, just ask!
As mentioned in the introduction to this series, our various demo tools contain a lot of cool ideas, but also a lot of bad ones that didn’t work out, and the code has suffered as a result. Nevertheless, the code makes for an interesting historical document if nothing else, and we (Farbrausch), after some deliberation, decided to just open it up – something we’ve been planning to do for ages, but were never able to pull off for various legal reasons. Well, all of those problems have gone away over the past year, so there was no good reason left to not publish the source.
So here it is: https://github.com/farbrausch/fr_public.
We decided to go with very permissive licenses: a lot of the code was released into the public domain, the rest is BSD-licensed, and all of the (currently uploaded) content is released under Creative Commons CC-BY or CC-BY-SA licenses. So if you want to play around with it, go ahead! Also, we will be adding both more demos and code as soon as we’re able – re-licensing these things means we need to acquire permission from all the original authors, which takes a few days to sort out. Similarly, some of the code we’re planning to release contains optional components encumbered with third-party rights that we need to get rid of before we can make it public.
Finally, there’s some ongoing “reconstruction” work, too: the original code was written for various compiler versions etc., and some depended on old library versions that are no longer available. A separate branch of the repository (“vs2010”) contains modified versions of the original code that should compile with VS2010.
And what about this series? Don’t worry, the point I made in the first post of this series remains valid: Source code is a good way of communicating cookbook recipes, but a bad way of describing the underlying ideas. So this series will continue, only from now on, you’ll be able to cross-check against the actual source code and notice all the little white lies I’m telling to make things easier to follow or understand :)
UPDATE: Turns out Memon just decided to release source code for Demopaja, his demo-tool, too! The more the merrier.
Here it is, the conclusion to my half-edge based mesh generator miniseries (the data structure part, at least). The first two parts covered the data structures used in GenMesh, the “old” mesh generator. In this part, I’ll be talking about GenMinMesh, the successor to GenMesh which (as the name suggests) was designed to be much more minimal and more suitable for 64ks. As described in the last post, GenMesh tries to build everything out of a small set of primitive operations. The advantage is that the actual data structure manipulation and its associated complexity is concentrated into a few functions. Most of the remaining code defers the data structure management to those few functions; what remains is code to set up and modify vertex positions and attributes – the part that you’ll find in any mesh generator. The problem is that anything that uses the mesh generator at all will end up referencing most of that basic functionality, even if you only ever use (say) one type of primitive. Not a big deal for kkrieger (which after all used most of the code we had at the time) and Debris (where we didn’t care at the time), but it was an issue for 64ks (not that we seriously pursued any 64k projects after Candytron). Also, GenMesh was fairly slow (mostly because every operation looped over the whole mesh to keep things simple, which meant all the repeated-extrusion stuff hurt on big meshes), and it was painful to deal with operations that resulted in meshes that weren’t necessarily 2-manifold (or had them as temporaries).
Hence GenMinMesh. And to get it out of the way, GenMinMesh is not actually a half-edge based mesh generator. It’s organized around conventional face and vertex structures (slightly more sophisticated than a plain index buffer, but not by much), but knows how to produce half-edge adjacency information for the operations that need it. The data structures are simpler, there’s way less pointer chasing, it’s easy to create a mesh, and there might even be unicorns and rainbows involved! (Not really). So let’s get started.
Back to the basics
We’re going to sacrifice some flexibility and go to a structure with a maximum number of edges per face – allowing us to make the core data structures a lot more straightforward:
static const int KMM_MAXUV   = 2; // max number of UVs per vertex
static const int KMM_MAXBONE = 4; // max bones per vertex
static const int KMM_MAXVERT = 8; // max vertices per face

struct GenMinVert // a vertex
{
  bool  Select;                 // selected or not
  U8    BoneCount;              // how many bones in use?
  U16   Temp;                   // mostly padding

  Vec3  Pos;
  Vec3  Normal;
  Vec3  Tangent;
  Vec2  UV[KMM_MAXUV];

  float Weight[KMM_MAXBONE];    // bone weights
  U16   Matrix[KMM_MAXBONE];    // bone matrix index
};

struct GenMinFace // a face
{
  bool  Select;                 // selected or not
  U8    Count;                  // edge count
  U16   Material;               // material index, 0=deleted
  Vec3  Normal;                 // face normal
  int   Vertices[KMM_MAXVERT];  // vertex indices
  int   Opposite[KMM_MAXVERT];  // adjacency - see below
};
Note that this time, the vertices don’t just describe connectivity, they have payload; in fact, they’re almost all payload. GenMesh was designed to support variable vertex formats. As a result, there’s basically no geometric information in the vertex structure – there are separate arrays for positions, normals, UV coordinates etc., all of which are indexed by the vertex number. If the mesh needs a second UV set, allocate space for that array, otherwise it’s just a null pointer. This was flexible, but we never really ended up using that many different vertex formats, so this time we just opted to go for a fixed fat vertex layout that had everything we might conceivably use. This might still get converted into different vertex formats later down in the pipeline (where it’s easy), but the mesh processing code doesn’t have to be aware of it.
Why a maximum of 8 vertices per face? Well, that number clearly has to be at least 3 (triangles), and since we still want Catmull-Clark subdivision it should really be at least 4 (besides, almost all our primitives are mostly made of quads). What about larger faces? For a planar graph (or any other graph with Euler characteristic 2, which happens to include all meshes isomorphic to a sphere, i.e. connected 2-manifolds without holes) where each vertex has a minimum degree of 3 (i.e. it’s not just made of lines), a well-known fact is that the average face must have fewer than 6 edges (this can be derived from the Euler formula with some algebra), and it can’t have fewer than 3. So if we make the limit 6 (or higher), we know that higher-degree faces are going to be rare in any mesh we’re likely to encounter – we could either special-case them or make sure they’re never created in the first place. GenMinMesh opts for the latter, for simplicity (the actual limit is 8 because, well, 6 isn’t a power of 2…).
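For reference, the “some algebra” goes like this (with V, E, F the vertex, edge and face counts):

```latex
V - E + F = 2
% every vertex has degree \ge 3 and every edge has two endpoints:
2E \ge 3V \;\Longrightarrow\; V \le \tfrac{2}{3}E
% substituting into the Euler formula:
2 = V - E + F \le \tfrac{2}{3}E - E + F \;\Longrightarrow\; E \le 3F - 6
% every edge borders two faces, so the average face degree is:
\frac{2E}{F} \le \frac{6F - 12}{F} = 6 - \frac{12}{F} < 6
```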
So where’s the half-edges?
One prominent thing I haven’t mentioned so far is the half-edges. Indeed, what I’ve described so far is a classic face-vertex structure with no edge information at all. Don’t worry, the connectivity information is still there, but it’s a lot more implicit than in the previous two posts.
Remember that a half-edge is a directed edge between two vertices that belongs to a single face. Now, in a face-vertex structure, we store indices of all the vertices belonging to a single face, in order. The half-edges making up a face are implied to be between subsequent vertices in that list. In fact, if we specify both a face and a position in the ordered vertex list, we can uniquely identify any given half-edge: the pair (face, slot) denotes the half-edge from face.Vertices[slot] to face.Vertices[(slot + 1) % face.Count] (indices are taken mod face.Count to handle wrap-around). For convenience, let’s use the fact that we limit our polygons to a maximum of 8 edges to pack that face/slot pair into a single integer: e = face_id*8 + slot (the half-edge ID). Now, let’s have a look at the half-edge diagram from the first part one more time, and check how much of the connectivity information we have covered at this point:
- e itself: Check – we can uniquely reference each half-edge in the mesh.
- start(e): Check – this is just face.Vertices[slot] (and as noted in the first part, we only need to store one of the two vertex indices for each half-edge, so we’re good).
- face(e): Check – since e is just a face/slot pair, we can again recover this one easily.
- next(e): We just need to move over to the next slot in the same face – again, with wrap-around. So next(e) = next((face, slot)) = (face, (slot + 1) % face.Count).
- prev(e): This time we decrement instead of incrementing, but otherwise it’s the same.
- opposite(e): Aha, this one we don’t have, and there’s no fast (i.e. non-“search everything”) way to recover it from just the face and vertex lists. And there’s this ominous-looking Opposite array in our face structure. The plot thickens.
- (Not pictured) faceEdge(f): For a given face, it’s trivial to name a half-edge that belongs to it: (f, 0) (slot 0 for the given face) should do.
So as you can see, we already have most of the connectivity information we care about; all we need to do is to keep track of the opposite half-edges explicitly, the rest we can just infer from the face-vertex information. As described above, it’s easy to pack a (face, slot) pair into a single integer, which are precisely the integers we store in the Opposite array. Note that, in a sense, this layout is dual to the data structure described in the previous part: there we stored opposite half-edges together but needed explicit links for everything else, here it’s precisely the other way round.
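To make the packing concrete, here’s a small Python sketch of the (face, slot) scheme and the next/prev rules derived above. The helper names are mine, not from the GenMinMesh source:

```python
MAX_VERTS = 8  # KMM_MAXVERT: at most 8 vertices per face, so a slot fits in 3 bits

def pack(face_id, slot):
    # e = face_id * 8 + slot, as described in the text
    return face_id * MAX_VERTS + slot

def face_of(e):
    return e // MAX_VERTS

def slot_of(e):
    return e % MAX_VERTS

def next_he(e, face_count):
    # advance to the next slot within the same face, with wrap-around
    return pack(face_of(e), (slot_of(e) + 1) % face_count)

def prev_he(e, face_count):
    # same idea, decrementing instead of incrementing
    return pack(face_of(e), (slot_of(e) - 1) % face_count)
```

Since 8 is a power of two, the divide and modulo compile down to a shift and a mask, which is the point of rounding the limit up from 6.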
Throw-away connectivity
A problem with the standard half-edge representation is that some basic operations (the most notorious one would probably be adding a single triangle to a mesh) end up being tricky with lots of different cases to consider. The nice thing about the face-vertex structure described above is that faces are still easy to manipulate (in particular, adding new faces is easy). The problem is that operations such as adding a face in the middle of the face list (or a vertex in the middle of a face) will affect multiple edge IDs, effectively breaking existing links.
The way GenMinMesh resolves this (and related) problems is quite simple. Opposite information is not guaranteed to be current; operators might or might not update it. However, there is a single boolean for the whole mesh denoting whether connectivity information is current. All operations that break links clear this flag. Now, if an operation that needs full connectivity (such as mesh extrusion) gets an input mesh that’s not up-to-date, connectivity information gets rebuilt (this boils down to setting all Opposite links correctly).
The assumption is that, when in doubt, the list of faces and the per-face vertex lists are always right. To rebuild connectivity, we first build a list of all half-edges in the mesh in a temporary array. We want to match up opposite pairs; to make this work, we store not just the corresponding half-edge ID but also the IDs of both vertices connected by the half edge, and (arbitrarily) define that the first vertex has to have a lower ID than the second vertex. We then sort the whole list lexicographically by those vertex IDs – any decent sorting algorithm will do. After the sorting process, opposite half-edges will be adjacent in the list, which makes the process of pairing up neighbors straightforward. In pseudocode: (a bit Python-y this time because my usual C-like pseudocode is poorly suited to express data structure iteration concisely)
half_edges = []
for face in (mesh.faces) {
    for e in (face.half_edges) {
        half_edges.add(edge_desc(id=e.id,
                                 v0=min(start(e), end(e)),
                                 v1=max(start(e), end(e))))
    }
}
half_edges.sort(keys=v0, v1)
for i in (1..len(half_edges)) {
    prev = half_edges[i-1]
    curr = half_edges[i]
    if (curr.v0 == prev.v0 && curr.v1 == prev.v1)
        connect_halfedges(prev.id, curr.id)
}
Regenerating connectivity information is thus about as expensive as sorting the list of half-edges – pretty cheap on current machines, on the kind of mesh sizes you’re likely to use for real-time rendering anyway. Now of course you could alternate between operations that destroy connectivity information and those that need connectivity information, meaning that this process gets repeated several times for a mesh. If that kind of thing is a serious concern, it’s a better idea to just make sure the connectivity is up to date all the time (i.e. use a regular half-edge representation). But in our case most operations are happy just messing around with the faces or vertices directly, and the simpler data structures greatly help to reduce code complexity and size. An extra advantage is that we don’t have to strictly enforce 2-manifoldness. We can represent meshes that aren’t 2-manifold just fine (we just don’t always answer connectivity queries correctly, but that’s a lot better than rejecting them outright), and it’s also relatively easy to cope with boundaries: boundary edges are just edges that don’t have any matching partner – I use “-1” as opposite half-edge ID to encode this, which is the main reason for my edge IDs being signed.
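For reference, here is a runnable Python version of that rebuild step. The representation (each face as a plain list of vertex indices, with a parallel per-face "opposite" list) is my stand-in for illustration, not the actual GenMinMesh layout:

```python
MAX_VERTS = 8  # max vertices per face; half-edge ID = face_id * 8 + slot

def rebuild_connectivity(faces):
    # faces: list of faces, each a list of vertex indices in order.
    # Returns opposite[face][slot] = packed opposite half-edge ID,
    # or -1 for boundary edges (no matching partner).
    opposite = [[-1] * len(f) for f in faces]
    half_edges = []
    for fi, f in enumerate(faces):
        for slot in range(len(f)):
            v0, v1 = f[slot], f[(slot + 1) % len(f)]
            # store the vertex pair in canonical (sorted) order so the two
            # half-edges of one edge end up adjacent after sorting
            half_edges.append((min(v0, v1), max(v0, v1), fi * MAX_VERTS + slot))
    half_edges.sort()
    for i in range(1, len(half_edges)):
        p, c = half_edges[i - 1], half_edges[i]
        if p[0] == c[0] and p[1] == c[1]:  # same vertex pair: opposite pair
            opposite[p[2] // MAX_VERTS][p[2] % MAX_VERTS] = c[2]
            opposite[c[2] // MAX_VERTS][c[2] % MAX_VERTS] = p[2]
    return opposite

# Two triangles sharing the edge (1, 2); all other edges are boundary:
opp = rebuild_connectivity([[0, 1, 2], [2, 1, 3]])
```

Note that for a non-manifold edge with three or more incident half-edges, this simple pairing just links some of them up arbitrarily, which matches the "we don’t always answer connectivity queries correctly" caveat above.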
The discontinuities strike back
Looks pretty reasonable, right? There’s just a small problem: as explained in the introduction, this time we’re using “fat vertices”, and there’s no concept of crease edges, wedges or anything like that. It’s just a list of vertices. And along a seam, we’ll have one vertex on every side. Which means that the procedure above will produce disconnected islands with boundary edges on both sides – definitely not what we want. So how do we fix it?
Again, the solution is low-tech and a bit limited, but practical in the 64k context: we don’t use “raw” vertex IDs when building connectivity information. Instead, we first do a small preprocess: we insert all vertices into a hash table, with the hash code being computed from the position (no fancy geometric hashing – this is using the raw bits). All vertices that are at the exact same position should count as the same vertex for the purpose of generating connectivity. So we compute a simple array that is used to remap vertex IDs prior to building the half-edge list above:
for v in (vertex_ids) {
    if (!hash_table.find(v.pos))
        hash_table[v.pos] = v.id
    remap[v.id] = hash_table[v.pos]
}
// now build "Opposite" array using remap[v.id]
// instead of raw v.id
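A runnable version of this remap preprocess might look like the following, with a Python dict standing in for the hash table and exact position tuples as keys (so hashing really is on the raw values, no geometric tolerance):

```python
def build_remap(positions):
    # positions: list of (x, y, z) tuples, indexed by vertex ID.
    # remap[v] = ID of the first vertex at exactly the same position.
    first_at = {}
    remap = [0] * len(positions)
    for vid, pos in enumerate(positions):
        if pos not in first_at:
            first_at[pos] = vid  # first vertex seen at this position wins
        remap[vid] = first_at[pos]
    return remap

# Vertices 1 and 2 are duplicates along a seam (same position, presumably
# different UVs); both get remapped to vertex 1.
remap = build_remap([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```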
The same trick can be used for other attributes, too. For example, there’s usually the issue of smoothing groups: you want to assign smooth normals in smooth areas, but still be able to have a hard-edged cube and things like that. If we don’t do anything special, our current code will always err on the side of producing too many hard edges: whenever a vertex is duplicated, we will only incorporate normal information from the faces that directly reference it – even if that vertex is just on a UV seam in a smooth area. But we can use the same “remap” idea here: this time, hash based on the original position and the original (and now stale, but still present!) vertex normal. Each vertex then accumulates incident face normals into (and takes its final vertex normal from) the remapped vertex.
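Here’s a sketch of how that normal-smoothing pass could look (triangles only, for brevity); all names and the unnormalized cross-product face normal are my choices, not the original code:

```python
import math

def smooth_normals(verts, faces):
    # verts: list of ((x,y,z) position, (nx,ny,nz) stale normal) pairs;
    # faces: lists of vertex IDs (triangles here).
    first_at, remap = {}, [0] * len(verts)
    for vid, key in enumerate(verts):          # hash on position + stale normal
        remap[vid] = first_at.setdefault(key, vid)
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    accum = [[0.0, 0.0, 0.0] for _ in verts]
    for f in faces:
        p0, p1, p2 = (verts[v][0] for v in f)
        e1, e2 = sub(p1, p0), sub(p2, p0)
        n = (e1[1]*e2[2] - e1[2]*e2[1],        # unnormalized face normal
             e1[2]*e2[0] - e1[0]*e2[2],
             e1[0]*e2[1] - e1[1]*e2[0])
        for v in f:                            # accumulate into the remapped vert
            for i in range(3):
                accum[remap[v]][i] += n[i]
    out = []
    for vid in range(len(verts)):              # take the result from the remapped vert
        n = accum[remap[vid]]
        l = math.sqrt(sum(c * c for c in n)) or 1.0
        out.append(tuple(c / l for c in n))
    return out
```

A duplicated seam vertex with the same position and stale normal as its twin ends up with the fully smoothed normal, even though each copy is referenced by only some of the faces.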
Instead of explicitly keeping track of where discontinuities are supposed to go, as in the previous two articles, we just define a discontinuity for an attribute as “a place where there are currently two different values for the same attribute”. If you want to add a discontinuity – just add a new vertex in the same place and modify the attribute. Want to remove it? Simply make sure all vertices in that location have the same value for whatever attribute you’re interested in, and the discontinuity’s gone. Again, it’s far more implicit than all the explicit bookkeeping we did last time, and it felt a bit weird when we first did it, coming from all the data structure overengineering that happened during GenMesh development (I only described the end result, not the many weird intermediate stages). But it’s simple, it’s not much code, and most importantly of all, it’s not in the way. With GenMinMesh, if you want to generate a mesh, you just do it; there’s no planning involved, and no tricky gymnastics to figure out how to most efficiently generate the topology.
Conclusion
Okay, so in this miniseries, we saw two different mesh generators with two very different underlying data structures and philosophies, embodying opposite ends of the same spectrum. They’re both valid, and both have their pros and cons – and both, I think, are worth describing. In both cases, the design is constrained by the desire to keep the code small, which leads to some trade-offs that make the code simpler but slower. In GenMinMesh, several operations that could easily generate Opposite information on the fly don’t; the code to rebuild it is already going to be there (since the input might have broken adjacency), and rebuilding it is fast enough not to worry about it for the mesh sizes we actually use. Also, both generators have very simple selection mechanisms (essentially a single bool per selectable element) which are compact but lead to inefficient execution patterns, since every operation traverses the whole mesh, even if only 1 face or 5 vertices are selected.
This particular way of augmenting face-vertex structures with adjacency information wasn’t new when I originally implemented it for GenMinMesh, and it’s certainly not now. Directed Edges describes a very similar data structure (but aimed exclusively at triangle meshes), and I remember seeing a version of that idea in Pierre Terdiman’s Mesh Consolidation code years ago, possibly before the “Directed Edges” paper was published. It’s also used by the D3DXMesh code (note all these variants deal with triangle meshes only, though). Despite all this, the technique is not nearly as well-known as it should be. I hope this article helps rectify that a bit.
If there’s any bits that are either missing or weren’t explained properly, feel free to comment. But unless there’s major holes, I expect to close this topic now and move on to other stuff :)
It all started a few days ago with this tweet by Ignacio Castaño – complaining about the speed of the standard library half_to_float in ISPC. Tom then replied that this was hard to do well in code without dedicated HW support – then immediately followed it up with an idea of how you might do it. I love this kind of puzzle (and obviously follow both Ignacio and Tom), so I jumped in and offered to write some code. Tom’s initial approach was a dead end, but it turned out that it was in fact possible to do a pretty decent job using standard SSE2 instructions (available anywhere from Pentium 4 onwards) with no dedicated hardware support at all; that said, dedicated HW support will exist in Intel processors starting from “Ivy Bridge”, due later this year. My solutions involve some fun but not immediately obvious bit-twiddling on floats, so I figured I should take the time to write it up. If you’re impatient and just want the code, knock yourself out. Actually you might want to open that up anyway; I’ll copy over key passages but not all the details. The code was written for x86 machines using Little Endian, but it doesn’t use any particularly esoteric features, so it should be easily adapted for BE architectures or different vector instruction sets.
What’s in a float?
If you don’t know or care how floating-point numbers are represented, this is not a useful article for you to read, because from here on out it will deal almost exclusively with various storage format details. If you don’t know how floats work but would like to learn, a good place to start is Bruce Dawson’s article on the subject. If you do know how floats work in principle but don’t remember the storage details of IEEE formats, make a quick stop at Wikipedia and read up on the layout of single-precision and half-precision floats. Still here? Okay, great, let’s get started then!
Half to float basics
Converting between the different float formats correctly is mostly about making sure you catch all the important cases and map them properly. So let’s make a list of all the different classes a floating point number can fall into:
- Normalized numbers – the ones where the exponent bits are neither all-0 nor all-1. This is the majority of all floats. Single-precision floats have both a larger exponent range and more mantissa bits than half-precision floats, so converting normalized halfs is easy: just add a bunch of 0 bits at the end of the mantissa (a plain left shift on the integer representation) and adjust the exponent accordingly.
- Zero (exponent and mantissa both 0) should map to zero – easy.
- Denormalized numbers (exponent zero, mantissa nonzero). Values that are denormals in half-precision map to regular normalized numbers in single precision, so they need to be renormalized during conversion.
- Infinities (exponent all-ones, mantissa zero) map to infinities.
- Finally, NaNs (exponent all-ones, mantissa nonzero) should map to NaNs. There’s often an extra distinction between different types of NaNs, like quiet vs. signaling NaNs. It’s nice to preserve these semantics when possible, but in principle anything that maps NaNs to NaNs is acceptable.
Then there’s also the matter of the sign bit. There are two slightly weird cases here (signed zero and signed NaNs), but as far as conversions go you’ll never go wrong if you just preserve the sign bit on conversion, so that’s what we’ll do. If you just work through these cases one by one, you get something like half_to_float_full in the code:
static FP32 half_to_float_full(FP16 h)
{
    FP32 o = { 0 };

    // From ISPC ref code
    if (h.Exponent == 0 && h.Mantissa == 0) // (Signed) zero
        o.Sign = h.Sign;
    else
    {
        if (h.Exponent == 0) // Denormal (converts to normalized)
        {
            // Adjust mantissa so it's normalized (and keep
            // track of exponent adjustment)
            int e = -1;
            uint m = h.Mantissa;
            do
            {
                e++;
                m <<= 1;
            } while ((m & 0x400) == 0);

            o.Mantissa = (m & 0x3ff) << 13;
            o.Exponent = 127 - 15 - e;
            o.Sign = h.Sign;
        }
        else if (h.Exponent == 0x1f) // Inf/NaN
        {
            // NOTE: Both can be handled with same code path
            // since we just pass through mantissa bits.
            o.Mantissa = h.Mantissa << 13;
            o.Exponent = 255;
            o.Sign = h.Sign;
        }
        else // Normalized number
        {
            o.Mantissa = h.Mantissa << 13;
            o.Exponent = 127 - 15 + h.Exponent;
            o.Sign = h.Sign;
        }
    }

    return o;
}
Note how this code handles NaNs and infinities with the same code path; despite having a different interpretation, the actual work that needs to happen in both cases is exactly the same.
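As a cross-check, here is the same case analysis transliterated into Python (this port is mine, not from the original code), operating on the raw 16-bit pattern instead of the FP16 bitfields. Python’s struct module understands binary16, so it can serve as a reference:

```python
import struct

def half_to_float_full(h):
    # h: raw 16-bit half-float pattern as an int.
    sign = (h >> 15) & 1
    exponent = (h >> 10) & 0x1f
    mantissa = h & 0x3ff
    if exponent == 0 and mantissa == 0:        # (signed) zero
        bits = sign << 31
    elif exponent == 0:                        # denormal: renormalize
        e = -1
        m = mantissa
        while (m & 0x400) == 0:                # shift until implicit bit appears
            e += 1
            m <<= 1
        bits = (sign << 31) | ((127 - 15 - e) << 23) | ((m & 0x3ff) << 13)
    elif exponent == 0x1f:                     # Inf/NaN: pass mantissa through
        bits = (sign << 31) | (255 << 23) | (mantissa << 13)
    else:                                      # normalized number
        bits = (sign << 31) | ((127 - 15 + exponent) << 23) | (mantissa << 13)
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```

Half-to-float conversion is exact, so the result should match Python’s own binary16 decoding bit for bit on every finite input.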
Dealing with denormals
Clearly, the ugliest part of the whole thing is the handling of denormals. Can’t we do any better than that? Turns out we can. After all, the whole point of denormals is that they are not normalized; in other words, they’re just a scaled integer (fixed-point) representation of a fairly small number. For half-precision floats, they represent Mantissa * 2^(-24). If you’re on one of the architectures with a “convert integer to float” instruction that can scale by an arbitrary power of 2 along the way, you can handle this case with a single instruction. Otherwise, you can either use regular integer→float conversion followed by a multiply to scale the value properly or use a “magic number” based conversion (if you don’t know what that is, check out Chris Hecker’s old GDMag article on the subject). Either way, zero conveniently fits the same pattern: its mantissa bits are all 0 too, so treating it as a denormal produces the right answer. In short, same as with NaNs and infinities, we can funnel zero and denormals through the same code path. This leaves just three cases to take care of: normal, denormal, or NaN/infinity. half_to_float_fast2 is an implementation of this approach:
static FP32 half_to_float_fast2(FP16 h)
{
    static const FP32 magic = { 126 << 23 };
    FP32 o;

    if (h.Exponent == 0) // Zero / Denormal
    {
        o.u = magic.u + h.Mantissa;
        o.f -= magic.f;
    }
    else
    {
        o.Mantissa = h.Mantissa << 13;
        if (h.Exponent == 0x1f) // Inf/NaN
            o.Exponent = 255;
        else
            o.Exponent = 127 - 15 + h.Exponent;
    }

    o.Sign = h.Sign;
    return o;
}
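To see the magic-number branch in action, here’s a small Python check of the zero/denormal path (struct reinterpretation stands in for the FP32 union; the function names are mine):

```python
import struct

def bits_to_float(u):
    # reinterpret a 32-bit pattern as a float, like reading FP32.f after .u
    return struct.unpack('<f', struct.pack('<I', u))[0]

MAGIC_BITS = 126 << 23          # the float 0.5 (exponent 126, mantissa 0)
MAGIC = bits_to_float(MAGIC_BITS)

def half_denormal_to_float(mantissa):
    # mantissa: the 10 mantissa bits of a half with exponent 0.
    # Adding the mantissa to the magic constant's bits yields the float
    # 0.5 + mantissa * 2^-24; subtracting 0.5 leaves mantissa * 2^-24,
    # which is exactly the value of the half denormal.
    return bits_to_float(MAGIC_BITS + mantissa) - MAGIC
```

The subtraction is exact (both operands are close enough in magnitude), so there is no rounding anywhere in this path.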
Variants 3, 4 and 4b all use this same underlying idea; they’re slightly different implementations, but nothing major (and the SSE2 versions are fairly straight translations of the corresponding scalar variants). Variant 5, however, uses a very different approach that reduces the number of distinct cases to handle from three down to two.
A different method with other applications
Both single- and half-precision floats have denormals at the bottom of the exponent range. Other than the exact location of that “bottom”, they work exactly the same way. The idea, then, is to translate denormal halfs into denormal floats, and let the floating-point hardware deal with the rest (provided it supports denormals efficiently, that is). Essentially, all we need to do is to shift the input half by the difference in the number of mantissa bits (13, as already seen above). This will map half-denormals to float-denormals and normalized halfs to normalized floats. The only problem is that all numbers converted this way will end up too small by a fixed factor that depends on the difference between the exponent biases: in this case, they need to be scaled up by 2^112 (the biases differ by 127 - 15 = 112). That’s easily fixed with a single multiply. This reduces the number of fundamentally different cases from three to two: we still need to dedicate some work to handling infinities and NaNs, but that’s it. The code looks like this:
static FP32 half_to_float_fast5(FP16 h)
{
    static const FP32 magic = { (254 - 15) << 23 };
    static const FP32 was_infnan = { (127 + 16) << 23 };
    FP32 o;

    o.u = (h.u & 0x7fff) << 13;     // exponent/mantissa bits
    o.f *= magic.f;                 // exponent adjust
    if (o.f >= was_infnan.f)        // make sure Inf/NaN survive
        o.u |= 255 << 23;
    o.u |= (h.u & 0x8000) << 16;    // sign bit
    return o;
}
This is not the fastest way to do the conversion, because it leans on HW to deal with denormals should they arise (something that floating-point HW tends to be quite slow at), but it’s definitely the slickest out of this bunch.
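Transliterated to Python for verification (again my port; struct does the bit reinterpretation and provides binary16 reference values):

```python
import struct

def f2b(f): return struct.unpack('<I', struct.pack('<f', f))[0]
def b2f(u): return struct.unpack('<f', struct.pack('<I', u))[0]

MAGIC = b2f((254 - 15) << 23)       # 2^112: fixes up the exponent bias
WAS_INFNAN = b2f((127 + 16) << 23)  # 2^16: anything >= this was Inf/NaN as a half

def half_to_float_fast5(h):
    u = (h & 0x7fff) << 13          # exponent/mantissa bits, shifted up
    f = b2f(u) * MAGIC              # exponent adjust (denormals come out right too)
    u = f2b(f)
    if f >= WAS_INFNAN:             # make sure Inf/NaN survive
        u |= 255 << 23
    u |= (h & 0x8000) << 16         # sign bit
    return b2f(u)
```

Note how half-denormal inputs land in the float-denormal range after the shift and are fixed up by the very same multiply as everything else; only Inf/NaN need the extra compare.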
More importantly, unlike the other variants mentioned, this basic approach also works in the opposite direction (converting from floats to halfs), which in turn inspired this code. There’s a bit of extra work to ensure there’s no unintended double rounding and to handle NaNs and overflows correctly, but it’s still the same idea. Which goes to show – sometimes making things as simple as possible really does have rewards beyond the pure intellectual satisfaction. :)
And that’s it. I admit that play-by-play narration of source code isn’t particularly exciting, but in this case I thought the code itself was interesting and short enough to give it a shot. And now back to our regularly scheduled posts :)
This post is part of the series “Debris: Opening the box“.
Welcome back. Now that I’ve covered the underlying ideas behind the mesh data structure used in GenMesh (the older of the two mesh generators used in “Debris”), this post will focus on the practical considerations; it turns out that most mesh operations we need can be expressed in terms of a small number of primitive transformations. But I’ll start with some practicalities of representing our data structure.
Storing connectivity
In the previous post, I’ve explained the abstract data structure, but not the actual implementation. As you hopefully remember, we ended up with a lot of pointers. The actual data structures used in GenMesh are descended from that representation, but slightly different in various ways:
- There’s some extra fields per vertex, edge and face to keep track of user selections; this is important but obvious stuff, so I won’t spend any time describing it here.
- We don’t actually store the Vertex→Half-edge links (outgoing half-edge for every vertex) since they are only used by a small subset of all operations. Those few operations just do an extra pass to extract and cache this information; this ends up being simpler and smaller (though of course not faster) than ensuring those links stay up to date.
- All physical wedges corresponding to the same vertex are kept in a circular linked list, exactly like Tom suggested in his comment to my first article. This is in addition to the logical wedge structure described in the first part. Turns out that either the crease-based or the circular-linked version is fine, but combining both in the same data structure was a bad idea that makes certain operations unnecessarily complicated. I don’t recommend this, but this article describes the code I wrote, not the code I should have written. :)
- Instead of pointers, all vertices, faces and edges are stored in a single dynamic array (aka “vector”) in the mesh. They’re then referenced by their IDs. This has two advantages: first, it’s easy to make a secondary data structure that maps vertices, faces, edges etc. to something; because they all have sequential, dense IDs, a simple array will do. Much simpler (and also smaller and faster) than using an associative data structure keyed off pointers. Second, it turns out to be useful for the way we store edges and half-edges:
- Half-edges are paired with their opposite neighbor and stored as… edges. This also gives us a place to store per-edge information like selection and crease flags. So why is this still a half-edge structure? The crucial part is that unlike e.g. a quad-edge structure, the two half-edges making up such an edge are still directed and individually addressable. The edge with ID e is made up of the two half-edges with IDs e*2 and e*2+1, with opposite(e*2) = e*2+1 and opposite(e*2+1) = e*2 – i.e. the “opposite” operation on a half-edge ID just flips the lowest bit. One pointer per half-edge less to store, one pointer less that needs to be maintained.
- Finally, there’s also some Temp fields for every element that are used as scratch-pads by the various operators. Because you don’t want to have to spin up auxiliary arrays every single time you need to remember a value for a bit.
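In miniature, that edge/half-edge numbering works out to simple bit arithmetic (a sketch; the helper names are mine):

```python
def half_edges_of(edge_id):
    # edge e owns the two directed half-edges 2e and 2e+1
    return 2 * edge_id, 2 * edge_id + 1

def opposite(halfedge_id):
    # "opposite" is just a flip of the lowest bit
    return halfedge_id ^ 1

def edge_of(halfedge_id):
    # recover the owning edge (where crease/selection flags live)
    return halfedge_id >> 1
```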
Anyway, enough talk. Have some code so you can see what it actually looks like: (this is slightly cleaned up, i.e. I skip fields that are still in the code but not actually used in any existing code paths)
struct GenMeshElem // same for all elements of a mesh
{
    U8 Mask;     // (bitfield) 8 user selection masks/item
    bool Select; // selected or not?
};

// Note: I'm going to write "Vert" instead of "physical wedge"
// here and in the following, using the full "Vertex" when
// referring to the topological entity.
struct GenMeshVert : public GenMeshElem // Physical wedge
{
    int Next;  // Next physical wedge for this vertex (cyclic list)
    int First; // First physical wedge for this vertex
    int Temp, Temp2;
};

struct GenMeshEdge : public GenMeshElem // An edge.
{
    int Next[2]; // next(e) for both half-edges
    int Prev[2]; // prev(e)
    int Face[2]; // face(e)
    int Vert[2]; // start(e)
    int Temp[2]; // temp fields - various uses
    int Crease;  // crease flags
};

struct GenMeshFace : public GenMeshElem // face of a mesh
{
    int Material; // material index (0 if deleted face)
    int Edge;     // one outgoing edge index
    int Temp[3];  // some temp fields
};
A mesh, then, holds arrays of Verts, Faces and Edges (plus some meta-data on materials and so forth that I won’t get into here).
As alluded to in the previous article, it’s not so much about directly manipulating this data structure, because that tends to get hairy fast. Instead, most of the mesh processing code either performs read-only accesses on this structure (which is easy) or tries to express all modifications in terms of simple, local changes. In fact, most topological operations supported by GenMesh can be phrased in terms of just two primitives: vertex splits and face splits.
Vertex splits
A vertex split takes an existing vertex and splits it into two, adding a new edge between them. Some of the edges incident to the original vertex get attached to the new vertex, while the rest stay where they are. The operation looks like this:
The blue edges are the edges that get moved to the new vertex, while the black edges stay where they are. The new vertex and edge are drawn in red. The blue edges must all be consecutive, i.e. there can’t be any black “islands” in the middle of a set of blue edges. This means we can specify which edges are blue just by specifying the first and last one. As you can see, the two faces incident to the new edge gain one new edge each. If you invert such a vertex split, you get an edge collapse, the fundamental building block for a whole class of mesh simplification algorithms (GenMesh does not implement this operation, though).
We can also use the same edge as both first and last one – i.e. there is only one blue edge. This special case effectively inserts another vertex on the blue edge – no need to have a separate operation for this, pretty neat. Here’s how it looks:
But wait, there’s more! You can also have an empty set of blue edges, which means you just add a vertex to a mesh with an edge that connects it to an existing face. This face will contain both half-edges making up the edge – you first walk from the old vertex to the new vertex, and then back again. Weird, I know, and not something you want in a final mesh, but it makes for a very useful intermediate stage sometimes; it’s used by the Catmull-Clark subdivision code, for example. Again, here’s a picture:
I won’t provide any code for this (or for the remaining operations in this article); like modifying doubly-linked lists, it’s mostly an exercise in chasing down all the pointers/indices that need to be modified. You create the new pair of half-edges (one GenMeshEdge) and link them in. The code really isn’t very helpful to read – the diagrams above are much more useful to understand what’s going on. Anyway, on to the second of our building blocks.
Face splits
These do exactly what you would expect given the name, and this time there’s no tricky special cases:
This one adds both a new edge and a new face. Note that even though an edge is conceptually inserted between two vertices (and the diagram shows it that way), what you actually specify is two half-edges from the same face, because the edges are where you actually have to do the pointer manipulations. More importantly, if you do something like the degenerate no-blue-edge vertex split described above, a face can (temporarily) contain the same vertex more than once, and you have to specify which instance. A face can never contain the same half-edge twice (it’s simply impossible to represent in the data structure, since each half-edge only has one “prev” and “next” link), so the corresponding half-edge is the canonical way to identify a vertex within a face.
And that’s it. These two operations are enough to implement almost all algorithms in GenMesh that modify mesh connectivity in a non-trivial way. Which is all well and good, but how do we get a mesh in the first place?
One ring to generate them all… almost
What we need next is some primitives for our artists to start with. Well, this is the part of GenMesh I like the most: we generate almost everything out of a single primitive – “almost” everything because we also support tori, which have topological genus 1 whereas all other standard primitives have genus 0. Both vertex and face splits preserve topological genus, so tori need to be special-cased, but I’ll get to that later. Let’s start with the genus-0 primitives, all of which we synthesize from the same seed: a “ring” (that’s what it’s called in the code), which is actually a double-sided disc. Here’s an 8-sided specimen, but you can generate them with an arbitrary number of vertices:

An n-sided ring (n > 2) has n vertices and edges and two faces (the front and the back). GenMesh always generates them with the vertices describing a regular polygon in the z=0 plane (radius and rotation are configurable).
So where can we go from here? Well, it turns out that vertex and face splits are enough to implement the topological modifications underlying mesh extrusion (how exactly is the subject of its own article in this series). Starting from an n-sided ring, it’s obvious that we can generate an n-sided cylinder by just extruding along the z-axis. But what else can we do with it? Here’s a list of recipes just using rings and extrusion:
- Cylinder and tessellated cylinder: Extrude ring along z axis; repeat multiple times to get a tessellated cylinder.
- Grid: A 1×1 “grid” is just a ring with 4 sides.
- Cube: A cube is just a cylinder with a 4-sided (1×1 grid) base, really.
- Tessellated cube: Start from a tessellated cylinder with a 4-sided base. That gives you a cube tessellated along the z axis. Now select all faces on the +y side of the cube, and extrude them as often as necessary along the y axis. Rinse and repeat for x-tessellation. Voila, tessellated cube. Let it cool for a bit, then finish by adding creases along the hard edges. :)
- Sphere: You can describe a sphere as a distorted, tessellated cylinder – just start from a cylinder, then move the vertices around after you’re done. Note that this will give a sphere that has a flat face at the top and bottom, instead of single vertices at the poles. If you want a sphere with “proper” poles, see below.
- Cone: A cone is just a warped cylinder that converges to a single point. In other words, start with a tessellated cylinder again, then add one extra vertex for the top, the same way you add the poles for a sphere – again, see below.
As you can see, this covers most of the usual suspects, except for tessellated grids (the extrude trick we pulled for tessellated cubes doesn’t work there, since grids are 2D entities) and the aforementioned tori.
Those pesky poles
We ran into that slight problem above with spheres and cones where we might want to have proper poles or a sharp apex, respectively, instead of just stopping with a flat face. Not to worry – this can be fixed using our existing operations. Ultimately, what we really need to do is add a center vertex in the middle of a ring, and then add a bunch of new triangles connecting the existing sides of the face to that vertex. Using our general vertex and face splits, this is pretty easy:
Start with a degenerate vertex split (no blue edges) to insert the center point and first “spoke” edge. After that, it’s just a matter of splitting the (ever-shrinking) remainder of the original face n-1 times to insert the remaining “spokes”.
The same approach can be used to subdivide faces in general. If you advance in steps of 2 “outside” edges instead of 1 for every step, you tessellate the face using quads, not triangles. It’s pretty nifty.
And as usual, after you have the connectivity right, you can just move the vertices around to wherever you need them.
Tessellated grids
These start out from a regular 1×1 grid (aka 4-ring), but then proceed somewhat differently. Here’s the sequence:

You start out with a quad. You then split the “top” and “bottom” edges – as mentioned above, this is just a vertex split with a single blue edge. Repeat as many times as necessary to get your desired tessellation factor in x. After that, it’s just a bunch of face splits (note you need to do this on both the front and back sides!) to get an x-tessellated grid.
Repeat the same process (“rotated by 90 degrees”, except topologically all that means is you start from a different half-edge) in all of the newly created quads to tessellate the skinny vertical quads along the y axis, and you’re done.
Tori at last
This leaves just one primitive we don’t have yet: the torus. As mentioned above, a torus is genus 1 so it can’t possibly be generated from a ring using our operations, which preserve topological genus. It can, however, be generated from something very similar (in some sense, it’s actually simpler). I’m not gonna draw a picture for this one, because it needs 3D and I’m not gonna embarrass myself trying to do that with my inferior Inkscape skills :).
Anyway, your start mesh for a torus has 1 vertex, 1 face and 2 edges (one “horizontal”, one “vertical”, both connecting the vertex with itself). The face first takes the horizontal edge, then the vertical edge, then the horizontal edge backwards (i.e. opposite half-edge from the first one), then the vertical edge backwards (opposite half-edge from the second one). There we go, closed mesh, all 4 half-edges accounted for, and it really needs toroidal wrapping to be drawn correctly – it’s a torus, what did you expect? Now a 1-vertex torus isn’t exactly a very useful primitive, but you just need that one seed mesh and then do the exact same procedure as for a tessellated grid to get a proper NxM torus. After that, all you need to do is sort out the vertex coordinates :).
Wrap-up
And that’s it! You now know a bit more about the data structures used in GenMesh, and how all the primitives you might care about can be generated from a pretty tiny set of starting meshes using all of 2 topology-modifying operations. That’s probably my favorite part of the whole generator; the original implementation didn’t do it quite that systematically, and I almost went insane trying to debug it before I reduced everything to this small set of ops. It’s surprisingly powerful, and quite elegant too :).
Subdivision and Extrusion, the two main operations GenMesh was trying to support, will be covered in a separate article that I probably won’t get to for a while. Because next up is what you’ve been waiting for: Half-edges redux, the post that describes the mesh data structures used in GenMinMesh, the (somewhat) more minimal mesh generator we wrote leading up to Debris – mainly for 64ks, because GenMesh is kinda big and, as you can imagine from the above description, has lots of interdependencies that make it hard to throw out functionality selectively to save space.
Anyway, I’ll try to finish the next post a bit quicker, though really it all boils down to how quick I get those damn pictures done. Until then!
Okay, based on my quick poll here, most people would like me to write a bit about the implementations (there are two) of half-edges used in “debris”, so that’s where I’ll start. My initial plan is for this to be a three-part series (as outlined in the original post). We’ll see how that goes.
But first, let’s start with a tiny bit of history.
Earlier mesh generators
Our earlier 64ks (i.e. the stuff done with Generator or RG2) had very limited tools for mesh manipulation, most of which were a direct result of the underlying data structures: meshes were stored directly as an array of vertices plus an associated array of indices (16-bit vertex indices only, on the HW we were targeting at the time). There was exactly one vertex format used: position (3D) + normal (3D) + UV (2D), all floats, for 32 bytes/vertex. Each mesh had exactly one material (referring, at the time, mainly to the set of associated textures and render states) associated with it. This had a few advantages:
- The code is really simple and small. (Big advantage for 64ks)
- The representation is directly suitable for rendering – just stuff everything into vertex/index buffers and you’re done. Exactly one D3D DrawIndexedPrimitive call per mesh.
- Generating primitives and simple vertex deformations are easy in this representation.
But it also had a lot of serious problems:
- We were limited to indexed triangle meshes.
- Meshes with more than 65536 vertices couldn’t be correctly represented at all.
- The “one material per mesh” limitation was quite inconvenient for our artists.
- Similarly, “one batch per source mesh” combined with the above limitation made for quite high batch counts (for the time). This was usually resolved in a somewhat ad-hoc fashion by a special “merge” operator that could turn a scene back into a mesh, merging multiple batches into one if they shared the same material.
- Most importantly, anything but the aforementioned primitive generators and simple vertex deformers was inconvenient in this representation, so we just didn’t support it at all.
The original Werkkzeug added support for large meshes (>65536 vertices), more general meshes (faces had to be convex polygons with <=8 edges if I remember correctly), and added Extrusion and Catmull-Clark subdivision. These last two needed adjacency information; this was dealt with in the code in an ad-hoc way and had severe limitations: hard-edged meshes or discontinuous UV mappings didn't work properly.
Interestingly, the mesh representation we eventually ended up preferring (to be described in what will be "Half-edge based mesh representations redux") was very similar to what the original Werkkzeug did, with some crucial modifications. If I had known this at the time, it would've saved me months of my life and a lot of frustration. But I didn't, and so I wrote GenMesh, based on a fairly orthodox link-based half-edge representation. The first two articles in this mini-series will be about this mesh generator.
Design goals
For GenMesh, we wanted the following features:
- The basics we were already used to – primitives and vertex deformations.
- Support for variable vertex formats.
- No artificial limitations for mesh size – anything that fits into memory!
- Multiple materials per mesh.
- Polygon-based meshes (not just triangles) – mainly for Catmull-Clark subdivision surfaces (which was a key feature).
- Proper support for discontinuous attributes.
- Adjacency information is available and up-to-date all the time.
That said, we still intentionally left two limitations: we disallowed meshes with boundary edges (i.e. all our meshes were topologically closed), and we required that all meshes be 2-manifold (which in this case boils down to “no more than two faces may meet at an edge”). Originally, these limitations were chosen because they make the half-edge implementation easier, and we had a small hack to support meshes with boundary from the user perspective: keep an explicit boundary face for every hole, but assign it an invisible material. A few months later, we added Shadow Volumes, which happen to have the same two limitations, and the “bug” became a feature: provided you didn’t use a few disallowed Operators (the ones that generated the “virtual holes”), you knew that your mesh was closed and 2-manifold. And it was instantly obvious where such “topological defects” were introduced into the mesh. Anyone who’s ever tried to guarantee 2-manifoldness for meshes imported from an art package knows how much of a hassle tracking down such problems is when you don’t have that level of control over how your art assets get made.
Basic half-edges
You can find good material on the basics online, so I’ll keep this section fairly brief. Half-edges are among the edge-centric mesh representations. The reason for focusing on edges is quite simple: Face→Edge, Face→Vertex, Vertex→Edge and Vertex→Face are all one-to-many-type relationships in a general polygon mesh, but each edge links exactly 2 vertices and lies (in a closed, 2-manifold mesh) between exactly 2 faces – a fixed number of pointers (or indices), much nicer. Half-edges make this even more uniform by using directed edges (in the direction of winding order). Each half-edge is incident to one face and one vertex, and a pair of half-edges makes up a full edge, hence the name.
So how does this all look? Here’s the obligatory picture:

This shows a random highlighted half-edge e, going from vertex a to vertex b – half-edges are directed, so order matters. The corresponding half-edge in the opposite direction is denoted by opposite(e). Our edge e is part of face f (blue shaded area). The classic half-edge data structure keeps track of:
- Per half-edge e: either its start or end vertex index (let’s pick start(e), the red vertex, here), a link to the face face(e) it belongs to, links to the previous and next half-edges prev(e) and next(e) in the same face (in winding order), and a link to the corresponding opposite paired half-edge opposite(e).
- Per vertex: a link vertEdge(v) to one of the outgoing half-edges – any will do. (Equivalently, you can keep track of an incoming half-edge instead.)
- Per face: a link faceEdge(f) to any of the half-edges that make up the face.
It’s obvious that this structure lets us go from (half-)edge to (half-)edge since that’s what we store. It’s also obvious that we can enumerate all half-edges making up a face in-order just by following the next (or prev) links. And since we can get from half-edges to vertices, that means we can enumerate all vertices belonging to a face easily too. The one thing that’s not immediately obvious is that we can iterate around vertices, too: define
prevVert(e) := next(opposite(e))
nextVert(e) := opposite(prev(e))
(look at the diagram to see this does the right thing) – note these return a half-edge, not a vertex index. Again, we can use the face link for the corresponding half-edges to enumerate all faces that touch a given vertex.
Half-edge invariants
And that’s the classic half-edge data structure. Doesn’t look so bad, right? Unfortunately, there are lots of redundancies in this representation, because we keep explicit links for everything to make traversal fast. Which in turn means that there are tons of invariants we have to maintain. Here are the most important ones:
- For every half-edge e: e = prev(next(e)) = next(prev(e)) = opposite(opposite(e)), and face(e) = face(next(e)).
- For every vertex v: start(vertEdge(v)) = v.
- For every face f: face(faceEdge(f)) = f.
Note that while these invariants guarantee a connectivity structure that’s free of basic self-contradictions, you need even more than that to get a meaningful polygon mesh. For example, you would probably require that faces contain at least 3 distinct edges. Another seemingly obvious invariant that I won’t require is that start(next(e)) = start(opposite(e)) (two seemingly equivalent expressions for end(e) in the diagram). In fact, I’ll pick the first expression to define end:
end(e) := start(next(e))
and say nothing at all about the second expression for now. I also won’t require for arbitrary edges e that start(e) = start(nextVert(e)). The reasons for all this should become obvious in a minute.
But first, the list above should give you pause. Anyone who’s ever implemented a doubly-linked list knows that even maintaining a single invariant of the type prev(next(e)) = next(prev(e)) above requires careful thinking (and coding) on any operation that updates the data structure. As a rule of thumb, the more invariants need to be maintained by a data structure or algorithm, the trickier it is to implement, and the more likely an implementation is to contain bugs. That’s why doubly-linked lists are harder than singly-linked lists, why binary search is harder than linear search, and why balanced search trees are harder than binary search trees, which are in turn harder to get right than sorted lists/arrays. By that metric, then, the list above suggests we’re in deep trouble. The only way I know to get this kind of data structure manipulation right and stay sane is to build it out of small, local data structure manipulations that can be understood in isolation. The actual set of operations used by the “old” mesh generator in Werkkzeug3 (used for Candytron, .kkrieger, Theta, Debris etc.) is in fact really small (and somewhat limited); I’ll go into more detail in the next post.
As a side note, I’ve only talked about connectivity here; if you plan to interpret this as a polygon mesh, there’s also geometric invariants you’d like to be maintained: distinct vertices are at distinct positions, all faces are planar and convex, and so forth. I’ll completely ignore this type of constraint and the fix-ups necessary to enforce them in this series to concentrate on the data structures.
Attributes and discontinuities
So far, I’ve described the standard half-edge data structure, as you’ll find it in Computational Geometry and Computer Graphics books (and numerous articles on the web). What it doesn’t deal with (so far) is vertices with associated attributes, such as triangle meshes with normals, UV coordinates (possibly multiple sets), vertex colors and so forth. Now, as long as these are all perfectly smooth and interpolated over the mesh, we can still use basic half-edges as discussed above. But as soon as we have sharp discontinuities, we run into an issue. Discontinuities in normals can usually be hand-waved away by using smoothing groups, and the case you’re most likely to care about in practice is discontinuities in UV coordinates (typically seams created during unwrapping). Nevertheless, I’m going to use vertex colors for the purpose of discussion, because they’re easier to draw in a picture. So here I present, for your viewing pleasure, the Rickety Multicolored Cube™:
The funky pie charts inside our oversized vertices are made of, to borrow terminology from a paper by Hugues Hoppe, “wedges”. A wedge is a set of attributes (normals, UV coordinates etc.) used by at least one face (but possibly more) incident to a vertex. This virtual cube has a different color on every side, which means that each of the 8 corner vertices has 3 associated wedges, one for each side. If I were to draw a tessellated cube with 2×2 quads making up each face, there’d also be vertices in the middle of each cube edge that only have 2 wedges, and vertices at the center of the cube faces that only have one wedge (i.e. no discontinuities at all, which is in fact the case for most vertices in a normal mesh). You’ll have to imagine that picture, since even the simplified, incomplete, skewed version you see above took me ages to draw.
Now, whenever any single attribute (in this case color) varies between any of the faces incident to a vertex, we have multiple wedges. However, just because one attribute is discontinuous doesn’t mean that all of them are; for example, you might have discontinuous normals across an edge but still want to interpolate UV coordinates smoothly. To resolve this case correctly, Werkkzeug3 stores what I call “crease flags” for each edge (yes, per edge not half-edge). This is just a single bit per attribute to denote whether this attribute has a discontinuity across the given edge or not. To visualize this, imagine that each attribute has its own set of wedges (I’ll call these “logical wedges”, and they depend on the attribute; “physical wedges”, the things we actually store, do not). The boundaries in the “pie chart” are exactly along the edges where the corresponding crease flag is set. Now, what we actually store is just the physical wedges. Whenever there’s a discontinuity in any of the attributes (i.e. whenever we cross an edge with non-zero crease flags), we store a new set of vertex attributes, but only some of them (namely the ones with their crease flags set) are allowed to change across that edge. I hope that makes sense.
Dealing with wedges
Hoppe maintains the distinction between wedges and vertices; vertices just store their position whereas wedges hold all interpolated attributes. As mentioned above, things get murky in the general case, where the set of seams (what I call “crease edges”) is potentially different for every single attribute, and I try to store just one set of wedges (with all “cuts” folded in) plus meta-information of which discontinuities are where. Now if I modify an attribute at a vertex that’s not discontinuous, I might need to update multiple copies – one per physical wedge. What this means is that in general, you can’t change the value of any attribute “at a vertex” – you need to know which physical wedges need updating. It also means that you can’t (in general) specify which value of an attribute you mean just by specifying the vertex: there might be multiple values at that vertex. However, while the vertex alone isn’t distinctive enough, a (vertex, face) pair is – or, more concisely, the half-edge corresponding to that (vertex, face) pair. So all vertex update operations actually need to take a half-edge reference as input, not just a vertex ID. A vertex attribute update that respects the current set of discontinuities looks like this:
Input: seed edge e, attribute index a to update, new value v.
- Find start of logical wedge. Sweep e clockwise around its start vertex until you hit an edge with the crease flag for a set, or you’re back at the beginning.
- Find end of logical wedge. Now sweep counterclockwise from the start of the logical wedge found in the previous step, again until you hit a crease for a or you’re back at the beginning.
- Update logical wedge. Loop over all physical wedges between the two extremes (if the two are equal, this reduces to just “all physical wedges”) and set their value for attribute a to v.
Yes, that’s a lot of work to change a single value. What can I say, there’s a reason we later rewrote the whole thing.
Dealing with multiple sets of attributes means we need to handle the case where only some of them are discontinuous anyway (hence the algorithm above). Which means there’s really no big win from keeping positions and attributes separate in the data structure; we can just treat position as an attribute that we always require to have no discontinuities (and in fact we could also theoretically represent boundary edges as “position discontinuities”, though I haven’t tried this).
So here’s the punchline: we don’t separate vertices and wedges. We just store an array of (physical) wedges, which also happens to be exactly what needs to go into the vertex buffer for rendering (eventually). Instead of a vertex ID, we just store the physical wedge ID as start(e) in the corresponding half-edges. When looping around a vertex using prevVert and nextVert, we end up iterating over all wedges for a vertex, so we might see multiple values of start(e) (that’s why I didn’t require consistent vertex numbers for this case). This should also explain why I defined end(e) to be start(next(e)) – the alternative choice of start(opposite(e)) will still return something at the same position, but potentially corresponding to a different wedge.
Anyway, enough for this post – I’ll let the whole “wedge” thing sink in for a while. I realize that describing this in text is terrible, but frankly, if I have to spend more time futzing around with vector drawing tools this post is never going to happen, out of sheer frustration. Besides, if you want to actually grok this, there’s just no way around filling up a lot of paper with your own diagrams; at least, that’s the way it went for me when reading papers about mesh data structures. Until next time!
It’s now been almost 5 years since we (farbrausch) released fr-041: debris, and almost 8 years since the somewhat ill-fated .kkrieger. We tried to package that technology up and sell it for a while, which I wasn’t particularly happy with when we decided to start doing it and extremely unhappy with by the time we stopped :). We decided to dissolve the company that owned the IP (.theprodukkt GmbH) about two years ago, a process that was finished last summer. Some months ago we got the tax returns for our last fiscal year; I missed the party on account of currently being on another continent from my other co-founders.
So now both the tech and the code are somewhat orphaned. We could just release the source code into the public, but frankly it’s an unholy mess, and you’re likely to miss all the cool bits among the numerous warts. I’ve done source releases before, and we might still do it with Werkkzeug3 (the tool/framework behind the two aforementioned productions and nineteen others). But I’d really rather present it in a somewhat more curated form, where I highlight the interesting bits and get to sweep all the mess behind it under the rug. So here’s the idea: this post contains a list of things in Debris that I think might make for an interesting post. Unlike my “Trip Through The Graphics Pipeline 2011” series, this list is far too long to just decide on a whim to do all of it. So instead, you get to vote: if there’s anything on this list that you’d like me to write about, post a comment. If there’s sufficient interest and I’m in the mood, I’ll write a post about it. :)
The one thing I’m not going to talk about is our tool-centric way of producing demos. Chaos and I have talked about that several times and I’m getting bored of the topic (I think there’s videos on the net, but didn’t find anything on my first Google-sweep; will re-check later). Also, some of the topics have dependencies among each other, so I might decide to post something on the basics first before I get into specifics. Just so you’re warned. Anyway, here’s the list:
Source code
Basics / execution environment
- Operators (the building block of our procedural system)
- Writing small code on Win32/x86 (in C++)
- Executable compression
- Memory Management on the editor side
- The Operator Execution Engine (demo/player side)
Texture generation
- How to generate cellular textures part 1 and part 2.
- Fast blurs part 1 and part 2.
- Fast Perlin noise-based texture generation
- Shitting bricks
Compression
- x86 code compression in kkrunchy
- Squeezing operator data
Animation
- Operator “initialization” versus “execution” (animatable parameters)
- Animation scripts
- The timeline
Mesh Generation
- Half-edge based mesh representations: theory
- Half-edge based mesh representations: practice
- Half-edges redux
- Extrusions
- Catmull-Clark subdivision surfaces
- Bones and Bends
- 3D Text
Sound
- Debris and kkrieger use Tammo “kb” Hinrichs’s V2 synth, described on his blog: part I, part II, part III, and part IV.
Effects
- Swarms of cubes
- Exploding geometry
- Image postprocessing
Shaders and 3D engine
- Shadergen (directly generates D3D9 bytecode for ubershaders, code here)
- The old material/lighting system: .kkrieger, PS1.3, cubes
- The new material/lighting system: PS2.0, the city, multipass madness
- Basic engine architecture – lights, layers and passes
- Converting generated meshes to rendering-ready meshes
- Skinning and Shadow Volumes
Other
- Metaprogramming for madmen (a small story about .kkrieger)
That’s not all of it, but it’s all I can think of right now that is a) somewhat interesting (to me that is) and b) something I can write about – for example, I wouldn’t be comfortable writing about the music/sound or overall flow/direction of the demo, since frankly I had little to do with that. And of course I didn’t write the code for all of the above either :), but I do know most of the code well enough to do a decent job describing it. That said, if you think there’s something I missed, just ping me and I’ll put it on the list.
So there you go. Your turn!
Row-major vs. column-major is just a storage order thing and doesn’t have anything to do with what kind of vectors you use. But graphics programmers tend to be exposed to either GL (which uses column-major storage and column vectors) or D3D (which used row-major storage and row vectors in its fixed function pipeline, and still uses that convention in its examples), so they think the two things are tied together somehow. They’re not. So let’s disentangle them! But before we start, a small apology: In this article I’ll be using both “formula” and “code” notation for variable names, which is a typographic faux pas but necessary because WordPress isn’t very good at lining up formulas with body text and I don’t want to make this unnecessarily hard to read.
Row/column major
There’s no disagreement about how plain arrays are stored in memory: any programming language that supports a plain (not associative) array type just stores the elements sequentially. Once a language supports multidimensional arrays, however, it needs to decide how to squeeze the 2D arrangement of data into a 1D arrangement in memory, typically as a 1D array. One classical use case for multidimensional arrays is matrices:
Given a 2×3 matrix A = (aij) – two rows with entries a11, a12, a13 and a21, a22, a23 – there are two “canonical” ways to store it in memory (discounting fancier storage schemes such as tiled or Morton-order storage):
- Row-major storage traverses the matrix by rows, then within each row enumerates the columns. So our A would be stored in memory as a11, a12, a13, a21, a22, a23. This matches reading order in English and most other Western languages. Row-major storage order for 2D arrays is used by C / C++, the Borland dialects of Pascal, and pretty much any language that was designed since the mid-1970s and supports multidimensional arrays directly. (Several newer languages such as Java don’t support 2D arrays directly at all, opting instead to represent them as a 1D array of references to 1D arrays.) It’s also the dominant storage order for images. The position of the element at row i, column j in the underlying 1D array is computed as i*stride + j, where stride is the number of elements stored per row – usually the width of the 2D array, but it can also be larger.
- Column-major storage traverses the matrix by columns, then enumerates the rows within each column. A would be stored in memory as a11, a21, a12, a22, a13, a23. Column-major storage is used by FORTRAN and hence by BLAS and LAPACK, the bread and butter of high-performance numerical computation. It is also used by GL for matrices (due to something of a historical accident), even though GL is a C-based library and all other types of 2D or higher-dimensional data in GL (2D/3D textures etc.) use row-major.
Let me reiterate: both row-major and column-major are about the way you lay out 2D (3D, …) arrays in memory, which uses 1D addresses. In the example (and also in the following), I always write the row index first, followed by the column index, i.e. aij is the element of A in row i and column j, no matter which way it ends up being stored. By the same convention, I write “3×4 matrix” to denote a matrix with 3 rows and 4 columns, independent of memory layout.
Matrix multiplication
The next ingredient we need is matrix multiplication. If A and B are an l×m and an m×n matrix, respectively, their product C = AB is an l×n matrix – note the middle dimension has to match between the two. The element in row i and column j of matrix C is computed as the dot product of the i-th row of A and the j-th column of B, or in formulas: cij = Σk aik bkj. All this is standard stuff.
The important point to note here is that for a matrix product AB you always compute the dot product between rows of A and columns of B; there’s no disagreement here. This is how everyone does it, row-major layout or not.
Row and column vectors
So here’s where the confusion starts: an n-element vector can be interpreted as a matrix (a “typecast” if you will). Again, there are two canonical ways to do this: either we stack the n numbers vertically into an n×1 matrix, which gives a column vector, or we arrange them horizontally, producing a 1×n matrix or row vector. Note that this has nothing to do with the storage layout; in both row- and column-major storage orders, row and column vectors are simply stored as an n-element 1D array.
To transform a vector by a matrix, you first need to convert the vector to a matrix (i.e. choose whether it’s supposed to be a column or row vector this time), then multiply the two. In the usual graphics setting, we have a 4×4 matrix T and a 4-vector v. Since the middle dimension in a matrix product must match, we can’t combine them arbitrarily. If we write v as a 4×1 matrix (column vector), we can compute Tv but not vT. If v is instead a 1×4 matrix (row vector), only vT works and Tv leads to a “dimension mismatch”. In both cases, the result is again a column or row vector, respectively, which has repercussions: if we want to transform the result again by another matrix, there’s again only one place where we can legally put it. For column vectors, transforming a vector by T and then S leads to the expression S(Tv) = (ST)v, so the matrix that represents “first T then S” is given by the product ST. For row vectors, we get (vT)S = v(TS), so the matrix for the concatenation of the two is given by TS. Small difference, big repercussions: whether you choose row or column vectors influences how you concatenate transforms.
GL fixed function is column-major and uses column vectors, whereas D3D fixed function was row-major and used row vectors. This has led a lot of people to believe that your storage order must always match the way you interpret vectors. It doesn’t. Another myth is that matrix multiplication order depends on whether you’re using row-major or column-major. It doesn’t: matrices are always multiplied the same way, and what that product matrix means depends on what type of vector you use, not what kind of storage scheme.
One more point: say you have two n-element vectors, u and v. There are two ways to multiply them together as a matrix product (yes, you can multiply two vectors): if u is a row vector and v is a column vector, the product uv is a 1×1 matrix, or just a single scalar – the dot product of the two vectors. This operation is the (matrix) inner product. If instead u is a column vector and v is a row vector, the result uv is an n×n matrix, the outer product. (The outer product can also be computed when the dimensions of u and v don’t match, but I’ll ignore that case here.) The outer product isn’t as common as its more famous cousin, but it does crop up in several places in Linear Algebra and is worth knowing about. I bring these two examples up to illustrate that you can’t just hand-wave away the issue of vector orientation entirely; you need both, and which one is which matters when the two interact.
With that cleared up, two comments: First, I recommend that you stick with whatever storage order is the default in your language/environment (principle of least surprise and all that), but it really doesn’t matter whether you pick row or column vectors – not in any deep sense, anyway. The difference between the two is purely one of notation (and convention). Second, there happens to be a very much dominant convention in maths, physics and really all the rest of the world except for Computer Graphics, and that convention is to use column vectors by default. For whatever reason, some of the very earliest CG texts chose to break with tradition and used row vectors, and ever since CG has been cursed with a completely unnecessary confusion about things such as matrix multiplication order. If you get to choose, I suggest you prefer column vectors. But whatever you do, make sure to document whatever you use in your code and stick with it, at least within individual components. In terms of understanding and debuggability, a single source file that uses two different vector math libraries with opposite conventions (or three – I’ve seen it happen) is a catastrophe. It’s a bug honeypot. Just say no, and fix it when you see it.
This is a small but nifty solution I discovered while working on IO code for Iggy. I don’t claim to have invented it – as usual in CS, this was probably first published in the 60s or 70s – but I’d never run across it in this form before, so it seems worth writing up. A small disclaimer up front: first, this describes a combination of techniques that are somewhat orthogonal, but they complement each other nicely, so I describe them at the same time; second, it’s not the best approach in every scenario, but it’s nifty and definitely worth knowing about. I’ve found it particularly useful when dealing with compressed data.
Setting
A lot of code needs to be able to accept data from different sources. Sometimes you want to load data from a file; sometimes it’s already in memory, or it’s a memory-mapped file (boils down to the same thing). If it comes from a file, you may have one file per item you want to load, or pack a bunch of related data into a “bundle” file. Finally, data may be compressed or encrypted.
If possible, the simplest solution is to just write your code to assume one such method and stick with it – the easiest usually being the “already in memory” option. This is fine, but can get problematic depending on the file format: for example, the SWF (Flash) files that Iggy loads are usually Deflate-compressed as a whole, and may contain chunks of image data that are themselves JPEG- or Deflate-compressed. Finally, Iggy wants all the data in its own format, not in the Flash file format, since even uncompressed Flash files involve a lot of bit-packing and implicit information that makes them awkward to deal with directly.
Ultimately, all we want to have in memory is our own data structures; but an implementation that expects all data to be decompressed to memory first will need a lot of extra memory (temporarily at least), which is problematic on memory-constrained platforms.
The standard solution is to get rid of the memory-centric view completely and instead implement a stream abstraction. For read-only IO (which is what I’ll be talking about here), a simple but typical (C++) implementation looks like this:
class SimpleStream {
public:
    virtual ErrorCode Read(U8 *buffer, U32 numBytes,
                           U32 *bytesActuallyRead) = 0;
};
Code that needs to read data takes a Stream argument and calls Read when necessary to fill its input buffers. You’ll find designs like this in most C++ libraries that handle at least some degree of IO.
Simple, right?
The problems
Actually, there are multiple problems with this approach.
For example, consider the case where the file is already loaded into memory (or memory-mapped). The app expects data to land in buffer, so our Read implementation will end up having to do a memcpy or something similar. That’s unlikely to be a performance problem if it happens only once, but it’s certainly not pretty, and if you have multiple streams nested inside each other, the overhead keeps adding up.
Somewhat more subtly, there’s the business with numBytes and bytesActuallyRead. We request some number of bytes, but may get a different number of bytes back (anything between 0 and numBytes inclusive). So client code has to be able to deal with this. In theory, “stream producer” code can use this to make things simpler; to give an example, lots of compressed formats are internally subdivided into chunks (with sizes typically somewhere between 32KB and 1MB), and it’s convenient to write the code such that a single Read will never cross a chunk boundary. In practice, a lot of the “stream consumer” code ends up being written using memory or uncompressed file streams as input, and will implicitly assume that the only cases where “partial” blocks are returned are “end of file” or error conditions (because that’s all they ever see while developing the code). This behavior may be technically wrong, but it’s common, and the actual cost (slightly more complexity in the producer code) is low enough to not usually be worth making a stand about.
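In practice, careful consumer code ends up wrapping Read in a retry loop to paper over partial results. Here is a sketch of that boilerplate (`ReadFull` is my own name, not from any particular library), assuming the `SimpleStream` interface above, restated so the snippet is self-contained:

```cpp
typedef unsigned char U8;
typedef unsigned int U32;
enum ErrorCode { NoError, ReadPastEOF };

class SimpleStream {
public:
    virtual ErrorCode Read(U8 *buffer, U32 numBytes,
                           U32 *bytesActuallyRead) = 0;
    virtual ~SimpleStream() {}
};

// Loop until we've either read numBytes or hit an error/EOF. This is
// the code every careful consumer of a SimpleStream-style API ends up
// writing - and that sloppy consumers silently omit.
static ErrorCode ReadFull(SimpleStream *s, U8 *buffer, U32 numBytes)
{
    U32 total = 0;
    while (total < numBytes) {
        U32 got = 0;
        ErrorCode err = s->Read(buffer + total, numBytes - total, &got);
        if (err != NoError)
            return err;
        if (got == 0)           // no progress: treat as end-of-file
            return ReadPastEOF;
        total += got;
    }
    return NoError;
}
```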
So we have an API where we frequently end up having to check for and deal with what’s basically the same boundary condition on both sides of the fence. That’s a bad sign.
Finally, there’s the issue of error handling. If we’re loading from memory, there’s basically only one error condition that can happen: unexpected end, when we’ve exhausted the input buffer even though we want to read more. But for a general stream, there’s all kinds of errors that can happen: end of file, yes, but also read errors from disk, timeouts when reading from a network file system, integrity-check errors, errors decompressing or decrypting a stream, and so on. Ideally, every single caller should check error codes (and perform error handling if necessary) all the time. Again, in practice, this is rarely done everywhere, and what error-handling code is there is often poorly tested.
The problem is that checking for and reacting to error conditions everywhere is inconvenient. One solution is using exception handling; but this is, at least among game developers, a somewhat controversial C++ feature. In the case of Iggy, which is written in C, it’s not even an option (okay, there is setjmp / longjmp, which we even use in some cases – though not for IO – but let’s not go there). A different option is to write the code such that error checking isn’t required after every Read call; if possible, this is usually much nicer to use.
A different approach
Without further ado, here’s a different stream abstraction:
struct BufferedStream {
    const U8 *start;   // Start of buffer.
    const U8 *end;     // (One past) end of buffer,
                       // i.e. buffer_size = end - start.
    const U8 *cursor;  // Cursor in buffer.
                       // Invariant: start <= cursor <= end.
    ErrorCode error;   // Initialized to NoError.

    ErrorCode (*Refill)(BufferedStream *s);
    // Pre-condition:  cursor == end (buffer fully consumed).
    // Post-condition: start == cursor < end.
    // Always returns "error".
};
There are three big conceptual differences here:
- The buffer is owned and managed by the producer – the consumer does not get to specify where data ends up.
- The Read function has been replaced with a Refill function, which reads in an unspecified number of bytes – the one guarantee is that at least one more byte will be returned, even in case of error; more about that later. Also note it’s a function pointer, not a virtual function. Again, there’s a good reason for that.
- Instead of reporting an error code on every Read, there’s a persistent error variable – think errno, but local to the given stream, which avoids the serious problems that hamper the ‘real’ errno.
Note that this abstraction is both more general and more specific than the stream abstraction above: more general in the sense that it’s easy to implement SimpleStream::Read on top of the BufferedStream abstraction (I’ll get to that in a second), and more specific in the sense that all BufferedStream implementations are, by necessity, buffered (it might just be a one-byte-buffer, but it’s there). It’s also lower-level; the buffer isn’t encapsulated, it’s visible to client code. That turns out to be fairly important. Instead of elaborating, I’m gonna present a series of examples that demonstrate some of the strengths of this approach.
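For concreteness, here is that adapter – a SimpleStream-style Read implemented on top of a BufferedStream. This is my own sketch (the name `BufferedRead` is mine; the struct follows the post):

```c
#include <string.h>

typedef unsigned char U8;
typedef unsigned int U32;
typedef enum { NoError, ReadPastEOF } ErrorCode;

typedef struct BufferedStream BufferedStream;
struct BufferedStream {
    const U8 *start, *end, *cursor;
    ErrorCode error;
    ErrorCode (*Refill)(BufferedStream *s);
};

/* SimpleStream-style Read on top of a BufferedStream: copy out of the
   producer's buffer, refilling as needed. */
static ErrorCode BufferedRead(BufferedStream *s, U8 *buffer, U32 numBytes,
                              U32 *bytesActuallyRead)
{
    U32 total = 0;
    while (total < numBytes) {
        if (s->cursor == s->end) {      /* buffer consumed: refill */
            if (s->Refill(s) != NoError)
                break;                  /* error is sticky; stop here */
        }
        U32 avail = (U32)(s->end - s->cursor);
        U32 want = numBytes - total;
        U32 take = want < avail ? want : avail;
        memcpy(buffer + total, s->cursor, take);
        s->cursor += take;
        total += take;
    }
    *bytesActuallyRead = total;
    return s->error;
}
```

Note that this direction (buffered to unbuffered) needs a memcpy, while the reverse does not; that asymmetry is exactly why BufferedStream is the better primitive to build on.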
Example 1: all zeros
Probably the simplest implementation of this interface just generates an infinite stream of zeros:
static ErrorCode refillZeros(BufferedStream *s)
{
    static const U8 zeros[256] = { 0 };
    s->start = zeros;
    s->cursor = zeros;
    s->end = zeros + sizeof(zeros);
    return s->error;
}

static void initZeroStream(BufferedStream *s)
{
    s->Refill = refillZeros;
    s->error = NoError;
    s->Refill(s);
}
Not much to say here: whenever Refill is called, we reset the pointers to go over the same array of zeros again (the count of 256 is completely arbitrary; you could just as well use 1, but then you’d end up calling Refill a lot).
Example 2: stream from memory
For the next step up, here’s a stream that reads either directly from memory, or from a memory-mapped file:
static ErrorCode fail(BufferedStream *s, ErrorCode reason)
{
    // For errors: set the error status, then continue returning
    // zeros indefinitely.
    s->error = reason;
    s->Refill = refillZeros;
    return s->Refill(s);
}

static ErrorCode refillMemStream(BufferedStream *s)
{
    // If this gets called, the client wants to read past the end of
    // the memory buffer, which is an error.
    return fail(s, ReadPastEOF);
}

static void initMemStream(BufferedStream *s, U8 *mem, size_t size)
{
    s->start = mem;
    s->cursor = mem;
    s->end = mem + size;
    s->error = NoError;
    s->Refill = refillMemStream;
}
The init function isn’t very interesting, but the Refill part deserves some explanation: As mentioned before, the client calls Refill when it’s finished with the current buffer and still needs more data. In this case, the “buffer” was all the data we had; therefore, if we ever try to refill, that means the client is trying to read past the end-of-file, which is an error, so we call fail, which first sets the error code. We then change the refill function pointer to refillZeros, our previous example: in other words, if the client tries to read past the end of file, they’ll get an infinite stream of zeros (satisfying our above invariant that Refill always returns at least one more byte). But the error flag will be set, and since refillZeros doesn’t touch it, it will stay the way it was.
In effect, making Refill a function pointer allows us to implicitly turn any stream into a state machine. This is a powerful facility, and it’s great for this kind of error handling and to deal with certain compressed formats (more later), but like any kind of state machine, it quickly gets more confusing than it’s worth if there’s more than a handful of states.
Anyway, there’s one more thing worth pointing out: there’s no copying here. A from-memory implementation of SimpleStream would need to do a memcpy (or something equivalent) in its Read function, but since we don’t let the client decide where the data ends up, we can just point it directly to the original bytes without doing any extra work. Because forwarding is effectively free, it makes it easy to decouple “framing” from data processing even in performance-critical code.
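To underline the no-copy point: a consumer can process bytes directly out of the producer’s buffer, with no staging copies at all. A sketch (the `byteSum` checksum and its harness are mine, not from the post):

```c
#include <stddef.h>

typedef unsigned char U8;
typedef enum { NoError, ReadPastEOF } ErrorCode;

typedef struct BufferedStream BufferedStream;
struct BufferedStream {
    const U8 *start, *end, *cursor;
    ErrorCode error;
    ErrorCode (*Refill)(BufferedStream *s);
};

/* Consume "count" bytes, processing them in place - no memcpy, no
   intermediate buffer. The error check happens once, in the loop head. */
static unsigned byteSum(BufferedStream *s, size_t count)
{
    unsigned sum = 0;
    while (count) {
        if (s->cursor == s->end && s->Refill(s) != NoError)
            break;                      /* error is sticky; caller checks */
        size_t avail = (size_t)(s->end - s->cursor);
        size_t take = count < avail ? count : avail;
        for (size_t i = 0; i < take; i++)
            sum += s->cursor[i];
        s->cursor += take;
        count -= take;
    }
    return sum;
}
```

The same shape works for hashing, parsing, or feeding a decompressor: the "framing" (where the bytes live) stays entirely on the producer side.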
Example 3: parsing JPEG data
Case in point: JPEG data. JPEG files interleave compressed bitstream data and metadata. Metadata exists in several types and is denoted by a special code, a so-called “marker”. A marker consists of an all-1-bits byte (0xff) followed by a nonzero byte. Because the compressed bitstream can contain 0xff bytes too, they need to be escaped using the special code 0xff 0x00 (which is not interpreted as a marker). Thus code that reads JPEG data needs to look at every byte read, check if it’s 0xff, and if so look at the next byte to figure out whether it’s part of a marker or just an escape code. Having to do this in the decoder inner loop is both annoying (because it complicates the code) and slow (because it forces byte-wise processing and adds a bunch of branches). It’s much nicer to do it in a separate pass, and with this approach we can naturally phrase it as a layered stream (I’ll skip the initialization this time):
struct JPEGDataStream : public BufferedStream {
    BufferedStream *src; // Read from here.
};

static ErrorCode refillJPEG_lastWasFF(BufferedStream *s);

static ErrorCode refillJPEG(BufferedStream *str)
{
    JPEGDataStream *j = (JPEGDataStream *)str;
    BufferedStream *s = j->src;

    // Refill if necessary and check for errors.
    if (s->cursor == s->end) {
        if (s->Refill(s) != NoError)
            return fail(str, s->error);
    }

    // Find next 0xff byte (if any).
    const U8 *next_ff = (const U8 *)memchr(s->cursor, 0xff,
                                           s->end - s->cursor);
    if (next_ff) {
        // Return bytes until 0xff,
        // continue with refillJPEG_lastWasFF.
        j->start = s->cursor;
        j->cursor = s->cursor;
        j->end = next_ff;
        j->Refill = refillJPEG_lastWasFF;
        s->cursor = next_ff + 1; // Mark 0xff byte as read.
    } else {
        // Return all of current buffer.
        j->start = s->cursor;
        j->cursor = s->cursor;
        j->end = s->end;
        s->cursor = s->end; // Read everything.
    }
    return str->error;
}
static ErrorCode refillJPEG_lastWasFF(BufferedStream *str)
{
    static const U8 one_ff[1] = { 0xff };
    JPEGDataStream *j = (JPEGDataStream *)str;
    BufferedStream *s = j->src;

    // Refill if necessary and check for errors.
    if (s->cursor == s->end) {
        if (s->Refill(s) != NoError)
            return fail(str, s->error);
    }

    // Process marker type byte.
    switch (*s->cursor++) {
    case 0x00:
        // Not a marker, just an escaped 0xff!
        // Return one 0xff byte, then resume regular parsing.
        j->start = one_ff;
        j->cursor = one_ff;
        j->end = one_ff + 1;
        j->Refill = refillJPEG;
        break;

    // Was a marker. Handle other cases here...

    case 0xff: // 0xff 0xff is invalid in JPEG.
        return fail(str, CorruptedData);

    default:
        return fail(str, UnknownMarker);
    }
    return str->error;
}
It’s a bit more code than the preceding examples, but it demonstrates how streams can be “layered” and also nicely shows how the implicit state machine can be useful.
Discussion
This clearly solves the first issue I raised – there’s no unnecessary copying going on. The interface is also really simple: Everything is in memory. Client code still has to deal with partial results, but because the client never gets to specify how much data to read in the first place, the interface makes it a lot clearer that handling this case is necessary. It also makes it a lot more common, which goes a long way towards making sure that it gets handled. Finally, error handling – this is only partially resolved, but the invariants have been picked to make the life of client code as easy as possible: All errors are “sticky”, so there’s no risk of “missing” an error if you do something else in between doing the read and checking for errors (a typical issue with errno, GetLastError and the like). And our error-“refill” functions still return data (albeit just zeros in these examples). These two things together mean that, in practice, you only really need to check for errors inside loops that might not terminate otherwise. In all other cases, you will fall through to the end of functions eventually; as long as you check the error code before you dispose of the stream, it will be noticed.
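As an illustration of that last point, here is a sketch of a consumer that reads a little-endian U32 with no per-byte error checking, relying on the zero-refill and the sticky error flag. The helper names `getU8`/`getU32LE` are mine; the stream plumbing restates the earlier examples so the snippet is self-contained (with `const` added to the memory-stream init):

```c
#include <stddef.h>

typedef unsigned char U8;
typedef unsigned int U32;
typedef enum { NoError, ReadPastEOF } ErrorCode;

typedef struct BufferedStream BufferedStream;
struct BufferedStream {
    const U8 *start, *end, *cursor;
    ErrorCode error;
    ErrorCode (*Refill)(BufferedStream *s);
};

static ErrorCode refillZeros(BufferedStream *s)
{
    static const U8 zeros[256] = { 0 };
    s->start = s->cursor = zeros;
    s->end = zeros + sizeof(zeros);
    return s->error;
}

static ErrorCode refillMemStream(BufferedStream *s)
{
    /* Past EOF: set the sticky error, then keep yielding zeros. */
    s->error = ReadPastEOF;
    s->Refill = refillZeros;
    return s->Refill(s);
}

static void initMemStream(BufferedStream *s, const U8 *mem, size_t size)
{
    s->start = s->cursor = mem;
    s->end = mem + size;
    s->error = NoError;
    s->Refill = refillMemStream;
}

static U8 getU8(BufferedStream *s)
{
    if (s->cursor == s->end)
        s->Refill(s);              /* always yields at least one byte */
    return *s->cursor++;
}

static U32 getU32LE(BufferedStream *s)
{
    U32 v = getU8(s);              /* no error checks in here;  */
    v |= (U32)getU8(s) << 8;       /* the sticky flag is checked */
    v |= (U32)getU8(s) << 16;      /* once, by the caller, when  */
    v |= (U32)getU8(s) << 24;      /* it's done reading.         */
    return v;
}
```

Reading past the end just yields zeros with the error flag set, so a parser can decode a whole header and check `s->error` once at the end.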
One really cool thing you can do with this type of approach (that I don’t want to discuss in full or with code because it would get lengthy and is really only interesting to a handful of people) is give the client pointers into the window in a LZ77-type compressor. This completely removes the need for any separate output buffers, saving a lot of copying and some small amount of memory. This is especially nice on memory-constrained targets like SPUs. In general, giving the producer (instead of the consumer) control over where the buffer is in memory allows producer code to use ring buffers (like a LZ77 window) completely transparently, which I think is pretty cool.
There’s a lot more that could be said about this, and initially I wanted to, but I didn’t want to make this too long; better have two medium-sized posts than one huge post that nobody reads to the finish. Besides, it means I can base a potential second post on actual questions I get, rather than having to guess what questions you might have, which I’m not very good at.