
Cache coherency primer

July 7, 2014

I’m planning to write a bit about data organization for multi-core scenarios. I started writing a first post but quickly realized that there are a few basics I need to cover first. In this post, I’ll try just that.


Caches

This is a whirlwind primer on CPU caches. I’m assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)

In modern CPUs (almost) all memory accesses go through the cache hierarchy; there are some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.

The CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly – it’s physically impossible; the necessary wires don’t exist! Instead, they talk to their L1 caches, which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there are generally more cache levels involved; this means the L1 cache doesn’t talk to memory directly anymore, it talks to an L2 cache – which in turn talks to memory. Or maybe to an L3 cache. You get the idea.

Caches are organized into “lines”, corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I’m not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so better get used to it. In particular, I’m going to say “cache line” to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.
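
To make “a suitably aligned group of bytes” concrete, here is a minimal sketch that computes which line an address falls into. It assumes a 64-byte line size; the constant and function names are mine, and the real line size is a property of the CPU, not something the language guarantees.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed line size: 64 bytes, typical for current x86 and newer ARM cores.
constexpr std::uintptr_t kCacheLineSize = 64;

// Which aligned 64-byte block ("cache line") does an address fall into,
// and where inside that line does it sit?
void describe(const void* p) {
    std::uintptr_t addr   = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t line   = addr & ~(kCacheLineSize - 1); // start of the line
    std::uintptr_t offset = addr & (kCacheLineSize - 1);  // byte offset within it
    std::printf("address %#zx -> line %#zx, offset %zu\n",
                static_cast<std::size_t>(addr),
                static_cast<std::size_t>(line),
                static_cast<std::size_t>(offset));
}

int main() {
    int values[32] = {};
    describe(&values[0]);
    describe(&values[17]); // same line or the next one, depending on alignment
}
```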

When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read.
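
As a toy illustration of why fetching the whole line pays off (assuming 4-byte ints and 64-byte lines, neither of which is guaranteed): sequential traversal only misses once per line, not once per element.

```cpp
#include <cstddef>

// Toy illustration of the locality assumption, not a benchmark. With 4-byte
// ints and 64-byte lines, the first access to a line misses and pulls in 16
// elements at once; the next 15 iterations then hit in the L1D$.
long sum(const int* data, std::size_t count) {
    long total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += data[i]; // sequential access: roughly one miss per line
    return total;
}
```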

And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the

Basic invariant: the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times.

Things get a bit more complicated once we allow stores, i.e. memory writes. There are two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that’s it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always.

Write-back is a bit trickier. The cache doesn’t pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged “dirty”. Dirty cache lines can trigger a write-back, at which point their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are “clean” again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different.

Write-back invariant: after writing back all dirty cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses.

In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).
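
To restate the bookkeeping in code, here is a minimal software model of one write-back cache line. This is purely illustrative: the struct layout and names are made up, and real hardware does all of this in the cache controller, not in software.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Minimal model of a single write-back cache line (illustrative only).
struct CacheLine {
    std::uint64_t                tag   = 0;     // which block of memory this line holds
    bool                         valid = false;
    bool                         dirty = false; // do the contents differ from memory?
    std::array<std::uint8_t, 64> data{};
};

// A store in a write-back cache: modify the local copy and mark it dirty.
// Memory is not touched yet. (A write-through cache would instead forward
// the store to the next level right here.)
void store_byte(CacheLine& line, std::size_t offset, std::uint8_t value) {
    line.data[offset] = value;
    line.dirty = true;
}

// Eviction or an explicit flush is what finally pushes the data out;
// afterwards the line is clean again and the write-back invariant holds.
void write_back(CacheLine& line, std::uint8_t* memory_block) {
    if (line.valid && line.dirty) {
        for (std::size_t i = 0; i < line.data.size(); ++i)
            memory_block[i] = line.data[i];
        line.dirty = false;
    }
}
```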

Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient.

Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$ but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2$ for example.

Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here.

Coherency protocols

As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches?

Well, the answer is quite simple: nothing happens. And that’s bad, because we want something to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don’t really have a “shared memory” system, more like a “shared general idea of what’s in memory” system.

Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1$, and all processors have to share it. Each cycle, the L1$ picks one lucky core that gets to do a memory operation this cycle, and runs it.

This works just fine. The only problem is that it’s also slow, because cores now spend most of their time waiting in line for their next turn at a L1$ request (and processors do a lot of those, at least one for every load/store instruction). I’m pointing this out because it shows that the problem really isn’t so much a multi-core problem as it is a multi-cache problem. We know that one set of caches works, but when that’s too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.

There are multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, the directory-based approach, has higher latency but scales better to systems with lots of cores.)

The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.

With write-through caches, this is fairly straightforward, since writes get “published” as soon as they happen. But if there are write-back caches in the mix, this doesn’t work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it’s not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what’s commonly called the MESI protocol.

MESI and friends

This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:

  • Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place.
  • Shared lines are clean copies of the contents of main memory. Cache lines in the shared state can be used to serve reads but they can’t be written to. Multiple caches are allowed to have a copy of the same memory location in “shared” state at the same time, hence the name.
  • Exclusive lines are also clean copies of the contents of main memory, just like the S state. The difference is that when one core holds a line in E state, no other core may hold it at the same time, hence “exclusive”. That is, the same line must be in the I state in the caches of all other cores.
  • Modified lines are dirty; they have been locally modified. If a line is in the M state, it must be in the I state for all other cores, same as E. In addition, modified cache lines need to be written back to memory when they get evicted or invalidated – same as the regular dirty state in a write-back cache.

If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.

Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.
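
Here is a toy sketch of that per-line state machine from one core’s point of view, covering the transitions described in the last few paragraphs. The names are mine, and bus races, in-flight write-backs, and plenty of other detail are glossed over; it’s a mental model, not an implementation.

```cpp
// Toy MESI state machine for a single cache line, as seen by one core.
enum class State { Modified, Exclusive, Shared, Invalid };

// Local core wants to read this line.
State on_local_read(State s, bool other_caches_have_copy) {
    if (s != State::Invalid)
        return s; // hit: M/E/S can all serve reads
    // miss: fetch the line; E if nobody else holds it, S otherwise
    return other_caches_have_copy ? State::Shared : State::Exclusive;
}

// Local core wants to write this line.
State on_local_write(State s) {
    // From S or I, the core must first broadcast "I want exclusive access"
    // (a Read For Ownership), invalidating every other core's copy.
    // From E it can silently upgrade; from M it just stays put.
    return State::Modified;
}

// Another core's read request, observed by snooping the shared bus.
State on_bus_read(State s) {
    // An M line has to supply its data (or write it back) before sharing.
    return (s == State::Invalid) ? State::Invalid : State::Shared;
}

// Another core asked for exclusive ownership: drop our copy.
State on_bus_write_request(State s) {
    // (An M line writes its data back before this happens.)
    (void)s;
    return State::Invalid;
}
```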

The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:

Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions.
Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called “Read For Ownership” request).

And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the

MESI invariant: after writing back all dirty (M-state) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other cores’ caches.

Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.

Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.

I’m no expert on the topic, and it’s quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I’m not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not “coherent except for a short window after a change” – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency, the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and “where do they happen”?

Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.

I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.

So here’s the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and waits for it to complete before sending the next one. And of course, in practice modern CPUs normally do none of these things:

  • Caches do not respond to bus events immediately. If a bus message triggering a cache line invalidation arrives while the cache is busy doing other things (sending data to the core for example), it might not get processed that cycle. Instead, it will enter a so-called “invalidation queue”, where it sits for a while until the cache has time to process it.
  • Cores do not, in general, send memory operations to the cache in strict program order; this is certainly the case for cores with Out-of-Order execution, but even otherwise in-order cores may have somewhat weaker ordering guarantees for memory operations (for example, to ensure that a single cache miss doesn’t immediately make the entire core grind to a halt).
  • In particular, stores are special, because they’re a two-phase operation: we first need to acquire exclusive ownership of a cache line before a store can go through. And if we don’t already have exclusive ownership, we need to talk to the other cores, which takes a while. Again, having the core idle and twiddling thumbs while this is happening is not a good use of execution resources. Instead, what happens is that stores start the process of getting exclusive ownership, then get entered into a queue of so-called “store buffers” (some refer to the entire queue as “store buffer”, but I’m going to use the term to refer to the entries). They stay around in this queue for a while until the cache is ready to actually perform the store operation, at which point the corresponding store buffer is “drained” and can be recycled to hold a new pending store.

The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there are essentially two camps:
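
The classic “store buffer” litmus test makes the “stores finish late” point concrete. The sketch below uses C++11 atomics, so strictly speaking it demonstrates the language-level memory model rather than any particular CPU, but the underlying mechanism is exactly the store buffering described above.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // With relaxed ordering, r1 == 0 && r2 == 0 is a legal outcome: each
    // store can sit in its core's store buffer while the following load
    // runs, so each load misses the other thread's write. Making all four
    // operations memory_order_seq_cst forbids that outcome, at the cost of
    // the extra ordering work discussed in the text.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```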

Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there are no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.
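
In portable code you usually don’t write raw barrier instructions; you express the required ordering through C++11 atomics and let the compiler pick the right instructions per architecture. Here is a minimal sketch of the usual “publish data, then set a flag” pattern; the names are just for illustration.

```cpp
#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                 // plain store
    ready.store(true, std::memory_order_release); // publish: earlier writes may not sink below this
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // later reads may not rise above this
        ;                                          // spin: fine for a sketch, wasteful in real code
    // If we get here, payload is guaranteed to read as 42. On a weakly
    // ordered CPU (ARM, POWER) the release/acquire pair compiles down to
    // the appropriate barriers or ordered instructions; on x86 the plain
    // loads and stores are already strong enough, and the main effect is
    // to stop the compiler itself from reordering.
    (void)payload;
}
```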

Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that are not fully finished (“retired”) yet, in a chip-internal data structure that’s called the MOB (“memory ordering buffer”). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there’s a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article “Speculatively speaking”. The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that’s inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the “memory ordering machine clear” I covered in yet another earlier post. The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though.

So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.

Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!


  1. Because I am the grammar Nazi I noticed a few minor errors. FWIW. Delete this comment after reading if you want. In particular, ‘are’ ain’t contractable.

    usage so, well, get used

    virtually-addressed caches

    there’s only one L1$

    There are multiple types

    But if there are write-back caches

    there are essentially two camps

    there are no guarantees

  2. You said that MESI could, potentially, provide full sequential consistency, which is true but together with the rhetoric afterwards could be interpreted as x86 offering sequential consistency. In fact sequential consistency is really too expensive. It means (IIRC) that writes have to wait for all preceding reads, which is sometimes insanely expensive and really not worth it.

    Excellent article. I look forward to the follow-up.

    I’ve always found it curious (and it is a perpetual source of confusion to many developers) that reordering of memory operations is completely orthogonal to reordering of instructions. And, in fact, most CPUs that reorder instructions do not reorder memory operations, and vice-versa. Odd but true.

    • “In fact sequential consistency is really too expensive. It means (IIRC) that writes have to wait for all preceding reads, which is sometimes insanely expensive and really not worth it.”
      In the x86 memory model, writes do have to wait for all preceding reads, always. The IA-32 architecture manuals spell this out clearly in section 8.2.2, “Memory Ordering in P6 and More Recent Processor Families”. I quote:

      • Reads are not reordered with other reads.
      • Writes are not reordered with older reads.
      • Writes to memory are not reordered with other writes, with the following exceptions [CLFLUSH, streaming stores, string operations].

      In other words, writes may be reordered with newer reads (i.e. they may finish later than they occur in program order – a natural consequence of having store buffers) but for the most part, that’s it. This is not full sequential consistency (since after all writes may happen later than in program order) but it’s close. I believe x86’s model is essentially the same as SPARC’s “Total Store Order” model, but the details get hairy.

      This sounds as though x86s really can’t do much memory reordering at all, and as if they had to be really slow. But that’s not true. x86 cores can and do reorder memory operations – they’re just not allowed to get caught :). This is where the whole Memory Order Buffer business comes into play. MOB entries are assigned in program order, and are also retired in program order. Imagine that all memory operations in the MOB get tagged with a time stamp (just a counter) that’s assigned in an order consistent with the memory model. That’s not necessarily how it works internally, but it’s a decent model.

      While an entry is in the MOB (i.e. a memory operation is pending and not yet retired), we need to watch out for external events (i.e. cache invalidations) that can invalidate the results of that operation. When such an event occurs, that operation was invalid (same as if it was a mispredicted branch), and so are all instructions that followed it in “memory order”. So now the core needs to roll back the state to right before it executed that instruction, and try again. The primary cost, as far as I can tell (and the devil’s in the details, so having never implemented it, I might well be missing something important!), is that you really need to be able to detect ordering violations for all cache lines referenced by any instruction in the MOB. I believe other processors with an explicit locked cache line and LL/SC semantics (e.g. POWER or ARM) only need to be able to detect this case for the locked cache line, which is much simpler. On the other hand, I believe it’s that same infrastructure that allowed Intel to put the transactional memory extensions into Haswell with relatively little work; most of the heavy lifting was already being done anyway!

      Anyway, it’s essentially just another kind of speculative execution. Just as we can speculate on our branch prediction being correct, and rolling back operations when it wasn’t, we can speculate on memory ordering violations being rare, and roll back when they occur.

  3. Helpful article! You may want to add a paragraph describing N-way associative caching, and perhaps link to Ulrich Drepper’s “What Every Programmer Should Know About Memory”, which is a great read.

    • I intentionally left out cache associativity and several other topics. (Each section has keywords with additional topics you can search for.)

      Set associativity is an important part of how caches work but has no effect whatsoever on cache coherency which is the topic of this post.

  4. The stay around in this queue for a while until the cache
    The should be They

    the machine is state is rolled back to the last time
    Delete the first is

    Thanks for writing this, I learned a lot, and it was at just the right level of abstraction for me to understand as a software dev.

    Do you have a post about complications with distributed shared memory? Why does it seem that software based coherence protocols like Paxos are more popular than hardware based ones like Microsoft’s FaRM?

    Have you heard of ScaleMP? They’re a hypervisor that presents memory from many small nodes as a single address space. How do you think they would maintain coherency?

    • Oops, this got buried in the comment approval queue for a while. Thanks for your corrections, all fixed!

      I have never interacted with distributed shared memory and can’t comment on it.

  5. puneet

    Let’s take an instance where we have two cores, each with its own L1$, plus a shared L2, and the model is coherent. Now, at a given time the memory location 0xFFFF_CCCC is already cached in L1$ with value 0x0. Core0 writes 1 to 0xFFFF_CCCC and Core1 reads 0xFFFF_CCCC. What will core0 read in write-back mode or write-through mode? I believe that, in spite of being coherent, at that instant core0 will read 0xFFFF_CCCC as 0x1 and core1 will read 0x0.

    • Core 0 can’t write before acquiring cache line exclusive ownership, which invalidates core 1’s copy.

  6. Excellent post.

  7. A question from a beginner that everybody else probably knows the answer to – I understand the need for a cache but I don’t understand why there are multiple levels of cache instead of having just one larger level. In other words, let’s say the L1 cache is 32K, the L2 cache is 256K, and the L3 cache is 2M, why not have a single 32K + 256K + 2M L1 cache?

    Some trivial typos:
    “there’s a bunch of basics” -> “there are a bunch of basics”
    “there’s some exceptions for memory-mapped IO” -> “there are some exceptions for memory-mapped IO”

    • Thanks for the typos, fixed!

      Replying to your main question took me a while and was meaty enough to turn it into its own post, which is now here. Thanks for the question and I hope the wait’s been worth it. :)

    • Actually, “there are a bunch of basics” is incorrect. The rule here is, the subject must match the verb in number. There’s only one bunch, so “there is a bunch”. This is a common overcorrection because people overlook the effect of “bunch of” and try to match the verb number to “basics”. Not correct.
