How CPU Caches Work

- 18 mins

A Friendly Tour of CPU Caches

This post is a tour of how CPU caches actually work, using a simple real‑world picture: a busy office with desks, cabinets, and a big archive room.


What we’ll cover

Here’s the path we’ll walk:

  1. Cache overview
  2. Data layout and cache friendliness
  3. Locality of reference
  4. Cache lines
  5. Sets, ways, and tags
  6. Replacement policies
  7. Cache writes and consistency
  8. Write-through vs write-back
  9. Write-allocate vs no-write-allocate
  10. Cache hierarchy (L1, L2, L3)
  11. SRAM vs DRAM
  12. Summary

Along the way we’ll sprinkle in “Tiny explainer” notes to keep the tone light.


1. Cache overview

Imagine you work in a busy office.

The closer something is to you, the faster you can reach it, but the less it can hold:

  - your desk: a few folders, instantly within reach
  - the filing cabinet: more folders, a short walk away
  - the archive room: everything, but a long walk every time

In the CPU, the picture looks like this:

  - L1 cache: your desk, tiny and very fast
  - L2 cache: the filing cabinet, bigger but slower
  - L3 cache: a larger cabinet further away, bigger and slower still
  - RAM: the archive room, huge but comparatively slow

The CPU always tries to use data from the closest place first. If it can’t find it there, it walks further away:

L1 → L2 → L3 → RAM

Every time it has to “walk to the archive room” your program slows down a bit. Caches exist so that doesn’t happen too often.

Tiny explainer - CPU core
A core is like one worker inside your computer’s “office”. A CPU with 8 cores has 8 workers. Each core usually has its own small caches, but they all share main memory.

2. Data layout and cache friendliness

Let’s say you manage files for 1,000 customers. For each customer you have:

  - a contact sheet
  - a billing sheet
  - an order history

You can organize your desk in two ways.

Option A - One big folder per customer (object-oriented)

You keep one big folder per customer, with that customer’s contact sheet, billing sheet, and order history all inside it.

Now your manager walks over and says: “Update the billing info for all customers.”

You now:

  - pull out customer 1’s folder, flip to the billing sheet, update it, and put the folder back,
  - pull out customer 2’s folder and do the same,
  - repeat, 1,000 times.

If your desk is crowded, some of these folders are:

  - already on your desk,
  - in the filing cabinet,
  - or all the way back in the archive room.

You end up constantly pulling a folder from somewhere, opening it, flipping pages, updating, putting it back. Stop‑start, lots of jumping around.

This is similar to an object‑oriented memory layout: each “object” (customer) contains all of its fields (contact, billing, orders), and those objects end up scattered across RAM. When the CPU wants “billing for all customers”, it keeps jumping to different places in memory.

Option B - Group by type (data-oriented)

Instead, you could make three stacks:

  - one stack with all 1,000 contact sheets
  - one stack with all 1,000 billing sheets
  - one stack with all 1,000 order histories

Now when your manager says:

“Update the billing info for all customers.”

You:

  - pick up the billing stack,
  - work through it sheet by sheet, top to bottom,
  - and never touch the contact sheets or order histories at all.

This is data‑oriented design. It lines up similar data in memory so the CPU can walk through it in order.

Tiny explainer - “data-oriented” vs “object-oriented”
Object-oriented: “all things for this one customer live together” but different customers may be scattered.
Data-oriented: “all billing sheets live together, all contacts live together” so operations that touch one kind of data can run in smooth loops.
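
If you like code, here is a minimal C sketch of the same idea. The `Customer` struct, its fields, and the `apply_fee_*` functions are made up for illustration; the only point is where the billing data ends up in memory.

```c
#include <stddef.h>

/* Option A: object-oriented layout ("array of structs").
   Everything for one customer sits together; customers sit one after another. */
struct Customer {
    char   contact[64];   /* contact sheet */
    double balance;       /* billing info  */
    int    order_count;   /* order history, simplified to a single number */
};

/* Updating billing only touches 'balance', yet the cache also drags in
   the surrounding contact bytes it never needs. */
void apply_fee_aos(struct Customer *customers, size_t n, double fee) {
    for (size_t i = 0; i < n; i++)
        customers[i].balance += fee;
}

/* Option B: data-oriented layout ("struct of arrays").
   All balances live next to each other in one contiguous array. */
struct Customers {
    char   (*contacts)[64];
    double *balances;
    int    *order_counts;
};

/* The same update now walks a dense array of doubles, so every byte
   the cache fetches is actually used. */
void apply_fee_soa(struct Customers *c, size_t n, double fee) {
    for (size_t i = 0; i < n; i++)
        c->balances[i] += fee;
}
```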

3. Locality of reference

Caches rely on a simple observation about most programs: when the CPU uses a piece of data once, it will likely touch it again soon and will likely touch nearby data as well.

There are two names for this:

  - Temporal locality is about time: something used recently is likely to be used again soon.
  - Spatial locality is about location: if you used something at address X, you will likely use X+1, X+2, X+3 soon.

Your desk stack reflects temporal locality: you keep recent folders near you. Your billing stack reflects spatial locality: items that will be processed together are next to each other.

Caches lean on both ideas.
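
As a tiny, made-up example, both kinds of locality show up in an ordinary summation loop in C:

```c
#include <stddef.h>

double sum(const double *values, size_t n) {
    double total = 0.0;            /* reused on every iteration: temporal locality */
    for (size_t i = 0; i < n; i++)
        total += values[i];        /* values[i], values[i+1], ... are neighbors: spatial locality */
    return total;
}
```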

4. Cache lines

The CPU doesn’t fetch one byte at a time from memory. That would be far too slow. Instead it fetches fixed‑size chunks called cache lines.

On many systems, a cache line is 64 bytes long: one contiguous 64-byte chunk of memory that moves between RAM and the cache as a single unit.

When the CPU reads a single value from memory, the cache pulls in the entire line that contains that value. If your billing sheets are laid out one after another in memory, a single line can hold several consecutive billing entries.

Accessing one billing entry, say B1, might pull its neighbors B0, B2, and B3 into the cache as well, because they share a line. If the program soon touches B2 or B3, they are already “on the desk”.

If the data is scattered (like folders all over the office), each access can drag in a line mostly filled with things you never touch, wasting cache space.
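
A small C sketch of why this matters, assuming the common 64-byte line and an 8-byte `double` (both typical, neither guaranteed):

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64                                    /* typical, not universal */
#define DOUBLES_PER_LINE (CACHE_LINE_BYTES / sizeof(double))   /* 64 / 8 = 8 entries     */

/* Sequential walk: every byte of every fetched cache line gets used. */
double sum_sequential(const double *a, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Strided walk: touching one double per line means each 64-byte fetch
   delivers 8 useful bytes and 56 bytes of waste. */
double sum_strided(const double *a, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE)
        total += a[i];
    return total;
}
```

Both loops do the same work per element they touch, but the strided one wastes most of every line it drags in.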

Tiny explainer - 1 byte, 1 KB, 1 MB
1 byte ≈ 1 letter of text.
1 KB (kilobyte) ≈ 1,000 bytes.
1 MB (megabyte) ≈ 1,000,000 bytes.
A 64‑byte cache line is tiny compared to modern memory sizes.

5. Sets, ways, and tags

Inside the CPU, the cache is not just a random pile of lines.

Most CPU caches are set‑associative: they’re divided into rows called sets, and each set contains a small number of slots called ways, each way holding one cache line.

Picture it as a grid: each row is a set, and each box in a row is a way.

If there are 4 boxes per row, we call it a 4‑way set‑associative cache.

Tiny explainer - associativity
Associativity is just “how many slots per set”. 2‑way = 2 lines per set, 4‑way = 4 lines per set, and so on. More ways means more flexibility, but also more hardware complexity.

When the CPU asks for data at some memory address, the cache doesn’t search everything. It uses part of the address to decide which row (which set) to look in.

You can think of the address as being split into three fields:

  - the offset: which byte inside the 64-byte line,
  - the index: which set (row) to look in,
  - the tag: the rest of the address, used to tell apart different lines that map to the same set.

Lookup works like this:

  1. The cache uses the index bits to select a set.
  2. It checks the tags of the lines in that set to see if any match.
  3. If a tag matches, that’s a cache hit. The offset tells it which byte to return from that line.
  4. If none match, that’s a cache miss: the cache must fetch the corresponding line from a lower cache level or from RAM, place it in one of the slots in that set, and then serve the request.
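
Here is what that address split can look like in C, for a hypothetical cache geometry (64-byte lines and 64 sets; real CPUs vary):

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES   64u   /* bytes per line -> 6 offset bits */
#define NUM_SETS     64u   /* sets per cache -> 6 index bits  */
#define OFFSET_BITS  6u
#define INDEX_BITS   6u

int main(void) {
    uint64_t addr = 0x7ffd12345678u;   /* some memory address */

    uint64_t offset = addr & (LINE_BYTES - 1);                 /* byte within the line     */
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* which set (row) to check */
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* compared against stored tags */

    printf("offset=%llu index=%llu tag=0x%llx\n",
           (unsigned long long)offset,
           (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}
```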

Tiny explainer - hit vs miss
Hit: the thing you need is already in the cache → fast. Miss: it isn’t, so the CPU has to go further out (L2 / L3 / RAM) → slower.

This “rows and a few slots per row” structure is a compromise between two extremes:

  - A direct‑mapped cache (1 way per set): each address has exactly one slot it can live in. Very simple to check, but two hot addresses that map to the same slot keep evicting each other.
  - A fully associative cache: any line can go in any slot. Very flexible, but every lookup has to compare against every tag, which is expensive in hardware.

Set‑associative caches sit in between: a handful of slots per set keeps lookups cheap while avoiding most of those conflicts.

6. Replacement policies

Each set has a fixed number of slots. Eventually, all the slots in a set will be occupied. When a miss happens and a new line needs to be inserted into that full set, the cache has to evict one of the existing lines.

This is exactly like your desk: if you allow only four open folders and you need a fifth, one of the four must go back into the cabinet.

The rule for choosing which line to evict is the replacement policy.

The most natural policy is called Least Recently Used (LRU): throw out the line you haven’t touched for the longest time. If you haven’t looked at a folder in hours and you’ve used the other three in the last minute, the “cold” folder is the best candidate to return to the cabinet.

Implementing exact LRU, however, is expensive in hardware, because it requires tracking the exact order of use within each set and updating it on every access. For small sets this is doable; for larger ones it becomes complicated.

To get most of the benefit at lower cost, many CPU caches use approximations of LRU.

Tree-based pseudo-LRU for a 4-way set

One approach is tree‑based pseudo‑LRU. Think of the four slots in a set as leaves of a tiny binary tree. Each internal node stores one bit indicating which side (left or right) was used more recently. When you access a line, you walk down the tree and update these bits along the path to mark that side as “most recent”. When you need to evict, you walk the tree in the opposite direction, following the “less recent” directions to arrive at a line that is likely to be older.
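
Here is one way such a scheme could look in C, for a single 4-way set. The bit layout and the convention (each bit points toward the side to evict next) are one common arrangement, not a description of any particular CPU:

```c
#include <stdint.h>

/* Three bits per 4-way set: bit 0 is the root (left pair {0,1} vs right pair {2,3}),
   bit 1 chooses between ways 0 and 1, bit 2 between ways 2 and 3.
   Convention here: a bit points toward the side to evict next. */
typedef struct {
    uint8_t plru;   /* only bits 0..2 are used */
} plru_set_t;

/* On an access to 'way', update the bits on its path to point away from it. */
void plru_touch(plru_set_t *s, int way) {
    if (way < 2) {
        s->plru |= 1u << 0;                    /* next victim: the right pair        */
        if (way == 0) s->plru |=  (1u << 1);   /* within the left pair, evict way 1  */
        else          s->plru &= ~(1u << 1);   /* within the left pair, evict way 0  */
    } else {
        s->plru &= ~(1u << 0);                 /* next victim: the left pair         */
        if (way == 2) s->plru |=  (1u << 2);   /* within the right pair, evict way 3 */
        else          s->plru &= ~(1u << 2);   /* within the right pair, evict way 2 */
    }
}

/* To pick a victim, follow the bits from the root down. */
int plru_victim(const plru_set_t *s) {
    if (s->plru & (1u << 0))
        return (s->plru & (1u << 2)) ? 3 : 2;  /* right pair */
    return (s->plru & (1u << 1)) ? 1 : 0;      /* left pair  */
}
```

With only three bits per set this is cheap to store and update, and it tends to pick a line close to the true least recently used one.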

Another simple approximation is Not Recently Used (NRU). Each line gets a single bit. When you access that line, you clear its bit to 0 (“recently used”). When you need to evict something, you look for a line whose bit is still 1 (“not recently used”). If every line’s bit is 0, you reset all bits to 1 and then choose one. It’s not perfect, but it tends to evict lines that haven’t been touched in a while.
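
And a short sketch of NRU for one 4-way set, following the description above (the names are made up):

```c
#include <stdbool.h>

#define WAYS 4

typedef struct {
    bool not_recently_used[WAYS];   /* one bit per line: true = not recently used */
} nru_set_t;

/* Accessing a line clears its bit ("recently used"). */
void nru_touch(nru_set_t *s, int way) {
    s->not_recently_used[way] = false;
}

/* Evict a line whose bit is still set; if none, reset all bits and pick one. */
int nru_victim(nru_set_t *s) {
    for (int w = 0; w < WAYS; w++)
        if (s->not_recently_used[w])
            return w;

    for (int w = 0; w < WAYS; w++)
        s->not_recently_used[w] = true;
    return 0;
}
```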

The exact details differ between cache levels and CPU designs, but the purpose stays the same: find a line that is unlikely to be used again soon, and reuse that slot.

7. Cache writes and consistency

So far, we’ve focused on reading: either the data is in the cache (hit) or it isn’t (miss).

Reads are conceptually simple. Writes, on the other hand, create a new problem: once you write to data in the cache, the value in the cache and the value in RAM may no longer match.

If the CPU has multiple cores, things get even trickier. Imagine two workers:

  - Worker A copies a customer’s billing sheet to their desk and updates it.
  - Worker B still has an old copy of the same sheet on their desk and keeps reading outdated numbers.

In computers, making sure different cores see consistent values is the job of cache coherence protocols. Those are an entire topic on their own. At the level of a single cache, though, we already need two decisions:

  - When we write to data that is already in the cache, do we update RAM immediately or later? (write‑through vs write‑back)
  - When we write to data that is not in the cache yet, do we bring it in first or write straight to RAM? (write‑allocate vs no‑write‑allocate)

8. Write-through vs write-back

First, assume the cache already holds the line we want to modify. That’s a write hit.

There are two main strategies.

Write-through

With write‑through, every time you update the cached copy, you also immediately update RAM. The cache and RAM are kept in sync at all times.

In the office analogy: every time you change something in the folder on your desk, you also immediately walk over and update the master copy in the archive. Nothing ever gets out of sync, but you do a lot of walking.

Write-back

With write‑back, the cache updates only its own copy at first. RAM is updated later, usually when the cache line is evicted or explicitly flushed.

To keep track of which lines differ from RAM, the cache adds a dirty bit to each line. When a line is modified, its dirty bit is set to 1. When that line must be evicted, the cache checks the bit: if it is 1, the line is written back to RAM; if it is 0, the line can be dropped without writing.

This policy can significantly reduce the number of RAM writes, because multiple updates to the same line get combined into a single write‑back. The trade‑off is extra complexity, and the risk that if power is lost at the wrong moment, dirty lines may not yet be reflected in RAM.
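
A toy C sketch of the two write-hit policies and the dirty bit (the `line_t` type and the `write_to_ram()` stand-in are invented; a real cache does this in hardware):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t data[64];   /* the cached copy of one 64-byte line */
    bool    dirty;      /* only meaningful for write-back      */
} line_t;

/* Stand-in for the (much slower) trip to main memory. */
void write_to_ram(const line_t *line) { (void)line; }

/* Write hit, write-through: cache and RAM are updated together. */
void write_hit_through(line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    write_to_ram(line);             /* pay the RAM cost on every single write */
}

/* Write hit, write-back: only the cached copy changes for now. */
void write_hit_back(line_t *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;             /* remember that RAM is now stale */
}

/* On eviction, a write-back cache settles its debt. */
void evict(line_t *line) {
    if (line->dirty) {
        write_to_ram(line);         /* flush the newer data to RAM */
        line->dirty = false;
    }
    /* a clean line can simply be dropped */
}
```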

Tiny explainer - dirty vs clean
Clean line: the cache and RAM hold the same data. Dirty line: the cache holds newer data than RAM; RAM must be updated before the line is discarded.

9. Write-allocate vs no-write-allocate

Now consider a write miss: the CPU wants to write to an address that is not currently represented in the cache.

There are two main options.

Write-allocate

With write‑allocate, the cache treats the write miss much like a read miss:

  1. It fetches the corresponding line from a lower cache level or from RAM.
  2. It places that line in one of the slots of the right set, evicting another line if necessary.
  3. It then applies the write to the cached copy.

If the cache uses write‑back, the line is now marked dirty.

This approach assumes that if you are writing to a location, you might well read or write it again soon. Once the line is in the cache, future accesses will hit.

No-write-allocate

With no‑write‑allocate, the cache does not bring the line in on a write miss. Instead, it sends the write directly to RAM and leaves the cache unchanged.

In office terms: if you only need to jot a quick note in a file you never expect to touch again, you might walk to the archive, write the change directly in the master copy, and leave it there instead of bringing it to your desk.

This is useful for patterns like logging or streaming where data is written once and rarely read again. Pulling that data into the cache would just evict more useful lines.

In practice, many CPU data caches use write‑back + write‑allocate as the default combination: it keeps recently written data close to the core and avoids unnecessary RAM traffic. Other parts of the system might use write‑through + no‑write‑allocate where simplicity or consistency is more important than raw speed.
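
To round this off, here is a toy sketch of the write-miss decision: a tiny model with one line slot and a fake RAM array, where every name is invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 64, RAM_BYTES = 4096 };

static uint8_t  ram[RAM_BYTES];        /* pretend main memory                        */
static uint8_t  cached[LINE_BYTES];    /* a single cache line slot, to keep it tiny  */
static uint32_t line_base;             /* address of the first byte held in 'cached' */
static bool     line_valid, line_dirty;

/* Handle a write to 'addr' when the line is NOT in the cache (a write miss). */
void write_byte_on_miss(uint32_t addr, uint8_t value, bool write_allocate) {
    if (write_allocate) {
        /* Bring the whole line in first, just like a read miss... */
        line_base = addr & ~(uint32_t)(LINE_BYTES - 1);
        memcpy(cached, &ram[line_base], LINE_BYTES);
        line_valid = true;
        /* ...then update the cached copy; with write-back it is now dirty. */
        cached[addr - line_base] = value;
        line_dirty = true;
    } else {
        /* No-write-allocate: leave the cache alone, write straight to RAM. */
        ram[addr] = value;
    }
}
```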

Tiny explainer - instruction vs data cache
Instruction cache: holds the CPU instructions (the program code) it expects to run soon.
Data cache: holds the variables and data that code works on.
Separating them (at L1) avoids some types of interference.

10. Cache hierarchy: L1, L2, L3

Modern CPUs use more than one cache level:

  - L1: the smallest and fastest, private to each core.
  - L2: larger and a bit slower, usually still per core.
  - L3: larger and slower again, typically shared between cores.

L1 is often split into a data cache and an instruction cache:

  - the L1 instruction cache holds the code the core is about to execute,
  - the L1 data cache holds the values that code reads and writes.

When the CPU needs data, it checks each level in order, starting from the closest:

  1. Look in L1. If it’s there, use it.
  2. Miss? Check L2.
  3. Miss again? Check L3.
  4. Still a miss? Fetch the line from RAM and bring it into the cache.

There is one more structural choice: how multiple levels relate to each other.

An inclusive cache hierarchy ensures that if a line is in L1, it is also in L2 (and possibly in L3). That is, L2 “includes” everything in L1. This can simplify some coherence operations.

An exclusive hierarchy enforces that a line can live in only one level at a time. If a line is in L1, it is not in L2 and vice versa. This avoids duplicating the same line in multiple caches and can increase total effective capacity.

A non‑inclusive, non‑exclusive hierarchy does not enforce either property strictly. Lines may or may not be duplicated between levels, depending on what the cache controller decides is best.

For day‑to‑day work as a programmer, the important point is that the closer levels are much faster, much smaller, and much more sensitive to how your data is laid out and accessed. If the working set of your code fits in L1 or L2 and is accessed with good locality, it will run noticeably faster than if it constantly pulls data from RAM.

11. SRAM vs DRAM

So far, the differences between caches and RAM have mostly been about location and structure. There is also a hardware reason: caches and RAM are built out of different memory technologies.

Caches use SRAM (Static Random Access Memory). Main memory uses DRAM (Dynamic Random Access Memory).

An SRAM cell typically uses six transistors arranged as a tiny flip‑flop that can hold a 0 or 1 as long as power is applied. Reads and writes are direct and fast. There is no need to refresh the stored value periodically.

A DRAM cell, by contrast, uses one transistor and one capacitor. The presence or absence of charge in the capacitor represents a bit. This design is very compact and cheap per bit, which is why you can have many gigabytes of DRAM on a system.

However, capacitors leak charge over time. That means DRAM cells must be refreshed regularly: each cell is periodically read and then rewritten to restore the charge. Reading a DRAM cell is also effectively destructive: the sense amplifier drains the capacitor to determine its state and then restores it. Also, DRAM is organized as a matrix of rows and columns, and accessing it involves a two‑step addressing process (selecting a row, then a column), which adds extra delay.

SRAM doesn’t need refresh, and its access path is simpler, so it can be much faster. The cost is area: six transistors per bit occupy much more silicon than one transistor plus one capacitor.

The result is a natural split:

  - SRAM: fast but bulky and expensive per bit → small caches close to the core.
  - DRAM: slower but dense and cheap per bit → gigabytes of main memory.

12. Summary

Viewed from a distance, CPU caches are a fairly simple idea built out with a lot of detail.

Your CPU has:

  - a few small, very fast caches close to each core (L1, L2),
  - a larger, slower cache further out (L3),
  - and a big pool of main memory (RAM) that is much slower to reach.

It tries hard to keep the data it needs now and next in the close, fast levels, and it uses locality of access (“recently used” and “nearby in memory”) to guess what that data will be.

Internally, caches divide their storage into lines, sets, and ways. They track which lines are in use and evict older ones when new data has to be brought in. They handle writes carefully to keep RAM consistent, sometimes delaying updates with write‑back and sometimes going straight through with write‑through. Multiple cache levels coordinate with inclusion policies. Underneath it all, caches use SRAM and RAM uses DRAM, which is why there is a speed difference in the first place.

For you as a programmer or a curious reader, the core practical ideas are:

  - Data that is laid out contiguously and processed in order is cache friendly.
  - Data that is scattered and accessed in a random order causes misses and stalls.
  - Reusing data while it is still “hot” in the cache is cheap; coming back to it much later is not.

That’s the basic story of how CPU caches work: a hierarchy of small, fast memories trying to keep your most relevant data as close to the core as possible, so your programs spend less time waiting and more time doing actual work.
