How CPU Caches Work
A Friendly Tour of CPU Caches
This post is a tour of how CPU caches actually work, using a simple real‑world picture: a busy office with desks, cabinets, and a big archive room.
What we’ll cover
Here’s the path we’ll walk:
- Cache overview
- Data layout and cache friendliness
- Locality of reference
- Cache lines
- Sets, ways, and tags
- Replacement policies
- Cache writes and consistency
- Write-through vs write-back
- Write-allocate vs no-write-allocate
- Cache hierarchy (L1, L2, L3)
- SRAM vs DRAM
- Summary
Along the way we’ll sprinkle in Tiny explainers to keep the tone light.
1. Cache overview
Imagine you work in a busy office.
- There is a huge archive room with shelves of folders. That’s your main memory (RAM).
- You have a small cabinet next to your desk with the files you use a lot. That’s like your L2 / L3 cache.
- On your desk, you keep a tiny pile of folders open right in front of you. That’s your L1 cache.
The closer something is to you:
- The faster you can grab it.
- The fewer things you can keep there (space is limited).
In the CPU, the picture looks like this:

The CPU always tries to use data from the closest place first. If it can’t find it there, it walks further away:
L1 → L2 → L3 → RAM
Every time it has to “walk to the archive room” your program slows down a bit. Caches exist so that doesn’t happen too often.
Tiny explainer - CPU core
A core is like one worker inside your computer’s “office”. A CPU with 8 cores has 8 workers. Each core usually has its own small caches, but they share main memory.
2. Data layout and cache friendliness
Let’s say you manage files for 1,000 customers. For each customer you have:
- A contact sheet
- A billing sheet
- An orders sheet
You can organize your desk in two ways.
Option A - One big folder per customer (object-oriented)
You keep a big folder per customer.

Now your manager walks over and says: “Update the billing info for all customers.”
You now:
- Grab Customer A’s folder → flip to billing → update
- Grab Customer B’s folder → flip to billing → update
- Grab Customer C’s folder → and so on…
If your desk is crowded, some of these folders are:
- On the desk,
- Some in the side cabinet,
- Some already back in the archive.
You end up constantly pulling a folder from somewhere, opening it, flipping pages, updating, putting it back. Stop‑start, lots of jumping around.
This is similar to an object‑oriented memory layout: each “object” (customer) contains all fields (contact, billing, orders) inside it, and those objects end up scattered across RAM. When the CPU wants “billing for all customers”, it keeps jumping to different places in memory.
Option B - Group by type (data-oriented)
Instead, you could make three stacks:

Now when your manager says:
“Update the billing info for all customers.”
You:
- Take the billing stack
- Start at the top
- Go straight down: B0 → B1 → B2 → …
This is data‑oriented design. It lines up similar data in memory so the CPU can walk through it in order.
Tiny explainer - “data-oriented” vs “object-oriented”
Object-oriented: “all things for this one customer live together” but different customers may be scattered.
Data-oriented: “all billing sheets live together, all contacts live together” so operations that touch one kind of data can run in smooth loops.
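To make the two options concrete, here’s a minimal C sketch of the same idea. The struct shapes and field names are invented for illustration, not taken from any real codebase.

```c
#include <stddef.h>

/* Option A: one "folder" per customer (array of structs). */
struct Customer {
    char   contact[64];
    double billing_balance;
    int    order_count;
};

/* Option B: group by type (struct of arrays). */
struct Customers {
    char   (*contact)[64];   /* all contact sheets, back to back */
    double *billing_balance; /* all billing sheets, back to back */
    int    *order_count;     /* all order sheets, back to back   */
    size_t  count;
};

/* "Update billing for all customers", Option A style: every iteration drags a
   whole Customer (contact sheet and all) through the cache to touch one field. */
void update_billing_aos(struct Customer *c, size_t n, double fee) {
    for (size_t i = 0; i < n; i++)
        c[i].billing_balance += fee;
}

/* The same job, Option B style: a straight walk over one dense array, so each
   cache line fetched is packed with billing values we actually need. */
void update_billing_soa(struct Customers *c, double fee) {
    for (size_t i = 0; i < c->count; i++)
        c->billing_balance[i] += fee;
}
```

Both functions do the same arithmetic; the difference is how much of every fetched cache line is actually useful, which is exactly what the next sections are about.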
3. Locality of reference
Caches rely on a simple observation about most programs: when the CPU uses a piece of data once, it will likely touch it again soon and will likely touch nearby data as well.
There are two names for this:
- Temporal locality - time based
- Spatial locality - location based
Temporal locality is about time: recently used → likely to be used again. Spatial locality is about location: used something at address X → likely to use X+1, X+2, X+3 soon.
Your desk stack reflects temporal locality: you keep recent folders near you. Your billing stack reflects spatial locality: items that will be processed together are next to each other.
Caches lean on both ideas.
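As a tiny illustration (the array and the stride value are arbitrary), here are two C loops that do similar work with very different locality:

```c
#include <stddef.h>

/* Good spatial locality: consecutive ints share cache lines, so each fetched
   line serves many iterations. `total` also shows temporal locality: it is
   touched every iteration and stays hot. */
long sum_sequential(const int *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Poor spatial locality: with a large stride, almost every access lands on a
   different cache line, so most of each fetched line goes unused. */
long sum_strided(const int *a, size_t n, size_t stride) {
    long total = 0;
    for (size_t i = 0; i < n; i += stride)
        total += a[i];
    return total;
}
```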
4. Cache lines
The CPU doesn’t fetch one byte at a time from memory. That would be far too slow. Instead it fetches fixed‑size chunks called cache lines.
On many systems, a cache line is 64 bytes long. You can picture it like this:

When the CPU reads a single value from memory, the cache pulls in the entire line that contains that value. If your billing sheets are laid out one after another in memory, a single line can hold several consecutive billing entries.
Accessing B1 might pull B0 and B2 into the cache as well. If the program soon touches B2 or B3, they are already “on the desk”.
If the data is scattered (like folders all over the office), each access can drag in a line mostly filled with things you never touch, wasting cache space.
Tiny explainer - 1 byte, 1 KB, 1 MB
1 byte ≈ 1 letter of text.
1 KB (kilobyte) ≈ 1,000 bytes.
1 MB (megabyte) ≈ 1,000,000 bytes.
A 64‑byte cache line is tiny compared to modern memory sizes.
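You can check the “neighbours come along for free” effect with a bit of pointer arithmetic. This sketch assumes 64‑byte lines and 4‑byte ints, which is common but not universal:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u   /* assumed line size; check your CPU for the real value */

int main(void) {
    int billing[1000];  /* stand-in for the billing "stack" */

    /* Two addresses sit on the same cache line when they agree in everything
       above the low offset bits, i.e. when addr / LINE_SIZE is equal. */
    uintptr_t b1 = (uintptr_t)&billing[1];
    uintptr_t b2 = (uintptr_t)&billing[2];

    printf("one %u-byte line holds %zu ints\n", LINE_SIZE, LINE_SIZE / sizeof(int));
    printf("B1 and B2 share a line here? %s\n",
           (b1 / LINE_SIZE == b2 / LINE_SIZE) ? "yes" : "no");
    return 0;
}
```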
5. Sets, ways, and tags
Inside the CPU, the cache is not just a random pile of lines.
Most CPU caches are set‑associative: they’re divided into rows called sets, and each set contains a small number of slots called ways, each way holding one cache line.

Each row is a set. Each box in that row is a way.
If there are 4 boxes per row, we call it a 4‑way set‑associative cache.
Tiny explainer - associativity
Associativity is just “how many slots per set”. 2‑way = 2 lines per set, 4‑way = 4 lines per set, etc. More ways means more flexibility but also more hardware complexity.
When the CPU asks for data at some memory address, the cache doesn’t search everything. It uses part of the address to decide which row (which set) to look in.
You can think of the address being split into three fields:
- A set index: which set (row) to use
- An offset: where inside the 64‑byte line the exact byte is
- A remaining tag: a label that says which region of memory this line belongs to
In schematic form:

Lookup works like this:
- The cache uses the index bits to select a set.
- It checks the tags of the lines in that set to see if any match.
- If a tag matches, that’s a cache hit. The offset tells it which byte to return from that line.
- If none match, that’s a cache miss: the cache must fetch the corresponding line from a lower cache level or from RAM, put it in one of the slots in that set, and then serve the request.
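Here’s a hedged sketch of that split in C, assuming 64‑byte lines (6 offset bits) and 64 sets (6 index bits); real caches choose their own sizes:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 2^6 = 64-byte lines (assumed) */
#define INDEX_BITS  6   /* 2^6 = 64 sets (assumed)       */

int main(void) {
    uint64_t addr = 0x7ffd1234abcd;  /* an arbitrary example address */

    uint64_t offset = addr & ((1ull << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1ull << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* The index picks the set (row), the tag is compared against every way in
       that set, and the offset selects the byte inside the matching line. */
    printf("offset = %llu, index = %llu, tag = 0x%llx\n",
           (unsigned long long)offset,
           (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}
```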
Tiny explainer - hit vs miss
Hit: the thing you need is already in the cache → fast. Miss: it isn’t, so the CPU has to go further out (L2 / L3 / RAM) → slower.
This “rows and a few slots per row” structure is a compromise between two extremes:
- You could force each address to go to exactly one slot (very simple, but leads to more conflicts).
- You could allow each address to go anywhere (very flexible, but hard and slow to search).
Set‑associative caches sit in the middle.
6. Replacement policies
Each set has a fixed number of slots. Eventually, all the slots in a set will be occupied. When a miss happens and a new line needs to be inserted into that full set, the cache has to evict one of the existing lines.
This is exactly like your desk: if you allow only four open folders and you need a fifth, one of the four must go back into the cabinet.
The rule for choosing which line to evict is the replacement policy.
The most natural policy is called Least Recently Used (LRU): throw out the line you haven’t touched for the longest time. If you haven’t looked at a folder in hours and you’ve used the other three in the last minute, the “cold” folder is the best candidate to return to the cabinet.
Implementing exact LRU, however, is expensive in hardware, because it requires tracking the exact order of use within each set and updating that order on every access. For small sets this is doable; for larger ones it becomes complicated.
To get most of the benefit at lower cost, many CPU caches use approximations of LRU.

One approach is tree‑based pseudo‑LRU. Think of the four slots in a set as leaves of a tiny binary tree. Each internal node stores one bit indicating which side (left or right) was used more recently. When you access a line, you walk down the tree and update these bits along the path to mark that side as “most recent”. When you need to evict, you walk the tree in the opposite direction, following the “less recent” directions to arrive at a line that is likely to be older.
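Here’s a minimal sketch of that tree for one 4‑way set. The bit convention (each bit points at the side to evict from) is just one way to encode it; real designs differ:

```c
#include <stdint.h>

/* Tree pseudo-LRU for one 4-way set: three bits, each pointing toward the
   side that should be evicted from (i.e. the side used less recently). */
struct plru4 {
    uint8_t root;   /* 0: evict from the left pair {0,1}, 1: from the right pair {2,3} */
    uint8_t left;   /* 0: evict way 0, 1: evict way 1 */
    uint8_t right;  /* 0: evict way 2, 1: evict way 3 */
};

/* On an access, flip the bits along the path so they point away from this way. */
static void plru_touch(struct plru4 *t, int way) {
    if (way < 2) {                      /* used the left pair...                */
        t->root = 1;                    /* ...so steer future evictions right   */
        t->left = (way == 0) ? 1 : 0;   /* and toward the sibling of this way   */
    } else {                            /* used the right pair                  */
        t->root = 0;
        t->right = (way == 2) ? 1 : 0;
    }
}

/* To evict, just follow the pointers down to a leaf. */
static int plru_victim(const struct plru4 *t) {
    if (t->root == 0)
        return (t->left == 0) ? 0 : 1;
    else
        return (t->right == 0) ? 2 : 3;
}
```

For example, after touching ways 0, 2, and 1 in that order, plru_victim lands on way 3, which really is the least recently used one in this case.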
Another simple approximation is Not Recently Used (NRU). Each line gets a single bit. When you access that line, you clear its bit to 0 (“recently used”). When you need to evict something, you look for a line whose bit is still 1 (“not recently used”). If every line’s bit is 0, you reset all bits to 1 and then choose one. It’s not perfect, but it tends to evict lines that haven’t been touched in a while.
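And a matching sketch of NRU for one 4‑way set, following the description above (the struct name is made up):

```c
#include <stdint.h>

#define WAYS 4

struct nru_set {
    uint8_t not_recent[WAYS];   /* 1 = not recently used, 0 = recently used */
};

/* On an access, mark the line as recently used. */
static void nru_touch(struct nru_set *s, int way) {
    s->not_recent[way] = 0;
}

/* To evict, take any line still marked "not recently used"; if there is none,
   reset every bit to 1 and pick one (here, simply the first way). */
static int nru_victim(struct nru_set *s) {
    for (int i = 0; i < WAYS; i++)
        if (s->not_recent[i])
            return i;
    for (int i = 0; i < WAYS; i++)
        s->not_recent[i] = 1;
    return 0;
}
```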
The exact details differ between cache levels and CPU designs, but the purpose stays the same: find a line that is unlikely to be used again soon, and reuse that slot.
7. Cache writes and consistency
So far, we’ve focused on reading: either the data is in the cache (hit) or it isn’t (miss).
Reads are conceptually simple. Writes, on the other hand, create a new problem: once you write to data in the cache, the value in the cache and the value in RAM may no longer match.
If the CPU has multiple cores, things get even trickier. Imagine two workers:
- Worker A pulls a customer folder, changes the phone number, and keeps the folder on their desk.
- Worker B pulled the same folder earlier and still has the old phone number in their copy.
- The archive room also still has the old version.
Now there are three copies of the folder, and they no longer all agree on the “truth”.
In computers, making sure different cores see consistent values is the job of cache coherence protocols. Those are an entire topic on their own. At the level of a single cache, though, we already need two decisions:
- On a write hit (the line is already in the cache), do we update RAM immediately or delay it?
- On a write miss (the line is not in the cache), do we bring it into the cache first or write straight to RAM?
These lead to write‑through vs write‑back and write‑allocate vs no‑write‑allocate.
8. Write-through vs write-back
First, assume the cache already holds the line we want to modify. That’s a write hit.
There are two main strategies.
Write-through
With write‑through, every time you update the cached copy, you also immediately update RAM. The cache and RAM are kept in sync at all times.
In the office analogy, this means:
- You modify the folder on your desk.
- Then you immediately walk to the archive room and update the master copy too.
This policy is simpler to reason about, but it generates a lot of traffic to RAM: every write goes all the way out.
Write-back
With write‑back, the cache updates only its own copy at first. RAM is updated later, usually when the cache line is evicted or explicitly flushed.
To keep track of which lines differ from RAM, the cache adds a dirty bit to each line. When a line is modified, its dirty bit is set to 1. When that line must be evicted, the cache checks the bit: if it is 1, the line is written back to RAM; if it is 0, the line can be dropped without writing.
This policy can significantly reduce the number of RAM writes, because multiple updates to the same line get combined into a single write‑back. The trade‑off is extra complexity, and the risk that if power is lost at the wrong moment, dirty lines may not yet be reflected in RAM.
Tiny explainer - dirty vs clean
Clean line: cache and RAM have the same data. Dirty line: the cache contains newer data than RAM, so RAM must be updated before the line is discarded.
9. Write-allocate vs no-write-allocate
Now consider a write miss: the CPU wants to write to an address that is not currently represented in the cache.
There are two main options.
Write-allocate
With write‑allocate, the cache treats a write miss much like a read miss:
- It fetches the corresponding line from RAM into the cache.
- It applies the write to the cached line.
If the cache uses write‑back, the line is now marked dirty.
This approach assumes that if you are writing to a location, you might well read or write it again soon. Once the line is in the cache, future accesses will hit.
No-write-allocate
With no‑write‑allocate, the cache does not bring the line in on a write miss. Instead, it sends the write directly to RAM and leaves the cache unchanged.
In office terms: if you only need to jot a quick note in a file you never expect to touch again, you might walk to the archive, write the change directly in the master copy, and leave it there instead of bringing it to your desk.
This is useful for patterns like logging or streaming where data is written once and rarely read again. Pulling that data into the cache would just evict more useful lines.
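To tie both write decisions together, here’s a deliberately tiny model in C: a one‑line “cache” in front of a 16‑word fake RAM. All names and sizes are invented for illustration; a real cache tracks many lines and much more state.

```c
#include <stdbool.h>
#include <stdint.h>

enum hit_policy  { WRITE_THROUGH, WRITE_BACK };
enum miss_policy { WRITE_ALLOCATE, NO_WRITE_ALLOCATE };

#define RAM_WORDS 16
static uint32_t ram[RAM_WORDS];          /* the "archive room" */

struct toy_cache {                       /* a single-line cache, for clarity only */
    bool valid, dirty;
    uint32_t addr;                       /* which RAM word the line mirrors */
    uint32_t data;
    enum hit_policy  on_hit;
    enum miss_policy on_miss;
};

/* Make room: a dirty line must be written back before the slot is reused. */
static void evict(struct toy_cache *c) {
    if (c->valid && c->dirty)
        ram[c->addr] = c->data;
    c->valid = c->dirty = false;
}

static void cache_write(struct toy_cache *c, uint32_t addr, uint32_t value) {
    bool hit = c->valid && c->addr == addr;

    if (!hit) {
        if (c->on_miss == NO_WRITE_ALLOCATE) {
            ram[addr] = value;           /* bypass the cache entirely */
            return;
        }
        evict(c);                        /* write-allocate: free the slot... */
        c->valid = true;                 /* ...and bring the line in         */
        c->addr  = addr;
        c->data  = ram[addr];
    }

    c->data = value;                     /* update the cached copy */
    if (c->on_hit == WRITE_THROUGH)
        ram[addr] = value;               /* keep RAM in sync immediately */
    else
        c->dirty = true;                 /* write-back: defer the RAM update */
}
```

In this toy model, WRITE_BACK + WRITE_ALLOCATE means repeated writes to the same word touch ram[] only once, at eviction time.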
In practice, many CPU data caches use write‑back + write‑allocate as a default combination: it keeps recently written data closer to the core and avoids unnecessary RAM traffic. Other parts of the system might use write‑through + no‑write‑allocate where simplicity or consistency matters more than raw speed.
Tiny explainer - instruction vs data cache
Instruction cache: holds the CPU instructions (the program code) it expects to run soon.
Data cache: holds the variables and data that code works on.
Separating them (at L1) avoids some types of interference.
10. Cache hierarchy: L1, L2, L3
Modern CPUs use more than one cache level:
- L1: the smallest and fastest, very close to the core
- L2: larger, a little slower, usually per core
- L3: larger again, slower than L2, typically shared across cores
L1 is often split into a data cache and an instruction cache:
- The instruction cache holds the machine code that the CPU is about to execute.
- The data cache holds the variables and data the code is working with.
When the CPU needs data, it checks each level in order, starting from the closest:
- L1: check the tiny, very fast cache
- If miss: check L2
- If miss: check L3
- If miss: go to RAM
There is one more structural choice: how multiple levels relate to each other.
An inclusive cache hierarchy ensures that if a line is in L1, it is also in L2 (and possibly in L3). That is, L2 “includes” everything in L1. This can simplify some coherence operations.
An exclusive hierarchy enforces that a line can live in only one level at a time. If a line is in L1, it is not in L2 and vice versa. This avoids duplicating the same line in multiple caches and can increase total effective capacity.
A non‑inclusive, non‑exclusive hierarchy does not enforce either property strictly. Lines may or may not be duplicated between levels, depending on what the cache controller decides is best.
For day‑to‑day work as a programmer, the important point is that the closer levels are much faster, much smaller, and much more sensitive to how your data is laid out and accessed. If the working set of your code fits in L1 or L2 and is accessed with good locality, it will run noticeably faster than if it constantly pulls data from RAM.
11. SRAM vs DRAM
So far, the differences between caches and RAM have mostly been about location and structure. There is also a hardware reason: caches and RAM are built out of different memory technologies.
Caches use SRAM (Static Random Access Memory). Main memory uses DRAM (Dynamic Random Access Memory).
An SRAM cell typically uses six transistors arranged as a tiny flip‑flop that can hold a 0 or 1 as long as power is applied. Reads and writes are direct and fast. There is no need to refresh the stored value periodically.
A DRAM cell, by contrast, uses one transistor and one capacitor. The presence or absence of charge in the capacitor represents a bit. This design is very compact and cheap per bit, which is why you can have many gigabytes of DRAM on a system.
However, capacitors leak charge over time. That means DRAM cells must be refreshed regularly: each cell is periodically read and then rewritten to restore the charge. Reading a DRAM cell is also effectively destructive: the sense amplifier drains the capacitor to determine its state and then restores it. Also, DRAM is organized as a matrix of rows and columns, and accessing it involves a two‑step addressing process (selecting a row, then a column), which adds extra delay.
SRAM doesn’t need refresh, and its access path is simpler, so it can be much faster. The cost is area: six transistors per bit occupy much more silicon than one transistor plus one capacitor.
The result is a natural split:
- Use small amounts of SRAM very close to the CPU (caches) for speed.
- Use large amounts of DRAM a bit further away (RAM) for capacity.
12. Summary
Viewed from a distance, CPU caches are a fairly simple idea worked out in a lot of detail.
Your CPU has:
- A tiny, very fast store of data close to the core (L1).
- Somewhat larger, still fast stores a bit farther away (L2, L3).
- A much larger but slower main memory (RAM).
It tries hard to keep the data it needs now and next in the close, fast levels, and uses locality of access (“recently used” and “nearby in memory”) to guess what that data will be.
Internally, caches divide their storage into lines, sets, and ways. They track which lines are in use and evict older ones when new data has to be brought in. They handle writes carefully to keep RAM consistent, sometimes delaying updates with write‑back and sometimes going straight through with write‑through. Multiple cache levels coordinate with inclusion policies. Underneath it all, caches use SRAM and RAM uses DRAM, which is why there is a speed difference in the first place.
For you as a programmer or a curious reader, the core practical ideas are:
- Data layout matters. Keeping related data close together in memory makes good use of cache lines.
- Access patterns matter. Walking sequentially through memory is much friendlier to caches than jumping around.
- Doing more work while data is “hot” (already in cache) is cheaper than revisiting it later after it has been evicted.
That’s the basic story of how CPU caches work: a hierarchy of small, fast memories trying to keep your most relevant data as close to the core as possible, so your programs spend less time waiting and more time doing actual work.
Feedback is extremely welcome! You can reach out to me on X @0xkato