What is a Cache?
A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy
For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k + 1.
In practice, since most accesses are satisfied at level k, level k + 1 is accessed less often and does not need to be as fast.
Why do memory hierarchies work?
Because of locality, programs tend to access the data at level k more often than they access the data at level k + 1.
Thus, the storage at level k + 1 can be slower, and therefore larger and cheaper per bit.
Usage of memory hierarchy?
It creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
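This "fast pool" effect can be quantified with the average memory access time (AMAT) formula: AMAT = hit_time + miss_rate × miss_penalty. A minimal sketch, with illustrative (not measured) latencies:

```python
# Average memory access time for a two-level hierarchy:
#   AMAT = hit_time + miss_rate * miss_penalty
# The numbers below are made up for illustration.
cache_hit_time_ns = 1     # fast storage near the top
ram_penalty_ns = 100      # slow storage near the bottom
miss_rate = 0.05          # locality keeps most accesses in the cache

amat = cache_hit_time_ns + miss_rate * ram_penalty_ns
print(f"AMAT = {amat} ns")  # close to cache speed, despite slow RAM
```

Even though RAM is 100× slower here, the average access costs only 6 ns, because 95% of accesses never reach RAM.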
The memory hierarchy
Register
Registers are located inside the CPU and hold the values the CPU is currently working on.
Cache
Caches are typically split into three levels: L1, L2, and L3.
- L1 Cache (Level 1)
- Built inside the CPU
- Very small (~32–64 KB)
- Extremely fast
- Holds instructions or data the CPU is using right now
- L2 Cache (Level 2)
- Slightly bigger (~256 KB–1 MB)
- Slower than L1 but still fast
- Stores data/instructions that may be needed soon
- L3 Cache (Level 3)
- Shared among CPU cores (~4–32 MB)
- Slower than L2, bigger
- Stores data that might be reused by any core
All of these caches (L1, L2, and L3) are made of so-called SRAM.
SRAM?
SRAM stands for Static Random Access Memory.
RAM is the computer's main memory. Let's break down why this kind is called Static RAM: SRAM uses a flip-flop (a small circuit of transistors) to store each bit.
Once a bit is stored, it stays there as long as power is on, without needing to be refreshed.
That's why it's called "static": the stored value does not change unless the CPU writes to it.
Main Memory
Main memory is the computer's primary workspace where programs and data live while the computer is running.
- It's larger than registers and caches.
- Usually made of DRAM (Dynamic RAM)
- Volatile: loses data when power is off.
DRAM?
DRAM stores each bit as charge on a tiny capacitor. The capacitor leaks charge over time, so each bit must be refreshed periodically to keep its value. That's why it's called "dynamic": the stored charge keeps decaying unless refreshed.
Understanding CPU → Cache → RAM Flow
When a program runs, the CPU constantly reads and writes data from memory.
However, RAM is much slower than the CPU.
To reduce this speed gap, computers use cache memory — a small, very fast memory located close to the CPU.
Step 1: CPU Requests Data
When the CPU needs data:
- It requests a specific memory address in RAM.
- But before accessing RAM, it checks the cache first.
Why?
Because cache is much faster than RAM.
Step 2: How the Cache is Checked
The method depends on the cache architecture.
1) Direct Mapped Cache
- Each RAM address maps to exactly one specific cache block.
- The CPU checks only that one block.
- Simple and fast.
- But collisions can occur.
Analogy:
"This book must always go on shelf #5."
2) Fully Associative Cache
- Any RAM address can be stored in any cache block.
- The cache may have to compare the tag against every block.
- More flexible.
- More complex hardware.
Analogy:
"You can put the book on any empty shelf."
3) Set-Associative Cache
- Each RAM address maps to one specific set.
- Within that set, the data can be stored in multiple blocks.
- The CPU searches only within that set.
- A balance between the two extremes.
Analogy:
"Your book belongs in Section A, but you can choose any shelf inside Section A."
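The three placement policies above differ only in which cache slots are candidates for a given address. A minimal sketch, assuming a hypothetical cache of 8 blocks with 2-byte blocks:

```python
# Which cache slots can hold the block containing a given address?
# Sizes are hypothetical, chosen only for the demo.
NUM_BLOCKS = 8
BLOCK_SIZE = 2

def direct_mapped_slot(addr):
    block = addr // BLOCK_SIZE
    return [block % NUM_BLOCKS]            # exactly one candidate slot

def fully_associative_slots(addr):
    return list(range(NUM_BLOCKS))         # any slot will do

def set_associative_slots(addr, ways=2):
    num_sets = NUM_BLOCKS // ways
    s = (addr // BLOCK_SIZE) % num_sets    # the address picks one set...
    return [s * ways + w for w in range(ways)]  # ...any way inside that set

print(direct_mapped_slot(7))     # one slot: [3]
print(set_associative_slots(7))  # two slots in the same set: [6, 7]
```

Direct-mapped returns a single slot (shelf #5), fully associative returns all of them (any shelf), and set-associative returns every way of one set (any shelf in Section A).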
Step 3: Cache Hit vs Cache Miss
Cache Hit ✅
- Data is found in cache.
- CPU uses it immediately.
- Very fast.
Cache Miss ❌
- Data is not in cache.
- CPU fetches it from RAM.
- The data is copied into the cache.
- Then the CPU uses it.
- Slower than a hit.
Why This Works
Programs tend to reuse data frequently (this is called locality).
Because of locality:
- Frequently used data stays in fast cache.
- Less-used data stays in larger, slower RAM.
This design allows computers to:
- Feel as fast as the small, fast memory at the top
- While storing as much data as the large, slow memory at the bottom
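The effect of locality on hit rate can be demonstrated with a tiny simulated cache. A sketch using an LRU cache of 4 entries (capacity and traces are made up for the demo):

```python
from collections import OrderedDict

# Tiny LRU cache to show why locality makes hierarchies work.
def hit_rate(trace, capacity=4):
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

# A loop reusing 3 addresses (strong temporal locality):
print(hit_rate([0, 1, 2] * 10))   # 0.9 -- only the first 3 accesses miss
# Every address distinct (no locality at all):
print(hit_rate(list(range(30))))  # 0.0 -- every access misses
```

With locality, 27 of 30 accesses hit the small cache; without it, every access falls through to the slow level.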
🔹 Cache Line Structure
Each cache line (a single “slot” in the cache) has three parts:
| Part | Meaning |
|-----------|---------|
| Valid Bit | 1 if the line contains valid data, 0 if empty |
| Tag | Identifies which RAM block this line belongs to |
| Data Block (2^b bytes) | The actual data stored; size depends on block size |
- b = number of block offset bits → tells CPU which byte inside the block to read
- Example: if b = 3 → 2³ = 8 bytes/block → offsets 0 to 7
Key point: The CPU only cares about addresses and bytes.
It reads as many bytes as the data type size (int = 4 bytes, char = 1 byte), but cache just provides bytes starting from the offset.
🔹 Address Breakdown
For a memory address, we split it like this:

| Tag | Set Index | Block Offset |
- Block Offset (b bits) → which byte inside the block
- Set Index (s bits) → which set in the cache the block belongs to
- Tag (t bits) → identifies which block in RAM this is
Example:
- Address = 7 (0111)
- Cache block size = 2 bytes → b = 1
- 2 sets → s = 1
- Remaining bits → t = 2 (tag)
Tag = 01, Set Index = 1, Block Offset = 1
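The field split can be sketched with a few bit operations; the call below reproduces the worked example (address 7 with b = 1, s = 1):

```python
# Split an address into tag / set index / block offset,
# following the | tag | set index | block offset | layout above.
def split_address(addr, s_bits, b_bits):
    offset = addr & ((1 << b_bits) - 1)               # lowest b bits
    set_index = (addr >> b_bits) & ((1 << s_bits) - 1)  # next s bits
    tag = addr >> (b_bits + s_bits)                   # remaining high bits
    return tag, set_index, offset

# Address 7 = 0b0111, 2-byte blocks (b = 1), 2 sets (s = 1)
print(split_address(7, s_bits=1, b_bits=1))  # (1, 1, 1): tag 01, set 1, offset 1
```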
🔹 Cache Simulation Example (Direct-Mapped)
Setup:
- Memory = 16 bytes → 4-bit addresses
- Block size = 2 bytes → b = 1
- Sets = 4 → s = 2
- 1 block per set → Direct-Mapped
- Address trace: 0, 1, 7, 8, 0
Hit/Miss Result:
| Address | Set | Tag | Hit/Miss |
|---------|-----|-----|----------|
| 0 | 00 | 0 | Miss |
| 1 | 00 | 0 | Hit |
| 7 | 11 | 0 | Miss |
| 8 | 00 | 1 | Miss |
| 0 | 00 | 0 | Miss |
Notes:
- Overwriting a block in a set is a collision: here, address 8 evicts the block holding addresses 0–1, so the final access to 0 misses again
- Cache Hit: data already in cache → fast
- Cache Miss: data not in cache → fetch from RAM → slower
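The direct-mapped trace above can be checked with a short simulator. A sketch using the stated parameters (b = 1, s = 2, one line per set):

```python
# Direct-mapped cache simulator for the trace above.
# 16-byte memory, 2-byte blocks (b = 1), 4 sets (s = 2), 1 line per set.
B_BITS, S_BITS = 1, 2
NUM_SETS = 1 << S_BITS

def simulate(trace):
    lines = [None] * NUM_SETS  # one tag slot per set; None = invalid line
    results = []
    for addr in trace:
        set_index = (addr >> B_BITS) & (NUM_SETS - 1)
        tag = addr >> (B_BITS + S_BITS)
        if lines[set_index] == tag:
            results.append("Hit")
        else:
            results.append("Miss")
            lines[set_index] = tag  # fetch from RAM, overwrite the line
    return results

print(simulate([0, 1, 7, 8, 0]))  # ['Miss', 'Hit', 'Miss', 'Miss', 'Miss']
```

Note the final miss: address 8 maps to the same set (00) as address 0, so it evicted that block, exactly the collision described above.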
🔹 Cache Simulation Example (2-Way Set-Associative)
Setup:
- Memory = 16 bytes → 4-bit addresses
- Block size = 2 bytes → b = 1
- Sets = 2 → s = 1
- Lines per set = 2 → 2-way set associative
- Address trace: 0, 1, 7, 8, 0
Step-by-step Hit/Miss:
| Address | Set | Tag | Hit/Miss |
|---------|-----|-----|----------|
| 0 | 0 | 00 | Miss |
| 1 | 0 | 00 | Hit |
| 7 | 1 | 01 | Miss |
| 8 | 0 | 10 | Miss |
| 0 | 0 | 00 | Hit |
Notes:
- Each set can hold 2 blocks, so collisions are less frequent than Direct-Mapped
- Cache keeps track of valid bit + tag + data block
- Block offset tells where inside the block to read the byte
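The 2-way trace can be verified the same way. A sketch with the stated parameters (b = 1, s = 1, 2 lines per set); the notes above do not name a replacement policy, so LRU is assumed here (it does not matter for this short trace, since no set overflows):

```python
# 2-way set-associative cache simulator for the trace above.
# 16-byte memory, 2-byte blocks (b = 1), 2 sets (s = 1), 2 lines per set.
B_BITS, S_BITS, WAYS = 1, 1, 2
NUM_SETS = 1 << S_BITS

def simulate(trace):
    sets = [[] for _ in range(NUM_SETS)]  # each set holds up to WAYS tags, LRU order
    results = []
    for addr in trace:
        set_index = (addr >> B_BITS) & (NUM_SETS - 1)
        tag = addr >> (B_BITS + S_BITS)
        ways = sets[set_index]
        if tag in ways:
            results.append("Hit")
            ways.remove(tag)
            ways.append(tag)      # move to most-recently-used position
        else:
            results.append("Miss")
            if len(ways) == WAYS:
                ways.pop(0)       # evict least recently used (assumed policy)
            ways.append(tag)
    return results

print(simulate([0, 1, 7, 8, 0]))  # ['Miss', 'Hit', 'Miss', 'Miss', 'Hit']
```

Unlike the direct-mapped run, the final access to 0 hits: set 0 holds both tag 00 (addresses 0–1) and tag 10 (address 8) at once, so address 8 no longer evicts address 0's block.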