If you’re reading this via email, I recommend reading in browser to see the visualisations
In this week’s missive I want to talk about PagedAttention, the memory optimisation innovation in vLLM. If you’ve never heard of vLLM, it’s an LLM inference server that’s used to efficiently serve models. Now, PagedAttention is very like virtual memory in operating systems, and indeed the vLLM paper makes that exact comparison. It would be a perfect fit for my “AI infrastructure is recreating computer science fundamentals” thesis, except there are already about half a million posts telling you that PagedAttention is like virtual memory.
So I’m going to do something more interesting. We’ll start off by seeing why serving LLMs is memory intensive (stand up, KV cache!) and, yes, understand how PagedAttention uses ideas from OS virtual memory systems to optimise memory usage. But I want the bulk of this post to focus on how this optimisation entails several interesting consequences that don’t get as much attention. For instance: how you get virtual memory’s payoff without the hardware that usually makes it fast, why the very same sharing that saves memory can leak one user’s prompt to another, and how the cache turned into something you can share across machines (and why you’d care).
By the end of this post you’ll have a good grasp of the computer architectural issues around serving LLMs, the optimisations coming into use and how they link back to existing computer science patterns.
Respect the KV cache
You probably know that large language models can be very… large. Frontier models have billions or trillions of parameters, all needing storage. Even a smallish model intended for local inference is going to require at least a few gigabytes of memory. But when the model starts generating output, memory is swallowed by another source.
Due to decoder models using masked self-attention, each new token attends to every token before it. Attention needs a key and a value vector for every earlier token in every layer of the network. These vectors never change once computed, so the obvious optimisation is to memoise – compute once and store for reuse. That is the KV cache and it gets big quickly.
For a worked example, Llama 3 8B has 32 layers and, because it uses grouped-query attention, just 8 key-value heads per layer, each storing a key and a value vector of length 128 at two bytes per element. That’s 128 KB of cache per token, adding up to half a gigabyte for a single 4,096-token conversation. An 80-layer 70B model is closer to 320 KB per token. Imagine this during a long, agentic workflow reading lots of files into context and you can see how the KV cache will rapidly balloon in size with no warning of when it’ll stop.
Managing the KV cache is critical for LLM inference performance because it limits throughput. A GPU generating text spends most of its time moving weights rather than doing maths. Every decoding step streams all 16 GB of Llama’s 8 billion weights past the compute units, producing one token for each running sequence. Serve one user’s sequence and you get one token. Serve forty users in one batch and you get forty tokens in the same time. Generation is bound by memory bandwidth, not compute, and the same weight-read serves the whole batch at once. So the more sequences you fit in memory, the more tokens each read produces, and your throughput rises with the batch size. The catch is that the batch can only be as large as the KV cache has room for, so every byte you waste on fragmentation is a sequence you can’t run and a user you can’t serve.
That’s why the vLLM team thought memory optimisation was worth fixing with a whole new memory manager in software. PagedAttention is virtual memory for the KV cache. Before vLLM, servers stored each request’s KV cache in a contiguous chunk of GPU memory sized for the longest reply allowed. A request that stopped after 100 tokens kept its full allocation anyway, and requests finishing at different times left gaps too small to use, so only 20–40% of cache memory held live tokens.
Operating system engineers met exactly this problem way back in the 1960s and solved it with paging. For a fuller treatment read The Computer Science Book, but very briefly the solution is: chop memory into fixed-size pieces (pages), let any piece sit anywhere, and keep a per-program page table translating the addresses the program sees into wherever its bytes actually are. The program thinks its address space is arranged consistently in memory while, behind the scenes, the OS is free to rearrange pages to optimise memory usage and reduce fragmentation.
vLLM does the same to the KV cache: virtual blocks of sixteen tokens, a pool of physical blocks, and a per-sequence block table mapping virtual token positions to their physical counterparts. Fragmentation falls from 60–80% of memory to near zero. The only unused memory is the unused tail of each request’s last block.
Watch the visualisation below to understand how it works. You see the same GPU serving the same request stream. On the left is a naive memory allocator and on the right is a vLLM-style paged allocator:
Both have the same users making the same requests with the same arrival times. The naive allocator strands memory inside oversized reservations, so fewer sequences fit and the user queue grows. The paged allocator packs them in. Watch the “tokens generated” counters to see the throughput gap widen. Note that the vLLM allocator consistently wastes less memory.
This is where most explainers stop. But vLLM didn’t just reuse an allocation scheme. Adopting virtual memory entails a whole set of interesting consequences. Let’s look at them in turn.
Indirection cuts both ways
Start with the cost side. Virtual memory is a good idea and the vLLM team deserve praise for implementing it. Pretty much all their competitors adopted similar schemes, which is about the highest praise there is. You might think that such an obvious idea should have been done before. But it’s not so simple!
As the architecture chapter of The Computer Science Book explains, virtual memory leans on a big blob of dedicated circuitry called the memory management unit, or MMU. GPUs have MMUs too. The catch is that an MMU only translates memory addresses, and vLLM’s paging isn’t expressed in addresses at all. So how did they make it work?
Get the free 45-page CS roadmap
Subscribe and I'll send you a free, 45-page roadmap through computer science — what to learn, in what order, and what to skip — plus the occasional CS deep dive.
No spam. Unsubscribe anytime.
The CPU’s MMU walks the page table in hardware, and a small translation cache, the translation lookaside buffer (TLB), makes the common case extremely fast. CPUs dedicate quite a lot of die space to this because paying a table lookup on every memory access would be a disaster for performance.
vLLM’s block table gets no such help. Mapping “sequence 12, token 300” to a physical block isn’t an address translation at all, and address translation is the only thing an MMU does. So the attention kernel (i.e. the code executing on the GPU) has to do the lookup itself, in software, while gathering keys and values from blocks scattered across the pool. Implementing this in a way that doesn’t immediately fritter away the throughput gains is vLLM’s real innovation. vLLM ships a custom CUDA kernel that folds the block table walking into the attention maths in an efficient way.
What’s innovative about vLLM’s PagedAttention isn’t so much that they used virtual memory. It’s that they did the translation themselves in software, mapping tokens to blocks rather than leaning on the GPU’s own address-level virtual memory, and still made it fast.
They did have to keep things simpler than the original, though. OS page tables long ago grew into multi-level trees, because that’s a more efficient way to handle programs’ enormous, mostly empty, address spaces. A token sequence’s cache is small and densely packed from position zero, so vLLM’s block table can get away with using a flat array.
Now for the benefit side. Virtual memory (whether in PagedAttention or an OS) involves some indirection between data and its address. The program sees the “virtual” address and the system maps it to the physical address. The indirection makes some optimisations, such as memory sharing, trivially easy. Because the mapping lives in a table, two sequences’ block tables can point at the same physical block without any interference. Operating systems have exploited this for donkey’s years. Run the same program fifty times and the OS will load the program into memory once. Code pages are read-only and so all fifty page tables can read the same memory without interfering with each other.
vLLM does the same trick with prompts. Let’s take the example of serving a chatbot. Of course, each user will interact with the chatbot differently and create their own conversation. But every conversation will start with the same (probably quite long) system prompt provided to the chatbot. Since a token’s keys and values depend only on the tokens before it, the same system prompt will generate the same KV cache entries. So vLLM computes the system prompt KV vectors once, caches them and points the first entries of every block table at the same physical blocks. Each block carries a reference count, and a block returns to the pool when the last sequence using it finishes. Everyone uses the same system prompt cache.
Writes are where things get more complicated since they break sharing. vLLM’s solution is copy-on-write (COW), lifted straight from the venerable fork() system call. When a process forks, the OS doesn’t immediately copy its entire address space because doing so would pointlessly duplicate a lot of memory. Parent and child share every page read-only. A write triggers a copy only of the single page written. vLLM hits a similar situation when you ask for several candidate completions for one prompt (e.g. for parallel sampling or beam search). The candidates share the prompt’s blocks and diverge as they generate. When one writes into a block that others still reference, vLLM checks the reference count, copies just that block, and creates a separate copy for the writer. The vLLM paper finds that doing so saves 6–30% cache memory for parallel sampling and 38–66% for beam search.
Here’s another visualisation. Step through to see how the blocks are shuffled around as B makes a write:
Incidentally, that same fork() machinery has tripped up the Linux kernel before. Dirty COW, a race condition in Linux’s copy-on-write handling disclosed in 2016, sat undetected for nine years and became one of the most exploited privilege-escalation bugs in the kernel’s history!
When the blocks run out
Adopting virtual memory means new failure modes for vLLM to handle. When physical RAM fills up, an operating system doesn’t just give up. It evicts pages from memory. Some page that hasn’t been touched for a while gets written out to disk and its space reused. If that page is needed again, the access faults and the OS just reads it back in. vLLM’s block pool empties the same way when too many sequences are generating at once, and it responds the same way. The scheduler picks a victim sequence and preempts it, either swapping its blocks out to CPU memory or simply dropping them and recomputing the cache from the victim’s tokens when its turn comes round again.
vLLM even improves a little in one place. An operating system evicts a single page at a time and has to guess which one it’ll least regret evicting. vLLM knows that every block of a sequence is read together on every step, so it doesn’t have to guess. It evicts all of a sequence’s blocks or none of them. There’s no point keeping a user’s conversation partially in memory, is there?
The nastier inheritance is thrashing. Above I said that the OS “just reads back in” a page it needs. When memory becomes too scarce, the OS can end up spending the vast majority of its time swapping pages in and out of memory. The system will rapidly grind to a halt. Push vLLM past its block pool and it does the same, constantly swapping and recomputing instead of generating new tokens. vLLM offers a set of flags to manage this: limit how many jobs run at once (--max-num-seqs), give the pool more memory (--gpu-memory-utilization), or shrink each job’s working set (--max-model-len).
The flags might have new names but the constraints they manage have been around for decades. This is yet another example of why it’s useful to know the fundamentals well. It tells you how a new system will behave before you’ve even read a line of its code!
vAttention takes the idea further
Go back to the cost side. The GPU’s MMU translates addresses, not vLLM’s token-to-block tables, so vLLM does that translation itself in software, inside a hand-written CUDA kernel. That kernel is the clever bit, but it has its disadvantages. The attention kernel is one of the most heavily optimised pieces of code in all of machine learning, and every new trick (a faster FlashAttention, a sharper quantisation scheme) has to be re-implemented to understand the block table before vLLM can use it. Paging leaked out of the memory manager and into every kernel that touches the cache.
In 2024 a group at Microsoft Research looked at this and asked the obvious question. The GPU’s own virtual memory was sitting right there the whole time, so why not use it? Their system, vAttention, gives each sequence a contiguous run of GPU virtual addresses and uses CUDA’s low-level paging APIs to map physical memory underneath them on demand. Because the cache is contiguous in virtual address space, locating a token is straightforward, and the messy virtual-to-physical step is exactly what the MMU already does in hardware. The block table goes away, the custom kernel goes away, ordinary attention kernels run unmodified, and they report generating tokens up to twice as fast as vLLM in places.
So why didn’t vLLM just do this from the start, if it’s so straightforward? It’s partly because PagedAttention came first, and perhaps the hardware route is only obvious with hindsight. But it’s mostly because the hardware support isn’t quite ideal. A software block size is just a number you can tune, whereas the GPU’s virtual-memory API only deals in fixed page sizes, the smallest practical one being a hefty 2 MB. For many models that’s way too big, so vAttention only matches PagedAttention by patching the CUDA driver to allocate smaller pages.
Sharing the cache can leak it
There’s a dangerous security consequence to all this sharing that is rarely mentioned when discussing PagedAttention. The same prefix reuse that makes vLLM fast can expose one user’s prompt to another.
We saw above that sharing works because identical input produces an identical KV cache, so vLLM and most of its competitors will reuse a cached prefix across requests rather than recompute it. A reused prefix is faster. The first token comes back sooner because the prompt’s cache was already sitting in memory. That timing difference is observable and forms an information side channel.
The attack works like so. A patient attacker sends candidate prefixes and watches the time to the first token. A fast reply means the prefix was already cached, which means somebody else recently sent it. Probe a token at a time and you can reconstruct the cached prompt. A 2025 paper doing exactly this against multi-tenant servers, vLLM among them, rebuilt prompts with up to 99% accuracy when the attacker knew the template and 95% without.
This form of attack may sound familiar. Operating systems had the same hole years ago. To save RAM, hypervisors deduplicate identical memory pages across the virtual machines they host and mark the remaining one copy-on-write. Writing to a deduplicated page is a little slower, because the write triggers a copy, and that delay is measurable. Back in 2011 researchers used it to work out which programs were running inside a co-resident VM.
In both cases, the attack was possible because memory was shared to save space, and the timing of a shared-versus-private access is a side channel that reveals what your neighbour is holding. The fix is also similar. Operating systems let you turn deduplication off or scope it so pages only merge inside a trust boundary. Inference servers are learning to scope prefix reuse the same way, so that a cached prompt is only ever reused for the same account.
Safe data sharing is a problem, as old as the hills, that the LLM world is rediscovering one incident at a time.
The cache becomes something you can move
A non-obvious advantage of using a clean block structure is that it makes the KV cache more portable. Once a sequence’s cache is a list of fixed-size blocks rather than one contiguous blob, it becomes much easier to move those blocks around.
The first optimisation is splitting inference in two. The initial processing of a prompt (a.k.a prefilling – I’ll write about this in future) is compute-heavy and bursty. Generating tokens afterwards is memory-bandwidth-heavy and steady. The two would ideally run on different hardware, and newer systems do in fact just do that. They run prefill and decode on separate machines and ship the KV cache from the prefill box to the decode box over a fast interconnect.
The second optimisation is treating the whole system’s memory hierarchy as fair game. Keep hot blocks in GPU memory, spill warm ones to CPU RAM (a bit slower), and park cold ones on SSDs (much slower), exactly the way an OS pages between RAM and disk. Moonshot AI’s Mooncake, which serves their Kimi assistant, builds the entire system around a disaggregated pool of KV cache stitched together from the spare DRAM and SSDs of the cluster, and reports handling up to several times more traffic under the same latency budget.
None of those ideas are new. Sending a sequence’s cache to whichever machine will process it fastest is just a new form of page migration on a NUMA box. Spilling cold blocks to slower, cheaper storage is the same old memory hierarchy. Pooling cache across nodes is distributed shared memory.
vLLM’s innovation did more than just avoid memory fragmentation. The block, that little sixteen-token page, turns the KV cache from a fixed cost into something you can schedule, and a large amount of recent infrastructure work is really just deciding where each block should live.
AI infrastructure reinvents CS fundamentals
Look back over the post and almost every part of vLLM we’ve met is an old idea with a new form. The KV cache is memoisation, PagedAttention is paging, and the copy-on-write behind prompt sharing is lifted straight from fork(). Running out of blocks is swapping, and overdoing it is thrashing. vAttention repeats the decades-old argument about whether address translation belongs in software or hardware, the cache-sharing leak is a memory-deduplication side channel, and shipping caches between machines is the memory hierarchy all over again.
I don’t mean that as a criticism. Reusing good ideas is good engineering, and the vLLM authors knew precisely what they were borrowing and why it was appropriate. But it tells you where the leverage is. The people who see these systems most clearly aren’t the ones who memorised the newest vocabulary fastest. They’re the ones who already understand paging, swapping, caching and copy-on-write, so when a serving system hits a wall they recognise which decades-old idea fits the gap. AI and LLMs are extremely demanding and push computer systems to their limits. CS fundamentals are more relevant than ever.
If you want to get a better understanding of the nitty-gritty, I recommend taking a look at tiny-vllm. It builds a paged KV cache and the attention kernel that reads through it in a couple of thousand lines of C++ and CUDA. It’s pretty readable (by C++/CUDA standards).
And, of course, if you want the foundations the whole thing stands on, that’s what the book is for: the operating systems chapter on paging, swapping and scheduling, and the computer architecture chapter on caches and the memory hierarchy. Learn those once and you’ll keep spotting them, in vLLM and in whatever the next inference server decides to borrow.
Get the free 45-page CS roadmap
Subscribe and I'll send you a free, 45-page roadmap through computer science — what to learn, in what order, and what to skip — plus the occasional CS deep dive.
No spam. Unsubscribe anytime.