Andrew Abok

Where Does Performance Go When Serving an LLM

Introduction

When we first integrated LLMs, we assumed the primary constraint would be compute: hardware memory, model size, or inference speed. These are real hindrances, but mainly at larger scale. They are not the first bottleneck.

The real first constraint is context length. As context windows grow, the number of tokens in a request grows too, and with it:

  • Memory usage increases
  • Latency increases
  • Compute cost increases
  • KV cache grows
  • Hardware utilization spikes

This behavior is not accidental. It is a direct consequence of how transformers compute attention. To understand why this is important, we revisit the fundamentals of attention.

Large language models are based on the transformer architecture. I'll deviate just a little into the training step to paint a picture, literally and metaphorically, of how this happens. Building an LLM has three stages. Our focus here is on the image, so we can see where the attention mechanism sits.

  1. Implementing the data sampling and understanding the basic mechanism (we will get back to this)
  2. Pre-training the foundation model
  3. Fine-tuning into either a classifier or a personal assistant (i.e. with an instruction dataset)

[Figure: the stages of building an LLM]

What the Transformer Is Actually Doing

As we have seen, modern LLMs are built on the transformer architecture. At inference time, the pipeline is conceptually simple:

Tokens → Embeddings → Multi-Layer Attention → Output Tokens

The expensive part is not tokenization. It is not embedding lookup. It is attention.

Attention allows the model to compute relationships between all tokens in a sequence. Given projected embeddings, it is computed as:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

where:

$Q$ = Query matrix

$K$ = Key matrix

$V$ = Value matrix

$d_k$ = dimension of key vectors

The critical operation inside this equation is $QK^T$, and this is where cost explodes.
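To make the equation concrete, here is a minimal single-head sketch in plain Python. The toy matrices and the helper names (`softmax`, `attention`) are assumptions for illustration, not code from the service. Real models run this as batched GPU matrix multiplications.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of rows."""
    d_k = len(K[0])
    # QK^T: one score per (query, key) pair, so n queries x n keys = n * n scores.
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value rows.
    n_dims = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(n_dims)]
        for w_row in weights
    ]
    return output, weights

# Toy input: n = 3 tokens, d_k = 2 (values are made up for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
print(len(weights), "x", len(weights[0]))  # shape of the score matrix: n x n
```

The nested loop that builds `scores` is the $QK^T$ term: its size depends only on the number of tokens, not on what the tokens say.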

Why Context Length Becomes Expensive

Suppose we send n tokens to the model. After embedding:

  • Q has shape $n \times d$
  • K has shape $n \times d$

So when we compute attention, $QK^T$ produces an $n \times n$ matrix. That means:

\[\text{Compute complexity} = O(n^2)\]

What this shows is that every token attends to every other token. For example:

  • n = 1,000 → 1,000,000 pairwise interactions
  • n = 8,000 → 64,000,000 interactions
  • n = 16,000 → 256,000,000 interactions

And this happens per layer. With 32 layers, the cost multiplies by 32. This is why large prompts are expensive.
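The arithmetic above can be reproduced in a few lines. This is only a counting sketch: `attention_interactions` is a hypothetical helper, and it counts score-matrix entries, not actual FLOPs.

```python
def attention_interactions(n_tokens, n_layers=1):
    """Pairwise token interactions for full self-attention: n^2 per layer."""
    return n_tokens * n_tokens * n_layers

for n in (1_000, 8_000, 16_000):
    per_layer = attention_interactions(n)
    total = attention_interactions(n, n_layers=32)
    print(f"n={n:>6,}: {per_layer:>13,} per layer, {total:>15,} across 32 layers")
```

Doubling the prompt from 8,000 to 16,000 tokens quadruples the count, which is the practical meaning of $O(n^2)$.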

Prefill vs Decode Phases in Inference

When the model receives input tokens, inference occurs in two main phases:

  1. Prefill Phase
    • Processes all input tokens in parallel.
    • Generates the initial KV cache for attention.
    • Highly hardware efficient; produces the first output token.
  2. Decode Phase
    • Autoregressively generates tokens one at a time.
    • Uses KV cache from prefill to compute attention over all previous tokens.
    • Primary source of latency for long context windows, since each step reads the KV cache for every previous token.
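A rough way to compare the two phases is to count attention scores per layer and per head. The sketch below assumes a causal mask and a KV cache, and it ignores the feed-forward layers and memory traffic, so treat it as an illustration of the shape of the work rather than a latency model. The function names are hypothetical.

```python
def prefill_scores(prompt_len):
    """Scores computed in prefill with a causal mask: token i attends to tokens 0..i."""
    return prompt_len * (prompt_len + 1) // 2

def decode_scores(prompt_len, new_tokens):
    """Scores computed across all decode steps when K and V come from the cache.

    Decode step t (1-based) scores one new query against prompt_len + t keys.
    """
    return sum(prompt_len + t for t in range(1, new_tokens + 1))

prompt_len, new_tokens = 1_000, 100
print("prefill:", prefill_scores(prompt_len), "scores in one parallel pass")
print("decode: ", decode_scores(prompt_len, new_tokens), "scores over", new_tokens, "sequential steps")
```

Prefill does more arithmetic, but in a single parallel pass. Decode does less per step, yet the steps cannot be parallelized across the output, and each one has to read the whole cache.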

Types of Attention

We touch on this because, over time, different attention mechanisms have evolved to try to improve this performance, with multi-head attention being the current go-to approach.

There are different attention mechanisms:

  1. Self-attention
  2. Causal attention
  3. Multi-head attention

From a computational perspective, they all compute $QK^T$, hence the quadratic scaling remains. Multi-head attention simply performs this multiple times in parallel, which is why it is preferred, and it also improves representation capacity. But it does not reduce cost; if anything, it increases it.
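A small sketch of why the variants share the same scaling: a causal mask hides the upper triangle of the score matrix, and multi-head attention keeps one score matrix per head, but the count still grows with $n^2$. The helper names and the 32-head figure are assumptions for illustration.

```python
def causal_mask(n_tokens):
    """mask[i][j] is True when query token i is allowed to attend to key token j."""
    return [[j <= i for j in range(n_tokens)] for i in range(n_tokens)]

def score_matrix_entries(n_tokens, n_heads=1, causal=False):
    """Count the attention scores that contribute to the output."""
    if causal:
        per_head = sum(sum(row) for row in causal_mask(n_tokens))
    else:
        per_head = n_tokens * n_tokens
    return per_head * n_heads

# Print a 4-token causal mask: "x" marks an allowed (query, key) pair.
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))

print(score_matrix_entries(1_000))                            # self-attention, one head
print(score_matrix_entries(1_000, causal=True))               # causal, one head
print(score_matrix_entries(1_000, n_heads=32, causal=True))   # causal, 32 heads
```

Causal masking roughly halves the entries, which changes the constant but not the quadratic growth.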

Modern serving engines like vLLM introduce paged attention to reduce memory fragmentation and manage KV cache more efficiently, but they do not eliminate the fundamental O(n²) attention scaling.

Real Bottlenecks in Serving LLMs

When we deploy LLM systems, the first bottleneck shows up during autoregressive generation. Each produced token stores:

  • Key vectors
  • Value vectors

This is the KV cache.

So as conversations grow longer:

  • KV cache grows linearly
  • Memory pressure increases
  • Inference slows

Combined with quadratic attention scaling, long context becomes extremely expensive. This was the bottleneck encountered while writing the service.
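The linear growth of the KV cache can be estimated with simple arithmetic: one key vector and one value vector per token, per layer, per head. The configuration below is a hypothetical 7B-class model in fp16, chosen only to make the numbers tangible. It is not the model used in the service.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, per layer, per head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Hypothetical configuration, for illustration only (2 bytes per value = fp16).
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

for n in (1_000, 8_000, 16_000):
    gib = kv_cache_bytes(n, **config) / 2**30
    print(f"{n:>6,} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumed numbers each token costs 512 KiB, so a 4,096-token conversation holds 2 GiB of cache, and that memory is needed for every sequence being served concurrently.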

How This Affected the LLM Service

In the LLM service, prompts are built dynamically from:

  • Conversation memory
  • Retrieved documents
  • Structured data
  • The user’s question

Every additional token expands the attention matrix, grows the KV cache, and increases latency along with the API cost. For the attention matrix, this is not linear growth. It is quadratic growth.

The transformer does not know what is important. It blindly computes attention across all tokens we send. That realization is what led to this research and exploration.

Why Retrieval Is Necessary

Instead of letting the transformer attend to the entire analytics output, the entire conversation history, and the full docs, we perform an embedding-based similarity search and a top-k selection. Mathematically, given a query vector $q$ and document vectors $k_i$, we compute:

\[\arg\max_{i} \, (q \cdot k_i)\]

This performs a kind of external sparse attention, and thus reduces n, $n^2$, the KV cache size, and the cost.
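A minimal sketch of the top-k selection, using dot-product scores over made-up 3-dimensional vectors. In a real system the vectors would come from an embedding model and the search would usually go through a vector index. The documents, vectors, and helper names here are all assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents with the highest dot-product score."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Made-up embeddings; a real system would get these from an embedding model.
docs = ["refund policy", "shipping times", "api rate limits", "refund form"]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.8, 0.2, 0.1]]
query_vec = [1.0, 0.0, 0.1]  # stands in for the embedded user question

selected = top_k(query_vec, doc_vecs, k=2)
print([docs[i] for i in selected])
```

Only the selected documents reach the prompt, so n is decided by k and the document lengths, not by the size of the corpus.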

The takeaway here is that the transformer does not know what is important. It blindly computes attention across all tokens. An LLM system must filter, compress, and prioritize before sending tokens to the model.
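One way to sketch the filter-and-prioritize step is to pack snippets into a fixed token budget, highest priority first. This is an illustrative assumption, not the service's actual prompt builder: `count_tokens` is a stand-in that counts whitespace-separated words, and the priorities are made up.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words (not a real tokenizer)."""
    return len(text.split())

def build_prompt(question, candidates, max_tokens):
    """Pack the highest-priority snippets into a fixed token budget.

    candidates: list of (priority, text) pairs; higher priority is packed first.
    """
    budget = max_tokens - count_tokens(question)
    chosen = []
    for _, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if cost <= budget:  # skip anything that would exceed the budget
            chosen.append(text)
            budget -= cost
    return "\n".join(chosen + [question])

question = "How do I get a refund?"
candidates = [
    (0.9, "Refunds are issued within 14 days."),
    (0.2, "Our office dog is named Biscuit."),
    (0.7, "Refund requests need an order number."),
]

prompt = build_prompt(question, candidates, max_tokens=20)
print(prompt)
```

The budget caps n up front, so the attention cost of a request is bounded no matter how much memory or documentation is available to draw from.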

Why Infrastructure Optimizations Are Not Enough

Techniques like batching, quantization, speculative decoding, and paged attention improve throughput and hardware utilization. But none of them change the fundamental cost of attention:

\[O(n^2)\]

If context grows uncontrolled, no serving optimization can fully compensate. What comes out of this is that an efficient LLM system must combine:

  • Infrastructure optimization
  • KV cache management
  • Context filtering
  • Retrieval

Architecture and inference must work together.

Conclusions

One of the biggest constraints LLMs have is context length. This is a difficult problem to solve, as memory usage rises significantly with longer context windows in current transformer architectures.

To serve large language models efficiently, it is essential for businesses to reduce costs and increase generation speed.

Even with batching, the quadratic attention cost per request remains. Batching amortizes weight loading and improves throughput, but it does not solve uncontrolled context growth.
