Inference Decode Is Not Just a Compute Problem

AI infrastructure performance is often framed as a compute problem: more FLOPS should deliver faster inference, lower costs, and greater serving capacity.

For token-by-token decoding, that assumption is incomplete.

Tolaga Research used its AI Infrastructure Simulation Platform to examine how a 70B-parameter model behaves across different batch sizes and sequence lengths. The simulator models compute, memory bandwidth, KV-cache growth, tensor-parallel synchronization, and workload concurrency.

The platform does not replace empirical benchmarking or production profiling. It provides a structured way to identify where bottlenecks emerge and how they shift as workloads change.

The analysis shows that inference decoding is often constrained less by arithmetic throughput than by memory bandwidth, memory capacity, KV-cache movement, and synchronization. The compute may be available, but the system cannot always move data quickly enough to use it.

This matters because inference differs from the compute-intensive workloads that shaped traditional accelerator design. It is latency-sensitive, often operates at modest batch sizes, and repeatedly accesses model weights and cached state as each token is generated.

[Figure: Memory-to-compute transition during inference decoding]

What the simulation shows

The figure above shows how arithmetic intensity changes with batch size for short-, medium-, and long-context decoding workloads.

At low batch sizes, decoding is predominantly memory-bound. There is too little arithmetic work per byte of data moved to keep the accelerator's compute engines fully occupied.

Arithmetic intensity rises as batch size increases because more requests share each read of the model weights. In the simulation, short-context workloads approach compute-bound operation at a batch size of approximately 256.

Medium- and long-context workloads remain memory-bound for longer. Their larger KV caches consume memory capacity, increase data movement, and limit the number of requests that can be processed concurrently.
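The relationship described above can be approximated with a back-of-envelope roofline calculation. The sketch below is not the Tolaga simulator. It assumes an illustrative 70B-class model shape with grouped-query attention (80 layers, 64 query heads, 8 KV heads, head dimension 128, 16-bit weights and cache) and a hypothetical accelerator with 1,000 TFLOPS of peak compute and 4 TB/s of memory bandwidth. All of these values are assumptions chosen for illustration, and the calculation ignores memory capacity, kernel overheads, and tensor-parallel synchronization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Illustrative 70B-class shape; every value here is an assumption."""
    params: float = 70e9
    n_layers: int = 80
    n_q_heads: int = 64
    n_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # 16-bit weights and KV cache


def decode_intensity(shape: ModelShape, batch: int, context: int) -> float:
    """Approximate FLOPs per byte moved for one decode step."""
    # Weight matmuls: about 2 FLOPs per parameter for each request in the batch.
    weight_flops = 2.0 * shape.params * batch
    # Attention: QK^T and AV each cost about 2 * context * head_dim per query head.
    attn_flops = (4.0 * batch * context * shape.n_layers
                  * shape.n_q_heads * shape.head_dim)
    # Weights are read once per step and shared by the whole batch.
    weight_bytes = shape.params * shape.bytes_per_value
    # Each request reads its own keys and values for every layer.
    kv_bytes = (2.0 * batch * context * shape.n_layers
                * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value)
    return (weight_flops + attn_flops) / (weight_bytes + kv_bytes)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity above which a workload becomes compute-bound."""
    return peak_flops / mem_bandwidth


if __name__ == "__main__":
    shape = ModelShape()
    # Hypothetical accelerator: 1,000 TFLOPS peak, 4 TB/s memory bandwidth.
    ridge = ridge_point(peak_flops=1.0e15, mem_bandwidth=4.0e12)
    for context in (512, 8192, 32768):
        for batch in (1, 16, 64, 256):
            ai = decode_intensity(shape, batch, context)
            regime = "compute-bound" if ai >= ridge else "memory-bound"
            print(f"context={context:>6} batch={batch:>4} "
                  f"intensity={ai:7.1f} {regime}")
```

Under these assumptions, the weight term alone yields an intensity roughly equal to the batch size, while the KV-cache term places a ceiling on intensity that falls as context length grows. That ceiling is one way to see why longer-context workloads stay memory-bound at batch sizes where short-context workloads do not.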

The result is a growing gap between theoretical compute capacity and deployed inference performance.

Why decode is different from prefill

Inference has two main phases: prefill and decode.

Prefill processes the input prompt. Because many input tokens can be handled in parallel, it can make relatively efficient use of accelerator compute.

Decode generates tokens sequentially. Each token depends on the state created by earlier tokens, limiting parallelism and requiring repeated access to model weights and the KV cache.

Decode performance is therefore often governed by memory bandwidth, cache efficiency, and interconnect behavior rather than peak FLOPS.
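A simple roofline estimate illustrates the contrast between the two phases. The sketch below assumes a 70B-parameter model with 16-bit weights and the same hypothetical accelerator figures (1,000 TFLOPS, 4 TB/s), treats the model as if it ran on a single device, and counts only the weight matmuls. It ignores attention, KV-cache traffic, and multi-device communication, so the numbers indicate orders of magnitude rather than measured latency.

```python
PARAMS = 70e9            # assumed parameter count
BYTES_PER_PARAM = 2      # assumed 16-bit weights
PEAK_FLOPS = 1.0e15      # hypothetical accelerator peak compute (FLOP/s)
MEM_BANDWIDTH = 4.0e12   # hypothetical memory bandwidth (bytes/s)


def roofline_time_s(flops: float, bytes_moved: float) -> tuple[float, str]:
    """Return the lower-bound time for one pass and which resource sets it."""
    compute_s = flops / PEAK_FLOPS
    memory_s = bytes_moved / MEM_BANDWIDTH
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


def prefill_pass(prompt_tokens: int) -> tuple[float, str]:
    """All prompt tokens share a single read of the weights."""
    flops = 2.0 * PARAMS * prompt_tokens
    return roofline_time_s(flops, PARAMS * BYTES_PER_PARAM)


def decode_step(batch: int = 1) -> tuple[float, str]:
    """Each step produces one token per request but rereads all weights."""
    flops = 2.0 * PARAMS * batch
    return roofline_time_s(flops, PARAMS * BYTES_PER_PARAM)


if __name__ == "__main__":
    prefill_s, prefill_bound = prefill_pass(prompt_tokens=2048)
    decode_s, decode_bound = decode_step(batch=1)
    print(f"prefill: {prefill_s / 2048 * 1e3:.3f} ms per token ({prefill_bound}-bound)")
    print(f"decode:  {decode_s * 1e3:.3f} ms per token ({decode_bound}-bound)")
```

In this simplified model, a 2,048-token prefill is limited by compute, while a single-request decode step is limited by the time needed to stream the weights from memory.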

A hospital analogy

A useful analogy is a surgical team waiting for patient records, scans, or laboratory results.

The operating theatre is available, the team is ready, and the equipment is in place. But the procedure cannot proceed until the required information arrives.

In inference decoding, the compute engines are ready, but data movement determines how effectively they can work.

Why batch size matters

Batching improves utilization by allowing multiple requests to share model-weight access. This increases arithmetic intensity and can move a workload toward compute-bound operation.

The trade-off is latency. Requests may need to wait while a batch forms, which is often unacceptable for chatbots, coding assistants, agents, and other interactive applications.

The batch sizes that maximize hardware utilization may be larger than those acceptable for a responsive user experience.
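The waiting cost can be illustrated with simple arithmetic. The sketch below assumes static batching, in which a batch launches only once it is full, and evenly spaced request arrivals. Both are simplifying assumptions: real arrival patterns are bursty, and serving systems may use other batching policies. The function name and parameters are hypothetical.

```python
def batch_formation_wait_s(batch: int, arrivals_per_s: float) -> tuple[float, float]:
    """Return (worst-case, mean) wait in seconds for a static batch to fill.

    Assumes requests arrive at evenly spaced intervals of 1 / arrivals_per_s
    and the batch launches the moment the last request arrives.
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    if arrivals_per_s <= 0:
        raise ValueError("arrivals_per_s must be positive")
    # The first request waits for the remaining (batch - 1) arrivals.
    worst = (batch - 1) / arrivals_per_s
    # Waits fall linearly from `worst` to zero, so the mean is half of it.
    return worst, worst / 2.0


if __name__ == "__main__":
    for batch in (1, 8, 64, 256):
        worst, mean = batch_formation_wait_s(batch, arrivals_per_s=16.0)
        print(f"batch={batch:>4} worst wait={worst:6.2f}s mean wait={mean:6.2f}s")
```

At a steady 16 requests per second, for example, filling a batch of 256 under these assumptions would delay the first request by roughly 16 seconds before any decoding begins, which is far outside what an interactive application can tolerate.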

Short requests are generally easier to batch. Long-context and agentic workloads support fewer concurrent requests, making efficient batching more difficult.

Why long-context inference is especially difficult

As context length increases, the KV cache grows, consuming more GPU memory and increasing memory traffic during decoding.

A small number of long-context requests can occupy as much memory as many short requests. This reduces concurrency, limits practical batch sizes, and can eventually create out-of-memory constraints.
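The memory arithmetic behind this point is straightforward. The sketch below assumes the same illustrative 70B-class shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache) and a hypothetical 40 GiB of device memory left for the KV cache after weights are loaded. These values are assumptions for illustration and do not describe any specific deployment.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache added per token: one key and one value per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(kv_budget_bytes: int, context_tokens: int,
                            bytes_per_token: int) -> int:
    """How many requests of a given context length fit in the KV budget."""
    return kv_budget_bytes // (context_tokens * bytes_per_token)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8,
                                   head_dim=128, bytes_per_value=2)
    budget = 40 * 2**30  # hypothetical 40 GiB reserved for KV cache
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    for context in (4096, 32768, 131072):
        fit = max_concurrent_requests(budget, context, per_token)
        print(f"context={context:>7} tokens -> {fit} concurrent requests")
```

With these assumed values, a single 131,072-token request consumes the same cache memory as 32 requests at 4,096 tokens each.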

This is why document analysis, research synthesis, report generation, and agentic workflows can be disproportionately expensive to serve. They change the balance between compute, memory, and latency.

Implications for AI infrastructure design

Peak FLOPS alone are a poor measure of inference performance. Memory bandwidth, memory capacity, cache management, scheduling, and interconnect design can be equally important.

Higher HBM bandwidth reduces stalls by moving weights and KV-cache data more quickly. Larger on-chip memory can keep frequently used data closer to the compute units.

KV-cache optimization, including compression, paging, reuse, and improved placement, can reduce memory pressure as sequence lengths and batch sizes increase.
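Paging is one of the easier optimizations to illustrate. The sketch below compares two hypothetical allocation policies: reserving the maximum context length for every request up front, and allocating fixed-size blocks only as tokens are generated. The request lengths, the 4,096-token maximum, and the 16-token block size are assumptions for illustration. The functions are not drawn from any particular serving framework.

```python
import math


def reserved_tokens_preallocated(lengths: list[int], max_len: int) -> int:
    """Cache slots reserved when every request gets max_len slots up front."""
    if any(n > max_len for n in lengths):
        raise ValueError("a request exceeds max_len")
    return len(lengths) * max_len


def reserved_tokens_paged(lengths: list[int], block_tokens: int) -> int:
    """Cache slots reserved when blocks are allocated only as needed."""
    if block_tokens < 1:
        raise ValueError("block_tokens must be at least 1")
    # Each request rounds up to a whole number of blocks.
    return sum(math.ceil(n / block_tokens) * block_tokens for n in lengths)


if __name__ == "__main__":
    lengths = [100, 900, 4000]  # hypothetical current sequence lengths
    prealloc = reserved_tokens_preallocated(lengths, max_len=4096)
    paged = reserved_tokens_paged(lengths, block_tokens=16)
    print(f"preallocated: {prealloc} token slots")
    print(f"paged:        {paged} token slots")
    print(f"slots freed for other requests: {prealloc - paged}")
```

The memory freed by paging can be used to admit additional requests, which in turn raises the achievable batch size for a given memory budget.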

Memory-aware scheduling can improve locality, reduce idle cycles, and use available bandwidth more effectively. Interconnect performance also becomes critical when inference is distributed across multiple devices.

Alternative accelerator architectures offer another approach. Groq's Language Processing Unit (LPU), for example, uses substantial on-chip SRAM and deterministic data movement to reduce dependence on external memory for models that fit within its architecture. Tolaga Research will examine this trade-off in a future simulation study.

More broadly, accelerator design is beginning to treat memory movement as a first-order constraint rather than a secondary implementation detail.

Investor implication

The hardware race is not simply a contest for the highest peak FLOPS.

In deployed inference, superior memory bandwidth, cache architecture, scheduling, and data movement can outweigh an advantage in theoretical compute. This is particularly important for low-latency decoding, long-context workloads, and agentic systems.

Leadership in training performance may not translate into leadership in inference. As inference workloads grow, vendors that reduce memory stalls, improve KV-cache efficiency, and increase effective utilization may achieve better performance and lower serving costs.

Strategic takeaway

The boundary between memory-bound and compute-bound inference changes how AI infrastructure should be evaluated.

Short-context workloads can approach compute-bound operation at sufficiently high batch sizes. Many real-world applications, however, cannot batch aggressively without increasing latency. Long-context and agentic workloads face the added constraint of growing KV-cache requirements.

As these workloads become more common, memory bandwidth, cache management, interconnect design, and workload-aware scheduling will play a larger role in determining cost and performance.

Inference performance is not just a compute problem. It is also a memory movement problem.