The AI Frontier

AI infrastructure decisions are becoming more complex. The performance of large-scale AI systems is no longer determined by model size or GPU count alone. It depends on how models, memory, networking, parallelism, workload behavior and serving architecture interact.

Tolaga Research has developed an AI Infrastructure Simulation Platform to evaluate these trade-offs across training, inference, agentic AI, RAG-enabled applications and optimization scenarios such as quantization, distillation and speculative decoding.

The platform provides a structured way to compare dense and Mixture-of-Experts (MoE) architectures, assess multi-node training and inference configurations, and quantify the impact of different optimization choices on latency, throughput, fabric traffic, power consumption and cost.

Why This Matters

AI infrastructure is moving from a compute-scaling problem to a system-optimization problem. Frontier models increasingly expose bottlenecks across memory bandwidth, KV-cache capacity, interconnect traffic, expert routing, pipeline bubbles and orchestration latency.

This means that infrastructure choices cannot be assessed using peak GPU FLOPS alone. A configuration that looks efficient on paper may underperform once communication, memory placement, batching, queueing, utilization and workload variability are included.

The simulation platform addresses this problem by modeling the full infrastructure stack, from model and workload inputs through to performance, power, cost and diagnostic outputs.

Platform Structure

The platform is organized around three core stages.

1. Inputs

The simulator accepts a broad set of assumptions covering the model, workload and infrastructure environment. These include model architecture, dense versus MoE configuration, layer counts, batch size, sequence length, training or serving setup, tensor parallelism, pipeline parallelism, data parallelism, expert parallelism, cluster topology, GPU nodes, network bandwidth, latency, efficiency, precision, cost and power assumptions.

It can also apply optional workload and optimization modifiers, including RAG, agentic AI workflows, quantization and speculative decoding. For agentic AI, the simulator can model multi-step reasoning loops, tool calls, sub-model calls, external latency and persistent context across iterations. This allows a single modeling framework to evaluate both conventional LLM serving and more advanced AI application patterns.

2. Simulation Engine

The simulation engine estimates compute time, memory pressure, communication overhead and contention effects.

The code structure includes explicit breakdowns for traffic, memory, timing, power and cost. Traffic modeling covers tensor-parallel all-reduce, pipeline communication, data-parallel all-reduce, MoE dispatch and MoE combine traffic. Memory modeling includes weights, optimizer state, gradients, activations, MoE routing buffers, communication buffers and ZeRO-3 prefetch overhead. Timing includes compute, HBM stalls, tensor-parallel communication, pipeline bubbles, exposed data-parallel communication, MoE all-to-all communication and straggler effects.
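As a rough illustration of the memory breakdown described above, the sketch below estimates per-GPU model-state memory (weights, gradients and optimizer state) for mixed-precision Adam training. It is not the platform's code. The function and class names are hypothetical, the byte counts (2 bytes for weights, 2 for gradients, 12 for FP32 master weights plus two Adam moments) are a common rule of thumb, and the sketch assumes that ZeRO-3 shards all three components evenly. Activations, MoE routing buffers, communication buffers and prefetch overhead are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ModelStateMemory:
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes)."""
    weights_gb: float
    gradients_gb: float
    optimizer_gb: float

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.gradients_gb + self.optimizer_gb


def estimate_model_state_memory(
    num_params: float,
    shard_degree: int = 1,
    weight_bytes: float = 2.0,      # assumption: FP16/BF16 weights
    grad_bytes: float = 2.0,        # assumption: FP16/BF16 gradients
    optimizer_bytes: float = 12.0,  # assumption: FP32 master copy + 2 Adam moments
) -> ModelStateMemory:
    """Estimate model-state memory per GPU.

    shard_degree = 1 means fully replicated state; a larger value models
    ZeRO-3-style even sharding across that many data-parallel ranks.
    """
    if shard_degree < 1:
        raise ValueError("shard_degree must be >= 1")
    params_per_gpu = num_params / shard_degree
    bytes_per_gb = 1e9
    return ModelStateMemory(
        weights_gb=params_per_gpu * weight_bytes / bytes_per_gb,
        gradients_gb=params_per_gpu * grad_bytes / bytes_per_gb,
        optimizer_gb=params_per_gpu * optimizer_bytes / bytes_per_gb,
    )


if __name__ == "__main__":
    replicated = estimate_model_state_memory(7e9)
    sharded = estimate_model_state_memory(7e9, shard_degree=8)
    print(f"replicated: {replicated.total_gb:.1f} GB per GPU")
    print(f"sharded x8: {sharded.total_gb:.1f} GB per GPU")
```

Under these assumptions a 7-billion-parameter model needs about 16 bytes per parameter of model state, which is why sharding or a smaller optimizer footprint is often required before activations are even considered.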

The platform also includes topology-aware communication modeling. It resolves bandwidth and latency across different physical tiers, including NVSwitch, intra-node, intra-leaf, inter-leaf, inter-spine and inter-super-spine domains. This is important because AI performance can change materially when traffic moves beyond the high-bandwidth local domain into broader cluster fabric.
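To make the tier effect concrete, the sketch below applies a simple latency-plus-bandwidth cost model to a ring all-reduce and evaluates it on two fabric tiers. The tier names echo the ones listed above, but the bandwidth and latency figures are placeholders chosen for illustration, not measurements or platform defaults, and the names FabricTier and ring_all_reduce_time_s are hypothetical. The sketch also assumes every rank in the ring sits in the same tier and ignores contention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricTier:
    name: str
    bandwidth_bytes_per_s: float  # per-link bandwidth available to one rank
    latency_s: float              # per-message latency


# Placeholder values for illustration only; substitute measured figures.
TIERS = {
    "nvswitch": FabricTier("nvswitch", 400e9, 2e-6),
    "inter_leaf": FabricTier("inter_leaf", 50e9, 10e-6),
}


def ring_all_reduce_time_s(message_bytes: float, num_ranks: int, tier: FabricTier) -> float:
    """Ring all-reduce estimate: 2 * (n - 1) steps, each moving 1/n of the message."""
    if num_ranks < 2:
        return 0.0
    steps = 2 * (num_ranks - 1)
    chunk_bytes = message_bytes / num_ranks
    return steps * (tier.latency_s + chunk_bytes / tier.bandwidth_bytes_per_s)


if __name__ == "__main__":
    for tier in TIERS.values():
        t = ring_all_reduce_time_s(1e9, 8, tier)
        print(f"{tier.name}: {t * 1e3:.2f} ms for a 1 GB all-reduce over 8 ranks")
```

With these placeholder numbers the same collective is roughly eight times slower once it leaves the local domain, which is the kind of shift a topology-aware model is meant to surface.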

3. Outputs

The simulator produces outputs across five categories:

Performance: latency, time to first token, tokens per second, requests per second and model FLOPS utilization.

Infrastructure: GPU memory, KV-cache requirements, HBM stalls, communication traffic across GPU interconnects and cluster networks, and bandwidth demand.

Cost and power: cloud GPU cost, hardware amortization, electricity cost, facility power and cost per million tokens.

Diagnostics: warnings, out-of-memory checks, contention effects, collective algorithm selection and topology-aware bottleneck identification.

Scenario comparisons: dense versus MoE, training versus inference, RAG, agentic AI, quantization, speculative decoding and distillation comparisons.

For training scenarios, the platform captures step time, active training FLOPS, dense-equivalent FLOPS, memory consumption, communication overhead, pipeline efficiency, contention effects, power and cost per step. For inference scenarios, it captures total parameters, active parameters per token, GPU count, timing, memory, traffic, latency, throughput, utilization, reservation efficiency, cost per million tokens and p95 latency.
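Two of the outputs above, model FLOPS utilization and cost per million tokens, can be approximated with short formulas. The sketch below is a minimal illustration, not the platform's method. It assumes the common approximation of about 6 FLOPs per active parameter per token for training (about 2 for an inference forward pass), ignores attention FLOPs, and counts only GPU rental cost. All function and parameter names are hypothetical.

```python
def model_flops_utilization(
    active_params: float,
    tokens_per_s: float,
    peak_flops_per_gpu: float,
    num_gpus: int,
    flops_per_param_token: float = 6.0,  # assumption: ~6 for training, ~2 for inference
) -> float:
    """Fraction of aggregate peak FLOPS spent on useful model computation."""
    achieved_flops = flops_per_param_token * active_params * tokens_per_s
    return achieved_flops / (peak_flops_per_gpu * num_gpus)


def cost_per_million_tokens(gpu_hour_price: float, num_gpus: int, tokens_per_s: float) -> float:
    """GPU rental cost per one million tokens at a sustained token rate."""
    cluster_cost_per_s = gpu_hour_price * num_gpus / 3600.0
    return cluster_cost_per_s / tokens_per_s * 1e6


if __name__ == "__main__":
    mfu = model_flops_utilization(7e9, 1e5, 1e15, 8)
    cost = cost_per_million_tokens(gpu_hour_price=2.0, num_gpus=8, tokens_per_s=5000)
    print(f"MFU: {mfu:.3f}")
    print(f"Cost per million tokens: ${cost:.4f}")
```

For an MoE model, passing active rather than total parameters matters: it is the difference between the active training FLOPS and the dense-equivalent FLOPS mentioned above.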

Key Simulation Pathways

Training Simulation

The training pathway models step time; tensor-, data-, pipeline- and expert-parallel communication; activation memory; optimizer memory; pipeline overhead; and training throughput. It supports both dense and Mixture-of-Experts (MoE) configurations, including the additional all-to-all communication, expert routing and load-imbalance effects introduced by MoE architectures. This makes it useful for evaluating how scaling strategies affect training efficiency across multi-GPU and multi-node environments.

The training result structure includes active training FLOPS, dense-equivalent FLOPS, MFU, memory, traffic, power, cost, arithmetic intensity, communication bandwidth demand, contention and pipeline overhead.
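Pipeline overhead is one of the easier terms to reason about by hand. The sketch below uses the standard bubble estimate for a non-interleaved pipeline schedule with equally sized stages, where p stages and m microbatches give an idle fraction of (p - 1) / (m + p - 1). This is a textbook approximation offered for illustration. The function names are hypothetical and the source does not state which schedule the platform assumes.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a non-interleaved pipeline with equal-cost stages."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def pipelined_step_time_s(ideal_compute_s: float, num_stages: int, num_microbatches: int) -> float:
    """Stretch the bubble-free compute time by the pipeline bubble."""
    bubble = pipeline_bubble_fraction(num_stages, num_microbatches)
    return ideal_compute_s / (1.0 - bubble)


if __name__ == "__main__":
    for m in (4, 12, 60):
        frac = pipeline_bubble_fraction(num_stages=4, num_microbatches=m)
        print(f"4 stages, {m} microbatches: bubble = {frac:.3f}")
```

Raising the microbatch count shrinks the bubble but increases activation memory held in flight, which is one reason the pathway tracks both terms together.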

Inference Simulation

The inference pathway models prefill, decode, KV-cache growth, batching, latency, throughput, memory access and communication overhead. It also captures dense versus MoE serving behavior, including active parameters per token, expert routing traffic, MoE all-to-all communication and the impact of expert parallelism on latency and throughput.
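The dense versus MoE contrast can be sketched with two small calculations: active parameters per token, and the all-to-all traffic generated by one MoE layer. The example is illustrative only. It assumes uniform routing, so that a fraction (ep - 1) / ep of routed tokens leaves the local GPU, and it assumes combine traffic mirrors dispatch traffic. The function names and the example model shape are hypothetical.

```python
def moe_param_counts(shared_params: float, params_per_expert: float,
                     num_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total_params, active_params_per_token) for a simplified MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


def moe_all_to_all_bytes(num_tokens: int, hidden_size: int, top_k: int,
                         ep_degree: int, bytes_per_elem: float = 2.0) -> tuple[float, float]:
    """Return (dispatch_bytes, combine_bytes) crossing GPUs for one MoE layer.

    Assumes uniform routing across ep_degree expert-parallel ranks.
    """
    if ep_degree < 1:
        raise ValueError("ep_degree must be >= 1")
    remote_fraction = (ep_degree - 1) / ep_degree
    dispatch = num_tokens * top_k * hidden_size * bytes_per_elem * remote_fraction
    combine = dispatch  # expert outputs travel back along the same routes
    return dispatch, combine


if __name__ == "__main__":
    total, active = moe_param_counts(2e9, 5e9, num_experts=8, top_k=2)
    dispatch, combine = moe_all_to_all_bytes(4096, 4096, top_k=2, ep_degree=8)
    print(f"total params: {total:.2e}, active per token: {active:.2e}")
    print(f"per-layer all-to-all: {(dispatch + combine) / 1e6:.1f} MB")
```

In this example fewer than a third of the parameters are active per token, yet every MoE layer still adds two all-to-all exchanges to the critical path.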

This is especially important because inference bottlenecks are often different from training bottlenecks. Large-model decode can become constrained by memory bandwidth and KV-cache capacity before raw compute is fully utilized, while MoE inference can introduce additional networking and coordination overhead even when fewer parameters are active per token.
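A back-of-the-envelope check shows why decode is often memory-bound. The sketch below sizes the KV cache and compares the time to stream weights and cache from HBM against the time to perform the matrix math for one decode step. It is a simplified roofline-style estimate under stated assumptions: about 2 FLOPs per active parameter per token, weights read once per step, and no overlap, batching inefficiency or communication. The function names and hardware figures are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: keys and values for every layer, head, position and sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def decode_step_bound(active_params: float, kv_bytes: float, batch_size: int,
                      hbm_bytes_per_s: float, peak_flops: float,
                      weight_bytes_per_param: float = 2.0) -> tuple[str, float, float]:
    """Return (limiting_resource, memory_time_s, compute_time_s) for one decode step."""
    memory_s = (active_params * weight_bytes_per_param + kv_bytes) / hbm_bytes_per_s
    compute_s = 2.0 * active_params * batch_size / peak_flops
    bound = "memory" if memory_s > compute_s else "compute"
    return bound, memory_s, compute_s


if __name__ == "__main__":
    # Assumed model shape: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
    kv = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
    bound, mem_s, cmp_s = decode_step_bound(7e9, kv, 1, hbm_bytes_per_s=2e12, peak_flops=1e15)
    print(f"KV cache: {kv / 2**30:.1f} GiB")
    print(f"decode step is {bound}-bound ({mem_s * 1e3:.2f} ms memory vs {cmp_s * 1e3:.3f} ms compute)")
```

Larger batches raise the compute term while the weight-read term stays fixed, so the crossover point depends on batch size, sequence length and KV-cache growth together.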

Agentic AI and RAG

The platform includes explicit support for agentic workloads and retrieval-augmented generation. Agentic workloads are modeled as multi-loop processes with variable reasoning steps, tool calls, sub-model calls, persistent KV-cache state, external latency and bursty traffic behavior.

RAG modeling includes retrieval shape, top-k chunks, retrieved context tokens, embedding latency, vector search latency, re-rank latency, context compression, orchestration latency, retrieval network traffic, vector database memory and cache hit rates.

This matters because agentic and RAG-enabled systems introduce orchestration and memory behavior that are not captured by conventional tokens-per-second benchmarks.
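The gap between tokens-per-second figures and end-to-end behavior can be shown with a toy model of an agentic loop with optional retrieval. The sketch below is a deliberately simplified illustration, not the platform's model. It assumes fixed prefill and decode rates, a persistent KV cache so that only newly added tokens are prefilled on each iteration, one tool call between consecutive steps, and no queueing or burstiness. All names and parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RagStage:
    embed_ms: float
    search_ms: float
    rerank_ms: float
    orchestration_ms: float
    top_k: int
    tokens_per_chunk: int

    def retrieval_latency_ms(self) -> float:
        return self.embed_ms + self.search_ms + self.rerank_ms + self.orchestration_ms

    def context_tokens(self) -> int:
        return self.top_k * self.tokens_per_chunk


def simulate_agent_loop(steps: int, prompt_tokens: int, tool_result_tokens: int,
                        output_tokens: int, prefill_tps: float, decode_tps: float,
                        tool_latency_ms: float, rag: Optional[RagStage] = None) -> dict:
    """Accumulate GPU time and external wait time over a multi-step agent request."""
    gpu_ms = 0.0
    external_ms = 0.0
    new_tokens = prompt_tokens
    if rag is not None:
        external_ms += rag.retrieval_latency_ms()
        new_tokens += rag.context_tokens()
    cached_tokens = 0
    for step in range(steps):
        gpu_ms += 1000.0 * new_tokens / prefill_tps    # prefill only the new tokens
        gpu_ms += 1000.0 * output_tokens / decode_tps  # generate this step's output
        cached_tokens += new_tokens + output_tokens    # context persists across steps
        if step < steps - 1:
            external_ms += tool_latency_ms             # GPU memory stays reserved while waiting
            new_tokens = tool_result_tokens
    total_ms = gpu_ms + external_ms
    return {
        "total_ms": total_ms,
        "gpu_busy_fraction": gpu_ms / total_ms,
        "final_context_tokens": cached_tokens,
    }


if __name__ == "__main__":
    rag = RagStage(embed_ms=10, search_ms=20, rerank_ms=30, orchestration_ms=40,
                   top_k=5, tokens_per_chunk=200)
    result = simulate_agent_loop(steps=3, prompt_tokens=1000, tool_result_tokens=200,
                                 output_tokens=100, prefill_tps=10_000, decode_tps=100,
                                 tool_latency_ms=500, rag=rag)
    print(result)
```

The gpu_busy_fraction value is a crude stand-in for reservation efficiency: the KV cache occupies GPU memory for the whole request, while the GPU does useful work for only part of it.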

Scenario Sweeps

The simulator also supports structured scenario sweeps, including quantization and speculative decoding. The code includes quantization assumptions for fp16, int8 and int4, along with speedup factors and task-complexity penalties. It also models speculative decoding using acceptance rates, draft model cost ratios and practical speedup limits.

These sweeps allow users to evaluate where optimizations reduce cost and latency, and where they introduce capability, quality or infrastructure trade-offs.
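A sweep of this kind can be sketched with two small functions. The first maps precision to weight memory. The second uses the standard expected-speedup expression for speculative decoding, which assumes each drafted token is accepted independently with the same probability, and applies an optional cap to stand in for a practical speedup limit. The speedup factors and task-complexity penalties mentioned above are not reproduced here, because the source does not give their values. All names and sweep values are hypothetical.

```python
from typing import Optional

# Bytes per weight for each precision covered by the sweep.
WEIGHT_BYTES = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def quantized_weight_gb(num_params: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring scales and zero-points."""
    return num_params * WEIGHT_BYTES[precision] / 1e9


def speculative_speedup(acceptance_rate: float, draft_len: int, draft_cost_ratio: float,
                        max_speedup: Optional[float] = None) -> float:
    """Expected speedup versus plain decoding.

    acceptance_rate:  probability that a drafted token is accepted (assumed i.i.d.)
    draft_len:        number of tokens drafted per verification pass
    draft_cost_ratio: cost of one draft-model step relative to one target-model step
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        expected_tokens = float(draft_len + 1)
    else:
        expected_tokens = (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)
    cost_per_pass = draft_len * draft_cost_ratio + 1.0
    speedup = expected_tokens / cost_per_pass
    if max_speedup is not None:
        speedup = min(speedup, max_speedup)
    return speedup


if __name__ == "__main__":
    for precision in WEIGHT_BYTES:
        print(f"{precision}: {quantized_weight_gb(70e9, precision):.1f} GB of weights")
    for alpha in (0.5, 0.7, 0.9):
        print(f"acceptance {alpha}: speedup = {speculative_speedup(alpha, 4, 0.1):.2f}x")
```

Sweeping the acceptance rate shows how quickly the benefit erodes when the draft model tracks the target poorly, which is the kind of trade-off the scenario sweeps are meant to expose.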

Research Value

The platform complements, rather than replaces, empirical benchmarking and production validation. It is designed to support early-stage planning, scenario comparison and bottleneck analysis by linking workload assumptions to infrastructure behavior. Deployment decisions can then be refined using measured performance data from the target hardware, software stack and operating environment. The platform helps answer questions such as:

How does MoE change communication traffic and latency compared with dense models?

When does inference become memory-bound rather than compute-bound?

How do KV-cache, batch size and sequence length affect serving economics?

How much do RAG and agentic loops increase latency and GPU reservation inefficiency?

Where do quantization or speculative decoding reduce cost without materially degrading capability?

Which parts of the cluster fabric become bottlenecks as workloads scale?

Answering these questions gives a more realistic view of AI system economics than simple GPU-count or peak-FLOPS comparisons.

From Infrastructure Complexity to Decision Intelligence

AI infrastructure performance is increasingly determined by the interaction between compute, memory, networking, model architecture and orchestration. The Tolaga Research AI Infrastructure Simulation Platform provides a structured way to evaluate these interactions across training, inference, RAG, agentic AI and optimization scenarios.

Its core value is not simply estimating speed or cost. It helps identify why a given configuration performs as it does, where bottlenecks emerge and how changes in model architecture or workload behavior affect infrastructure requirements.

As AI systems become more distributed, more memory-intensive and more agentic, full-stack simulation becomes an important tool for understanding both technical performance and unit economics.