The AI Frontier

Key Takeaways

  • AI is evolving into a layered, coordinated system that has important similarities to the structure of modern healthcare.
  • AI competition is shifting from model scale to infrastructure efficiency.
  • Moving data is becoming as important as processing it.
  • Optimization is rapidly reducing the cost of intelligence and diminishing the moat around frontier models.
  • Agentic AI shifts the bottleneck from computation to coordination.

History shows that technology pioneers are often not the long-term winners. Successful markets attract waves of investment, innovation, and competition that can overwhelm early leaders. Netscape was eclipsed by Microsoft and later Google, while MySpace gave way to Facebook.

Tech investors recognize these same dynamics in Artificial Intelligence (AI) today. While the opportunity is enormous, the pace of innovation is relentless. Competition is intense, market positions are shifting rapidly, and companies across virtually every industry are racing to secure their place in the emerging AI ecosystem.

How AI will ultimately evolve remains uncertain. To explore the possibilities, we use healthcare as a framework for understanding the AI landscape. While no analogy is perfect, it provides a useful lens to examine the roles, dependencies, bottlenecks, and competitive dynamics emerging across the AI value chain. Using this framework, we investigate several common blind spots relating to compute, connectivity, memory, and orchestration, and consider how these factors may influence the competitive positioning of AI infrastructure and platform providers as the market evolves.

In related research notes, we use the Tolaga AI Infrastructure Simulator to examine how key AI innovations affect infrastructure cost, performance, and efficiency.

AI Infrastructure Stack

Part One: The Analogy

Modern healthcare manages complexity through a layered system of GP clinics, regional care centers, major hospitals, specialists, and care coordinators. Expertise is applied selectively, escalating resources only when needed and coordinating specialized capabilities at scale. Frontier AI infrastructure mirrors this approach, providing a useful framework for understanding how AI systems allocate intelligence and manage cost, and where bottlenecks are emerging.

The GP Clinic: Small Language Models

Most healthcare interactions are handled by GP clinics, with only the more complex cases escalated to specialists or major hospitals. Small Language Models (SLMs) play a similar role in AI, handling everyday workloads efficiently while reserving frontier models for tasks that require greater reasoning capability.

SLMs can be deployed locally, at the edge, or as part of larger AI platforms. Their advantages are lower cost, lower latency, and greater deployment flexibility.

Like GPs, however, SLMs have limits. While increasingly capable, the most complex reasoning and long-horizon tasks may still require escalation to frontier models. In both healthcare and AI, the challenge is determining when escalation is needed and ensuring it occurs efficiently.

The Regional Clinic: Shared Infrastructure for Intermediate Workloads

Not all workloads can be handled efficiently by small models, nor do they all justify dedicated hyperscale infrastructure. Between these extremes, an intermediate layer of AI infrastructure is emerging to balance capability, cost, latency, and scale.

Like regional healthcare networks, this layer provides shared resources that bridge routine and highly specialized care. It includes specialized AI cloud providers such as CoreWeave, Nebius, and Lambda, as well as distributed infrastructure from telecommunications equipment vendors such as Nokia and Ericsson.

This layer enables organizations to access advanced AI capabilities, including frontier models, without the cost and complexity of building and operating hyperscale infrastructure themselves.

The Major Hospital: Frontier Models

The major hospital is the central coordinating hub of the healthcare system. It integrates across specialties, maintains a comprehensive view of the patient, and provides the highest level of expertise for complex cases. Frontier models, such as the largest variants of GPT, Claude, Gemini, Llama, and Grok, play a similar role. They maintain coherence across massive contexts and provide general-purpose reasoning across a wide range of tasks.

The tradeoff is resource consumption. Delivering frontier-scale reasoning is computationally expensive, regardless of the underlying architecture. As with major hospitals, the challenge is not their value but their efficient utilization. Frontier models remain essential for the most demanding workloads, but applying their full capabilities to routine tasks quickly becomes uneconomic at scale.

The Specialist Departments: Mixture of Experts

Hospitals scale expertise through specialization; not every patient needs to see every specialist. Modern AI systems built on Mixture-of-Experts (MoE) architectures apply a similar principle. Systems such as Mixtral, Gemini MoE variants, Grok, and DeepSeek-MoE contain numerous expert subnetworks but activate only a small, relevant subset per inference. This expands total model capability without proportionally increasing active compute requirements.

Like hospital specialists, however, MoE experts do not operate in isolation. They share context, representations, and coordination infrastructure. Crucially, specialization introduces a new bottleneck. Just as complex referral networks create scheduling and administrative overhead, MoE systems generate substantial routing, synchronization, and all-to-all network traffic across GPUs.
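The selective activation described above can be illustrated with a minimal sketch of top-k routing. It assumes a router that scores every expert and keeps the k highest, with a softmax over the selected scores; this is one common gating choice, not a description of any specific model named here. The function names, layer sizes, and router scores are hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Keep the k experts with the largest router scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k, expert_params, shared_params):
    """Share of all parameters that are used to process a single token."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


# Hypothetical layer: 8 experts, 2 active per token (illustrative numbers only).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
routes = top_k_route(logits, k=2)
frac = active_fraction(num_experts=8, k=2, expert_params=5e9, shared_params=2e9)
print("selected experts:", routes)
print("active parameter share:", round(frac, 3))
```

Total capacity grows with the number of experts, while per-token compute grows only with k. The cost shows up elsewhere: each selected expert may sit on a different device.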

Care Coordination: Agentic AI

Complex medical cases require a coordination layer to determine which specialists to engage, when to request tests, and how to sequence care over time. Agentic AI systems play a similar role. Rather than merely responding to a single prompt, agents decompose goals, route tasks to specific tools or sub-agents, retrieve external data, and manage multi-step workflows.

As AI becomes more agentic, the primary challenge shifts from individual model capability to system-wide execution reliability. Success depends on maintaining coherence across long-running workflows, managing dependencies, handling failures, and allocating resources efficiently.
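To make the coordination problem concrete, the sketch below runs a small multi-step workflow in dependency order and retries a step that fails. It is a simplified illustration rather than a production orchestrator: the task names, the `run_workflow` helper, and the simulated tool timeout are all hypothetical, and real systems would add timeouts, parallelism, and persistent state.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps, max_retries=1):
    """Run tasks in dependency order, retrying each failed task up to max_retries times.

    tasks: name -> callable taking the dict of results produced so far
    deps:  name -> set of task names that must finish first
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                results[name] = tasks[name](results)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up once the retry budget is exhausted
    return results


# Hypothetical tool that times out on its first call and succeeds on the second.
calls = {"count": 0}


def flaky_api_call(results):
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("tool timed out")
    return "api data"


tasks = {
    "retrieve": lambda results: "documents",
    "call_api": flaky_api_call,
    "summarize": lambda results: results["retrieve"] + " + " + results["call_api"],
}
deps = {"retrieve": set(), "call_api": set(), "summarize": {"retrieve", "call_api"}}

results = run_workflow(tasks, deps)
print(results["summarize"])
```

Even in this toy version, most of the logic concerns ordering and failure handling rather than model capability, which is the shift the section describes.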

External Systems: Tools, Retrieval, APIs

No hospital operates in isolation; it relies on an interconnected ecosystem of laboratories, imaging centers, pharmacies, and referral networks. AI systems are evolving similarly. Retrieval systems, vector databases, enterprise APIs, and software applications increasingly function as supporting infrastructure around the core model. Platforms such as Pinecone, Weaviate, and Qdrant provide dedicated retrieval and long-term memory capabilities, while orchestration platforms bridge models to enterprise data, tools, and workflows.

As in healthcare, overall system effectiveness increasingly depends on how well its components work together. Frontier AI is no longer defined by model capability alone, but also by how effectively it accesses information, invokes tools, and coordinates resources across a broader ecosystem.

AI Frontier - Healthcare Analogy

Where the Analogy Breaks

Taken together, AI systems are evolving from monolithic models into coordinated networks of specialized components. While structurally useful, the healthcare comparison has several important limitations:

Physical Geography

Healthcare architectures are fundamentally anchored by physical geography; patients and providers must exist in the same physical world. AI architectures are far more flexible. While geography matters under specific conditions such as data sovereignty, data locality, and latency requirements, AI systems are primarily organized around capability, economics, and performance rather than physical proximity.

Statefulness vs. Statelessness

Healthcare systems are deeply stateful. Patients accumulate persistent medical records and identities over time. In contrast, most modern AI systems remain largely stateless, operating with restricted context windows and limited long-term continuity between sessions.

Opaque vs. Auditable Specialization

Human specialists have explicit credentials and decision processes. MoE experts do not. Their specializations emerge during training and remain difficult to inspect or explain.

Accountability and Governance

Clinical environments operate within strict frameworks of licensing, liability, and documented referral chains. AI systems largely lack equivalent governance mechanisms. However, this may need to change if AI architectures are to operate safely in high-stakes domains.

Part Two: The Implications

While the hospital analogy has its limitations, it provides a useful framework for understanding how AI infrastructure is evolving and highlights several common blind spots relating to compute, connectivity, memory, orchestration, and competitive positioning.

Blindspot: Compute is not always the bottleneck

What many people think

AI performance is primarily about compute. More FLOPS means more performance.

Why people think this

Historically, AI infrastructure was optimized around training workloads, where large batch sizes and massive matrix multiplications made raw compute throughput the dominant constraint.

What is actually happening

Frontier AI inference, particularly at low latency, is often memory-bandwidth-bound rather than compute-bound. GPUs frequently complete arithmetic operations and then wait for model weights, KV cache data, memory transfers, or synchronization across tensor-parallel devices. As a result, expensive compute resources can sit partially idle while data movement becomes the real bottleneck.

The hospital equivalent is a surgical team waiting for patient records or imaging scans before surgery can begin. The theatre is ready, the staff are present, but the procedure is delayed because critical information has not yet arrived.
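A rough roofline-style estimate shows why low-latency inference tends to be limited by memory bandwidth. The sketch assumes a dense model, about 2 FLOPs per parameter per generated token, and weights streamed from memory once per decode step; it ignores KV cache traffic and synchronization. The hardware and model figures are hypothetical round numbers, not vendor specifications.

```python
def decode_step_time(params, batch, bytes_per_param, peak_flops, mem_bandwidth):
    """Lower bound on one decode step (seconds) and which resource sets it."""
    flops = 2.0 * params * batch               # ~2 FLOPs per parameter per token
    bytes_moved = params * bytes_per_param     # weights read once per step
    t_compute = flops / peak_flops
    t_memory = bytes_moved / mem_bandwidth
    bound = "memory" if t_memory > t_compute else "compute"
    return max(t_compute, t_memory), bound


# Hypothetical accelerator: 1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth.
# Hypothetical model: 70e9 parameters stored at 2 bytes each.
for batch in (1, 64, 512):
    seconds, bound = decode_step_time(70e9, batch, 2, 1e15, 3e12)
    print(f"batch={batch:4d}  step={seconds * 1e3:7.2f} ms  bound={bound}")
```

Under these assumptions the step time does not change at small batch sizes because it is set entirely by how fast the weights can be read. Compute becomes the limit only once the batch is large enough to amortize that read.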

When Memory not Compute Limits AI Inference

Implication for Investors and Practitioners

The hardware race is not just about peak FLOPS. In real inference workloads, a chip with significantly higher memory bandwidth can outperform one with greater theoretical compute. This gap between benchmark performance and deployed performance is increasingly where custom silicon vendors compete.

See our research note: When Memory not Compute Limits AI Inference

Blindspot: MoE Efficiency Comes with a Networking Tax

What many people think

MoE is more efficient than dense models because only a fraction of parameters activate per token. This is correct as far as it goes.

However

MoE models generate significant communication traffic because tokens must be routed to experts distributed across multiple GPUs. Within a node, and in some systems across a tightly coupled rack-scale GPU domain, technologies such as NVLink and NVSwitch can reduce this overhead. The challenge becomes more difficult as workloads span multiple nodes, racks, or clusters, where all-to-all communication, synchronization, routing overhead, and expert load imbalance can become major sources of latency, sometimes eroding the theoretical efficiency gains that make MoE architectures attractive in the first place.

This exposes a broader infrastructure reality: at frontier scale, AI performance depends not only on GPU performance, but also on how efficiently data moves within servers, across racks, between data centres, and ultimately to users.

The hospital equivalent is a specialist referral network. Referrals can improve outcomes by directing patients to the most appropriate expertise, but they also introduce coordination, scheduling, and handoff overhead. As these interactions scale, the referral process itself can consume a meaningful share of total care time. At sufficient scale, a poorly coordinated specialist network can become slower than a well-run general hospital despite having superior expertise.

MoE efficiency gains are real, but they depend heavily on the surrounding infrastructure. Performance is shaped as much by routing efficiency, synchronization overhead, and interconnect topology as by the architecture itself. Poorly optimized MoE deployments can underperform dense models despite their theoretical compute advantages, while the largest gains increasingly rely on custom networking fabrics and topology-aware orchestration.
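A back-of-envelope estimate of the networking tax is sketched below. It assumes each token sends its hidden vector to its selected experts and receives the results back, that experts are spread evenly across GPUs, and that every GPU transfers its share in parallel. The model dimensions and the two bandwidth figures are hypothetical, chosen only to contrast a fast intra-node fabric with a slower inter-node link.

```python
def all_to_all_bytes(tokens, hidden_dim, top_k, moe_layers, num_gpus,
                     bytes_per_value=2):
    """Approximate off-device bytes moved by expert routing in one forward pass."""
    # Dispatch to top_k experts plus the combine step on the way back: factor of 2.
    per_token_per_layer = 2 * top_k * hidden_dim * bytes_per_value
    # With experts spread evenly, (num_gpus - 1) / num_gpus of destinations are remote.
    remote_share = (num_gpus - 1) / num_gpus
    return tokens * moe_layers * per_token_per_layer * remote_share


def transfer_time_ms(total_bytes, num_gpus, per_gpu_bandwidth):
    """Idealized transfer time (ms) if every GPU sends its share in parallel."""
    return total_bytes / (num_gpus * per_gpu_bandwidth) * 1e3


# Hypothetical workload: 4096 tokens, hidden size 4096, top-2 routing, 32 MoE layers, 8 GPUs.
traffic = all_to_all_bytes(tokens=4096, hidden_dim=4096, top_k=2,
                           moe_layers=32, num_gpus=8)
fast_ms = transfer_time_ms(traffic, 8, 450e9)   # assumed intra-node fabric, bytes/s
slow_ms = transfer_time_ms(traffic, 8, 50e9)    # assumed inter-node link, bytes/s
print(f"traffic per pass: {traffic / 1e9:.2f} GB")
print(f"fast fabric: {fast_ms:.2f} ms, slow link: {slow_ms:.2f} ms")
```

The same routing traffic costs several times more once it leaves the tightly coupled domain. The estimate also ignores load imbalance and synchronization stalls, so real overheads can be higher.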

MoE networking tax and all-to-all communication overhead

See our research note: When Experts Become the Bottleneck.

Networking Is Becoming Core AI Infrastructure

The implication extends beyond MoE. Distributed training, large-scale inference, RAG systems, agentic workflows, and edge deployment all depend on fast, reliable movement of data across the AI stack. Networking is no longer a secondary layer beneath compute. It is becoming a primary determinant of latency, throughput, utilization, and cost.

The challenge also extends beyond the data centre. User experience depends not only on model capability but also on the path between the inference system and the user. In some cases, smaller models deployed closer to users can deliver a better experience than more capable models running in distant centralized facilities.

For investors, this suggests that networking infrastructure may be materially underappreciated relative to GPUs. As AI systems become more distributed, companies providing high-bandwidth interconnects, low-latency fabrics, and advanced networking architectures are developing durable competitive advantages that are difficult to replicate.

Blindspot: Inference Is Not a Single Market

Much of the discussion around AI infrastructure focuses on the distinction between training and inference. While the two have different requirements, an equally important shift is occurring within inference itself. The market is fragmenting into workloads with very different economic and architectural needs.

Regime A: High-frequency, low-complexity inference

Consumer chat, coding assistants, API inference, and real-time agentic tasks prioritize throughput, latency, utilization, and low cost per token. These workloads increasingly rely on distilled and quantized models, speculative decoding, custom silicon, and regional or edge deployments closer to users. In healthcare terms, this resembles GP clinics and regional clinics: high-volume systems optimized for speed, efficiency, and accessibility.

Regime B: Low-frequency, high-complexity inference

Research synthesis, scientific discovery, legal analysis, and deep reasoning workloads prioritize capability over cost. These systems optimize for reasoning depth, large context retention, reliability, and frontier-scale infrastructure. This is the equivalent of specialist surgery or tertiary care: expensive, centralized, and reserved for the most complex cases.

The implication is that AI infrastructure is no longer converging toward a single architecture. It is fragmenting into layered systems optimized for fundamentally different operational and economic realities.

For investors, Regime A and Regime B may appear similar but increasingly operate under different economic models. Regime A rewards efficiency, latency, and scale economics, while Regime B rewards reasoning depth, capability density, and outcome quality. Treating them as a single market risks mispricing both.

Blindspot: AI Economics Are Being Reshaped by Optimization

Most discussion around AI focuses on model architectures and GPUs. In practice, however, a growing set of optimization techniques is reshaping inference economics. Quantization, distillation, speculative decoding, and custom silicon can dramatically improve cost, latency, throughput, and infrastructure efficiency, often without requiring larger or more powerful models.

Quantization: Compressing Model Weights to Lower Inference Costs

Quantization reduces the precision of model weights, typically from 16-bit or 32-bit floating-point formats to 8-bit or 4-bit formats. By reducing the amount of data that must be stored and moved through the system, it lowers memory requirements and improves inference efficiency. Modern quantization techniques preserve most model quality while delivering meaningful improvements in memory efficiency, latency, and cost. More aggressive compression, such as 4-bit quantization, can deliver further efficiency gains but may introduce tradeoffs on complex reasoning tasks.

The hospital equivalent is using a concise patient summary rather than storing a complete medical record in every room. For most consultations, the summary contains enough information while significantly reducing storage and retrieval overhead.
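The tradeoff between footprint and precision can be seen in a minimal sketch of symmetric quantization with a single shared scale. Production schemes are more elaborate (per-channel or per-group scales, calibration data), so this is an illustration of the principle only. The weight values and the 70e9 parameter count are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params, bits):
    """Memory needed to store the weights alone, in gigabytes."""
    return num_params * bits / 8 / 1e9


def max_error(weights, bits):
    """Largest absolute round-trip error at the given bit width."""
    quantized, scale = quantize_symmetric(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [0.42, -1.27, 0.05, 0.88]   # hypothetical weight values
for bits in (8, 4):
    print(f"{bits}-bit: max round-trip error = {max_error(weights, bits):.4f}, "
          f"70e9 params need {weight_memory_gb(70e9, bits):.0f} GB")
```

Halving the bit width halves the data that must be stored and moved, while the rounding error grows. This is why more aggressive formats need closer evaluation on demanding tasks.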


See our research note: Trading Precision for Efficiency with Quantization.

Distillation

Model distillation transfers selected capabilities from larger frontier models into smaller, cheaper deployment models. It is not a general replacement for frontier AI, but it can materially improve inference economics when the workload is narrow, repeatable, and measurable.

The strongest opportunity is in tiered inference. Routine and well-understood requests can be served by smaller distilled models, while complex, ambiguous, high-risk, or high-value cases are routed to larger frontier models. In this architecture, the value comes not just from the smaller model, but from the routing, evaluation, monitoring, and escalation system around it.

The economics can be significant, but they depend on routing quality. Based on illustrative results from Tolaga's AI Infrastructure Simulator, a distilled 7B model reduces latency to roughly 0.58x to 0.69x of the 70B baseline, implying about 1.45x to 1.73x faster response times. Throughput improves by a similar 1.45x to 1.73x, while modeled serving cost falls by a factor of approximately 1.7 to 12.3 across the workloads tested. However, these savings depend on how much traffic can safely remain on the smaller model. If a meaningful share of requests still needs to escalate to a 70B or frontier model, the effective cost advantage narrows.
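The effect of escalation on the cost advantage can be sketched with simple arithmetic. The example assumes every request is first served by the small model and that an escalated request also pays the full cost of the large model; other routing policies would give different numbers. The 12.3x per-request ratio is the upper end of the range quoted above, and the escalation rates are hypothetical.

```python
def blended_cost(cost_small, cost_large, escalation_rate):
    """Average cost per request when a share of traffic escalates to the large model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    # Assumption: escalated requests pay for the small-model attempt and the large model.
    return cost_small + escalation_rate * cost_large


def savings_factor(cost_small, cost_large, escalation_rate):
    """How many times cheaper the tiered system is than sending everything to the large model."""
    return cost_large / blended_cost(cost_small, cost_large, escalation_rate)


cost_large = 1.0                  # normalized cost per request on the large model
cost_small = cost_large / 12.3    # best-case ratio from the range quoted above
for rate in (0.0, 0.05, 0.2, 0.5):
    print(f"escalation {rate:4.0%}: "
          f"{savings_factor(cost_small, cost_large, rate):.2f}x cheaper")
```

Under these assumptions, even a modest escalation rate removes most of the headline saving. Routing quality therefore matters as much as the size of the distilled model.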

Inference economics and tiered model deployment

Public benchmark results also show the limits of distillation. Smaller distilled models can retain strong performance on selected tasks, but performance usually weakens as tasks become broader, more complex, or less similar to the distillation data. This is especially important for coding, scientific reasoning, multi-step reasoning, and unfamiliar edge cases. As a result, workload-specific evaluation is essential before production deployment.

The broader implication is that AI deployment is becoming more layered. Frontier models remain important for new capability creation, synthetic data generation, difficult reasoning, and validation. Distilled models create value when they execute defined workloads more efficiently. Competitive advantage shifts toward organizations that can match each task to the right model tier while maintaining quality, governance, and cost control.

See our research report: Model Distillation and the Diffusion of AI Capability

Speculative Decoding

Speculative decoding improves inference efficiency by using a small, fast draft model to predict several tokens ahead. The larger model then verifies these predictions in parallel, so each accepted token saves a sequential pass through the larger model. Higher acceptance rates translate directly into lower latency and higher throughput.

The technique is most effective for predictable workloads such as code generation and structured outputs. More creative or open-ended tasks typically produce smaller gains because the draft model's predictions are less likely to be accepted.
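The link between acceptance rate and speedup can be approximated with a short calculation. The sketch assumes each drafted token is accepted independently with the same probability, which is a simplification, and that one verification pass always yields at least one token from the larger model. The acceptance rates and draft length are hypothetical.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens produced per verification pass of the larger model.

    The first i drafted tokens are all accepted with probability acceptance_rate ** i,
    and the larger model always contributes one token itself, so the expectation
    is 1 + a + a**2 + ... + a**draft_len.
    """
    return sum(acceptance_rate ** i for i in range(draft_len + 1))


# Hypothetical acceptance rates for a predictable and a more open-ended workload.
for label, rate in (("structured output", 0.8), ("open-ended text", 0.4)):
    tokens = expected_tokens_per_pass(rate, draft_len=4)
    print(f"{label}: about {tokens:.2f} tokens per large-model pass")
```

With a draft length of four, the higher acceptance rate yields roughly twice as many tokens per pass as the lower one. This is consistent with the larger gains seen on predictable workloads. The figures exclude the cost of running the draft model itself.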

Look out for our upcoming research note: The Prediction Dividend: Accelerating AI with Speculative Decoding

Custom Silicon

Custom silicon increasingly targets data movement and memory efficiency. Architectures such as TPUs and LPUs trade flexibility for lower latency, higher efficiency, and more predictable performance.

As a result, the same AI model can produce very different latency, throughput, and cost profiles depending on the underlying architecture, even when theoretical FLOPS are similar. Increasingly, inference performance depends as much on moving data efficiently as on performing computation.

Look out for our upcoming LPU research.

The common thread is that all four techniques increase capability per dollar of infrastructure investment. In production systems, they are often combined, with distilled and quantized models deployed on memory-optimized hardware and further accelerated through speculative decoding. As AI matures, competitive advantage is likely to depend as much on optimization and operational efficiency as on model scale itself.

Blindspot: Agentic AI Means More than Compute

Many people assume that agentic AI simply increases compute demand. In reality, agentic systems often spend large portions of their time waiting on APIs, retrieval systems, orchestration logic, and subagents while GPUs remain allocated but underutilized.

The key metric becomes effective utilization, or GPU reservation efficiency: the proportion of time expensive infrastructure is performing useful computation.

The hospital equivalent is an operating theatre booked for six hours but performing surgery for only three. The room is occupied and billed for the entire period, even while waiting on lab results, patient transfers, or staff availability.

This matters because the economics of agentic AI are driven less by cost per token and more by cost per completed workflow. In tool-heavy environments, GPU utilization can fall well below 50%, making orchestration efficiency increasingly important.

The implication is that the economics of agentic AI are often misunderstood. Token-based pricing can significantly underestimate the true infrastructure cost of long-running workflows where resources remain reserved but underutilized.
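The gap between token-based cost and reserved-infrastructure cost can be illustrated with a simple calculation. The sketch assumes a GPU is reserved for the full duration of a workflow, including time spent waiting on tools, and that token-based accounting reflects only the active compute time. The durations and the hourly rate are hypothetical.

```python
def workflow_cost(active_seconds, wait_seconds, gpu_cost_per_hour):
    """Return (utilization, reserved_cost, active_cost) for one workflow."""
    reserved_seconds = active_seconds + wait_seconds
    utilization = active_seconds / reserved_seconds
    # Cost of holding the GPU for the whole workflow, busy or not.
    reserved_cost = reserved_seconds * gpu_cost_per_hour / 3600
    # Cost attributable to the time the GPU was actually computing.
    active_cost = active_seconds * gpu_cost_per_hour / 3600
    return utilization, reserved_cost, active_cost


# Hypothetical workflow: 30 s of GPU work, 60 s waiting on tools, at 3.00 per GPU-hour.
utilization, reserved_cost, active_cost = workflow_cost(30, 60, 3.00)
print(f"utilization: {utilization:.0%}")
print(f"cost of reserved time: {reserved_cost:.3f}")
print(f"cost of active time only: {active_cost:.3f}")
print(f"underestimate factor: {reserved_cost / active_cost:.1f}x")
```

In this example, accounting for active compute alone understates the infrastructure cost of the workflow by a factor equal to the inverse of utilization. Reducing tool latency therefore lowers cost per completed workflow directly.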

This creates both risk and opportunity. Organizations that rely on naive token economics may struggle, while those that improve orchestration efficiency, reduce tool latency, and increase active GPU utilization could gain a significant competitive advantage.

Inference Economics with Optimization

Look out for our research report: Agentic AI and the economics of waiting

Part Three: What This Means for Investors and Practitioners

For Investors

The AI infrastructure race is no longer just about building larger models. It is increasingly about delivering capability efficiently at scale.

The blind spots discussed throughout this paper suggest that memory bandwidth, networking, orchestration, and systems engineering are becoming as important as raw compute. MoE architectures depend heavily on interconnect quality, while distillation and other efficiency techniques are compressing the economic moat around frontier models faster than many investors appreciate.

At the same time, AI inference is fragmenting into distinct markets with very different economics. High-volume inference rewards efficiency, latency, and scale economics, while frontier reasoning workloads reward capability density and outcome quality. Agentic AI adds another layer of complexity by introducing orchestration and utilization constraints that cannot be solved simply by deploying more GPUs.

Increasingly, some of the most durable competitive advantages may reside in the less visible layers of the stack: memory architectures, networking fabrics, orchestration platforms, and infrastructure efficiency.

For Practitioners

The most important practical shift is to stop optimizing for benchmark performance and start optimizing for workload economics.

For many production deployments, the optimal solution is not the largest frontier model, but a carefully engineered combination of distillation, quantization, orchestration, and hardware optimization that delivers sufficient capability at a fraction of the cost.

Many of these techniques are already production-ready and capable of delivering meaningful structural advantages. At the same time, agentic systems introduce new utilization and orchestration challenges that make workflow efficiency as important as model performance.

Architectural decisions made today may be difficult to reverse tomorrow. MoE systems can introduce additional operational complexity, while dense architectures often remain simpler and more predictable in regulated or high-stakes environments. Increasingly, infrastructure design is becoming as important as the model itself.