The AI Inference Stack Is Starting to Look Familiar

Written by

Joel Pacheco Gonçalves

30 Jun, 2026 5 minutes

The infrastructure build-out for AI has dominated every conversation this year. But a quieter shift is already underway — and it’s happening inside buildings we already own.

In the first piece of this series, we looked at how AI was rewriting the infrastructure planning cycle — longer power lead times, new site selection criteria, a different pace of investment. In the second, the argument was that inference is a network problem as much as a compute problem. Proximity to users matters. Neighbors matter.

This piece takes that one step further. Because the conversation I kept having across industry events this year — from Honolulu to Bellevue — pushed me toward a conclusion I didn’t expect: the infrastructure bottleneck for AI inference isn’t the model or the hardware. It’s whether the places we need to run it are actually ready.

When Agents Start Asking the Questions

Most people still picture AI inference as a person typing a prompt and waiting a few seconds for a response. That framing made sense a year ago. It doesn’t capture what’s coming.

Anyone working with extended thinking models today knows the wait has already stretched. Opus runs complex reasoning tasks over minutes. Fable can take longer. Users have adapted — the response is better, the wait is part of the deal. That’s fine for a person making one request.

Agents don’t make one request. They chain them. One inference triggers another, which triggers an action, which triggers another inference. A coordinated multi-agent system makes hundreds of calls where a single user made one. The volume multiplies. And critically, the traffic pattern changes. Networks were built for asymmetry — lots of data flowing downstream to users, very little going back. Agents uploading context, pulling results, coordinating with each other — that changes the balance. The keynote at NANOG97 made this point directly: the number of inference requests will grow exponentially, and most of that growth won’t come from humans. It’ll come from agents.

Centralized infrastructure can absorb a lot. But agent-driven traffic at scale is a different load than anything the current architecture was sized for. That’s what moves edge inference from a nice-to-have to a structural requirement.

The Architecture Already Exists. The Infrastructure Doesn’t.

The two-tier model for inference isn’t hypothetical. It’s in production. Distributed deployments have demonstrated cost reductions of 76 to 86 percent for high-volume, short-context workloads compared to routing everything through centralized cloud. The models running at these edge nodes are compact and purpose-built — seven to fourteen billion parameters, quantized to run efficiently on air-cooled hardware. They handle the majority of queries. The harder, longer-context requests route back to centralized compute. It works exactly like content delivery: the edge handles what it can, the origin handles what it can’t.
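The routing decision described above can be sketched in a few lines. This is a minimal illustration, not a description of any production system: the `InferenceRequest` type, the `route` function, and the 4,096-token cutoff are all hypothetical names and values chosen for the example, since the article does not specify how the split is made.

```python
from dataclasses import dataclass

# Hypothetical cutoff: prompts longer than this are treated as "long-context".
# The article gives no threshold; 4096 is purely illustrative.
EDGE_MAX_CONTEXT_TOKENS = 4096


@dataclass
class InferenceRequest:
    prompt_tokens: int
    needs_extended_reasoning: bool = False


def route(request: InferenceRequest, edge_healthy: bool = True) -> str:
    """Return "edge" or "origin" for a request, CDN-style.

    The edge serves what it can (short context, no extended reasoning);
    everything else falls back to centralized compute.
    """
    if not edge_healthy:
        return "origin"
    if request.needs_extended_reasoning:
        return "origin"
    if request.prompt_tokens > EDGE_MAX_CONTEXT_TOKENS:
        return "origin"
    return "edge"


if __name__ == "__main__":
    print(route(InferenceRequest(prompt_tokens=512)))
    print(route(InferenceRequest(prompt_tokens=32_000)))
    print(route(InferenceRequest(prompt_tokens=512, needs_extended_reasoning=True)))
```

As with a CDN cache miss, the fallback path is always the origin, so an unhealthy or overloaded edge node degrades latency and cost rather than availability.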

But here’s where the CDN analogy breaks down. When CDNs scaled in the early 2000s, the physical layer was ready. Colocation facilities were already built, already connected, already powered. Operators dropped servers in and the edge layer grew fast.

The inference edge layer doesn’t have that head start. A standard colocation rack with a 15-kilowatt ceiling fits roughly three eight-GPU inference nodes, which occupy about twelve rack units in a 42-unit cabinet. The power budget runs out long before the space does, leaving more than two-thirds of the cabinet empty. Worse, inference clusters can’t be spread across rows or separate suites. The nodes operate as a single compute unit, connected by a high-bandwidth east-west fabric. Break that adjacency and you introduce the coordination overhead the architecture was designed to eliminate. The constraint isn’t just power. It’s the combination of power density, physical placement, and the interconnection fabric that ties it together.
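The rack arithmetic can be worked through directly. The sketch below takes the 15-kilowatt ceiling, the three-node count, and the 42-unit cabinet from the text; the per-node figures of 5 kW and 4 rack units are assumptions inferred from those numbers (15 kW across three nodes, twelve units across three nodes), not vendor specifications. The `rack_fit` helper is a hypothetical name.

```python
def rack_fit(rack_power_w: int, node_power_w: int,
             node_height_u: int, rack_height_u: int = 42) -> dict:
    """Work out how many nodes fit in a rack and which limit binds first."""
    by_power = rack_power_w // node_power_w    # nodes the power budget allows
    by_space = rack_height_u // node_height_u  # nodes the cabinet could hold
    nodes = min(by_power, by_space)
    used_u = nodes * node_height_u
    return {
        "nodes": nodes,
        "used_u": used_u,
        "space_utilization": used_u / rack_height_u,
        "limited_by": "power" if by_power < by_space else "space",
    }


if __name__ == "__main__":
    # Assumed per-node values: 5 kW and 4U, inferred from the article's figures.
    result = rack_fit(rack_power_w=15_000, node_power_w=5_000, node_height_u=4)
    print(result["nodes"], "nodes,", result["used_u"], "of 42U used,",
          f"{result['space_utilization']:.0%} of the cabinet,",
          "limited by", result["limited_by"])
```

Under these assumptions the cabinet could physically hold ten such nodes, but the power ceiling stops at three, which is why under 30 percent of the rack space ends up in use.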

What Comes Next

Building new, purpose-designed edge facilities sounds like the answer. It isn’t — at least not on any timeline that matches the pace of AI adoption. Transformer lead times in the U.S. have extended to as long as four years. Roughly seven gigawatts of planned data center capacity was delayed or canceled in 2026 alone. The supply chain for the equipment that powers dense facilities isn’t ready to move fast.

So the industry finds a different path, and it’s already visible. Models keep getting lighter — quantization lets larger models run on more modest hardware with each generation. Existing facilities get targeted power upgrades at high-value locations rather than ground-up builds. The architecture gets smarter about routing: more tiers, better orchestration between them, less reliance on any single layer to do everything.

The locations that sit at the intersection of those tiers — where edge inference nodes need to exchange traffic with the networks carrying it to users, and where the complex queries need a fast path back to centralized compute — become the strategic chokepoints. Not because of what they host. Because of how they connect.

Content delivery didn’t make the origin server obsolete. It made placement strategic. The same thing is happening to AI infrastructure. The edge inference layer is being built. The supply chain is the drag. And the locations that already have the fiber density, the network relationships, and the carrier presence are starting to look like the connective tissue that will make it possible.

Joel Pacheco Gonçalves

Head of Revenue
