AI Inference at Scale: Interview with Baseten CEO


📺 Today’s recommended deep-dive video: https://www.youtube.com/watch?v=XAbKflCncDo

Scaling the Frontier: Tuhin Srivastava on the Future of AI Inference and the Compute Crunch

As AI moves from experimental labs to massive production workloads, the infrastructure layer is hitting a wall of capacity constraints. Baseten CEO Tuhin Srivastava explains why custom models are winning the market and what a 30x growth trajectory looks like from inside the engine room of the AI revolution.

Core Question: How will companies navigate the extreme supply crunch of compute to build specialized, high-performance AI applications?

Highlights

Why 95% of high-scale inference tokens are moving toward custom-trained models rather than vanilla open source.
The brutal reality of the GPU supply crunch and why “slack compute” no longer exists in the current market.
The strategic link between post-training customization and inference performance in a competitive landscape.
Why the “application layer” survives by capturing unique user signals that frontier labs cannot easily access.

⏱️ Reading time: approx. 8 minutes · Saves you about 35 minutes vs. watching.


The Transition to Custom AI

Beyond Vanilla Open Source

The era of simply calling a general-purpose API is rapidly giving way to a more sophisticated, sovereign approach to machine intelligence.

While closed-source models still hold the frontier for general reasoning, Srivastava notes that 95% of Baseten’s workload consists of custom models. These aren’t just vanilla weights downloaded from Hugging Face; they are specialized tools refined with proprietary data to perform specific tasks with surgical precision.

Businesses are realizing that their competitive moat isn’t the model itself, but the unique “user signal” they capture through their specific software workflows. By integrating AI into clinician notes or customer support tickets, companies like Abridge and Decagon create a feedback loop that frontier labs cannot easily replicate. This signal allows for post-training that makes the model faster, cheaper, and more accurate for a narrow, high-value domain, effectively insulating the application layer from being “eaten” by the likes of OpenAI or Anthropic.

A process map diagram showing the 'User Signal Feedback Loop': User interacts with application -> application captures unique signal/data -> data is used for post-training/fine-tuning -> customized model is deployed back into the inference cloud -> improved performance/lower cost leads to more user engagement.
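
To make the feedback loop concrete, here is a minimal sketch of one way an application might turn user edits into post-training data. Everything in it is an assumption for illustration: the `Interaction` schema, the `to_preference_pairs` helper, and the sample notes are hypothetical, and the interview does not describe how Abridge, Decagon, or any Baseten customer actually does this.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One logged user interaction (hypothetical schema)."""
    prompt: str        # what the application sent to the model
    model_draft: str   # what the model produced
    user_final: str    # what the user actually kept after editing


def to_preference_pairs(interactions):
    """Turn user edits into (prompt, chosen, rejected) training examples.

    Drafts the user accepted unchanged carry no preference signal in this
    simple scheme, so they are skipped.
    """
    pairs = []
    for it in interactions:
        if it.user_final.strip() != it.model_draft.strip():
            pairs.append({
                "prompt": it.prompt,
                "chosen": it.user_final,
                "rejected": it.model_draft,
            })
    return pairs


log = [
    Interaction("Summarize visit", "Patient has flu.",
                "Patient has influenza A; started oseltamivir."),
    Interaction("Summarize visit", "Follow up in 2 weeks.",
                "Follow up in 2 weeks."),
]
pairs = to_preference_pairs(log)
print(len(pairs), "preference pair(s) captured")
```

The point of the sketch is that the edit itself is the signal: only the application sees both the draft and the correction, which is why a frontier lab cannot easily reproduce it.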

💡 Digging Deeper

Q: Why can’t frontier labs just build what application companies are building?
A: They lack the deep integration into specific workflows, such as a physician’s EMR (Electronic Medical Record) system, which provides the rare “reward signal” needed to train specialized models.

Q: Is the enterprise market finally coming online for AI?
A: By inference count, 99% of the market is still AI-native startups, meaning the massive wave of traditional enterprise adoption is still ahead of us.

Q: What is the primary driver for choosing a model today?
A: While cost is a factor, most high-growth companies prioritize capability first because that is where the economic growth is unlocked; optimization comes only after the value is proven.

The Infrastructure Reality Check

Navigating 18 Clouds

Baseten has built a “runtime fabric” that spans 18 different clouds and 90 clusters globally to find the compute its customers require.

The supply crunch is not just a narrative; it is a physical reality in which “slack compute” has effectively disappeared from the market. Baseten frequently operates at utilization rates in the mid-90% range, a level that leaves little headroom and would be uncomfortable for most traditional software companies.
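
One way to see why mid-90% utilization feels uncomfortable is a textbook queueing estimate. The sketch below assumes a single-server M/M/1 queue, which is a deliberate oversimplification: a real GPU fleet has many servers, batching, and autoscaling, and nothing here comes from the interview. It is only a directional illustration of how waiting time grows as headroom shrinks.

```python
def relative_latency(utilization: float) -> float:
    """Mean time in system for an M/M/1 queue, in units of the bare service time.

    T / S = 1 / (1 - rho), valid only for 0 <= rho < 1.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


for rho in (0.50, 0.80, 0.95):
    print(f"utilization {rho:.0%}: {relative_latency(rho):.1f}x service time")
```

Under this simplified model, going from 80% to 95% utilization quadruples the average time a request spends in the system, which is why running that hot demands constant operational attention.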

Managing this requires a shift from pure software engineering to a culture of intense operations. Srivastava describes a “standing 4:00 PM meeting” dedicated solely to capacity management, highlighting that in a world of constrained compute, the ability to simply secure and operationalize hardware is a primary strategic advantage. This has changed the very nature of the business, introducing complex working capital requirements and long-term contract structures that look more like heavy industry than traditional SaaS.

An architecture diagram showing the Baseten 'Runtime Fabric': A central control plane connects to 18 disparate cloud providers (AWS, GCP, and specialized GPU clouds) through a unified runtime layer that handles failover, latency optimization, and reliability across 90 global clusters.
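
The routing idea behind such a fabric can be sketched in a few lines. This is a toy model, not Baseten's implementation: the `Cluster` fields, the `pick_cluster` function, and the cluster names are all hypothetical, and a production scheduler would weigh many more factors (cost, data residency, model placement, warm capacity).

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str            # hypothetical identifier
    latency_ms: float    # measured round-trip latency to the caller
    free_gpus: int       # GPUs not currently serving traffic
    healthy: bool = True


def pick_cluster(clusters, gpus_needed):
    """Return the lowest-latency healthy cluster with enough free GPUs, or None."""
    candidates = [c for c in clusters if c.healthy and c.free_gpus >= gpus_needed]
    return min(candidates, key=lambda c: c.latency_ms, default=None)


fleet = [
    Cluster("cloud-a/us-east", latency_ms=12.0, free_gpus=0),   # closest, but full
    Cluster("cloud-b/us-west", latency_ms=35.0, free_gpus=8),
    Cluster("cloud-c/eu-west", latency_ms=90.0, free_gpus=16),
]

first = pick_cluster(fleet, gpus_needed=4)
print("primary:", first.name)

# Simulate an outage in the chosen cluster and route again (failover).
fleet[1].healthy = False
fallback = pick_cluster(fleet, gpus_needed=4)
print("failover:", fallback.name)
```

Note that the closest cluster is skipped because it has no free GPUs: with no slack compute in the market, capacity rather than proximity often decides where a request lands.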

💡 Digging Deeper

Q: How long are the contract lengths for the newest chips?
A: For high-demand chips like the B200, suppliers are often demanding 3-to-5-year contracts with significant upfront prepayments.

Q: Is GPU-as-a-service a viable long-term business?
A: Raw GPU access is a commodity; the “stickiness” comes from the software layer—the inference cloud—that manages the complexity of the models.

Talent and the Multi-Chip Future

The Link Between Training and Inference

The acquisition of the Parsed research team was a strategic move to bridge the gap between how a model is trained and how it is served.

Inference and post-training are two sides of the same coin. Decisions made during training, such as quantization techniques, directly dictate how efficiently a model will run in production. By owning both ends of this loop, Baseten can help customers move faster from “pre-product market fit” to “at-scale optimization.”
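
A rough back-of-the-envelope calculation shows why quantization choices matter so much at serving time. The sketch below counts only the memory needed to hold model weights; the 70-billion-parameter size is an illustrative assumption, and the estimate ignores the KV cache, activations, and the small overhead of quantization scales.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(70e9, precision)
    print(f"{precision}: {gb:.0f} GB of weights")
```

Halving the bytes per parameter halves the weight footprint, which can change how many GPUs a deployment needs. Whether a model tolerates that lower precision without losing accuracy depends on how it was trained, which is the link between the two halves of the loop.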

The Nvidia Moat

While the industry hopes for a multi-chip future, the reality is that Nvidia’s dominance is built on an ecosystem that is incredibly difficult to displace.

CUDA and the surrounding developer tools allow infrastructure companies to move at a speed that alternative chip providers simply cannot match. Srivastava points out that many competitors sabotage their own ecosystems by tying up 90% of their supply with a single buyer, which prevents the broader community from building the necessary software libraries. For now, “vegan inference” (inference on non-Nvidia chips) remains a niche compared to the massive momentum of the H100 and its successors.

A comparison table between Nvidia and Emerging Chip Rivals. Rows: Software Ecosystem (CUDA vs Proprietary), Supply Availability (High vs Restricted), Developer Momentum (High vs Low), and Time-to-Market (Fast vs Slow).

Key Takeaways

The AI market is undergoing a fundamental shift from general-purpose APIs to specialized, in-house intelligence. Companies that win in this era won’t just be the ones with the largest models, but those that can capture unique data signals and translate them into custom, high-performance inference workflows. This “sovereign AI” approach allows developers to own their margins and their intellectual property.

Operationally, the “compute crunch” has transformed AI infrastructure into a game of logistics and capital efficiency. Success now requires navigating a fragmented global supply chain of cloud providers while maintaining extreme utilization rates. The ability to abstract this complexity away through a unified software layer is what defines the next generation of cloud services.

Ultimately, the goal is to trigger Jevons Paradox: by making intelligence cheaper and more specialized, we won’t consume less of it—we will embed it into every workflow. We are moving toward a world of “concierge everything,” where personalized agents assist with healthcare, education, and professional tasks, fundamentally increasing the total amount of software in existence.
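
The Jevons argument can be illustrated with a simple constant-elasticity demand curve. This is a stylized economic model chosen for illustration, not something stated in the interview: the function `total_spend`, the elasticity values, and the 10x price drop are all assumptions.

```python
def total_spend(price: float, elasticity: float, k: float = 1.0) -> float:
    """Spend = price * quantity, with constant-elasticity demand q = k * price**(-elasticity)."""
    quantity = k * price ** (-elasticity)
    return price * quantity


# The price of a unit of inference falls 10x (1.0 -> 0.1).
for elasticity in (0.5, 1.0, 2.0):
    before = total_spend(1.0, elasticity)
    after = total_spend(0.1, elasticity)
    print(f"elasticity {elasticity}: total spend changes {after / before:.2f}x")
```

In this model, total spend grows after a price cut only when elasticity exceeds 1, meaning demand rises faster than price falls. The Jevons claim for AI is effectively a bet that demand for intelligence sits in that regime.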

Q&A

Q1: How much of Baseten’s workload is vanilla open-source models?
A: Less than 5%. Almost every customer at scale is running a modified or post-trained version of a model like Llama or Mistral to fit their specific needs.

Q2: What is the “User Signal” and why does it matter?
A: It is the data generated by users interacting with an application. Because frontier labs don’t have access to these private interactions, they can’t train models that are as specialized as those built by the application owners.

Q3: Is there any “slack” left in the GPU market?
A: Virtually none. Large clusters are running at mid-90% utilization, and finding new capacity requires looking across many different cloud providers globally.

Q4: Why is post-training talent so strategic right now?
A: Because post-training and inference are linked. How you train and quantize a model determines its latency and cost in production, making research expertise a core part of infrastructure.

Q5: What keeps a CEO of an inference cloud up at night?
A: Capacity. In a market this large and fast-moving, the biggest risk is not being aggressive enough in securing the hardware needed to meet demand.

Q6: Will there be fewer software engineers in the future?
A: No. We will likely just build a ton more software. AI tools mean we can tackle more problems, not that we will stop wanting new solutions.

Q7: What is “Jevons Paradox” in the context of AI?
A: It’s the idea that as we make intelligence cheaper and more efficient, the total demand for it will skyrocket rather than decrease, leading to AI being embedded everywhere.
