How OpenAI and Anthropic Build Great AI Products

📺 Today’s recommended deep-dive video: https://www.youtube.com/watch?v=IxkvVZua28k


The Intelligence Interface: Redefining Product Management at OpenAI and Anthropic

Two of the most influential product leaders in tech, both Instagram veterans, have traded social feeds for large language models. They discuss how product development fundamentally changes when the underlying technology moves faster than the products built on top of it, shifting the focus from UI design to the rigorous practice of writing “evals.”

Core Question: How do you build stable, delightful products on top of non-deterministic, rapidly evolving artificial intelligence?

Highlights

  • Product Management is shifting from UI design to the specialized technical skill of writing and grading model “evals.”
  • The 60% Success Threshold: Why products can be economically valuable long before they reach perfect accuracy.
  • Scaling Intelligence: The shift from “System 1” instant responses to “System 2” reasoning-time compute.
  • The rise of “computer use” and proactive assistance as the next major shifts in the human-computer interface.

⏱️ Reading time: approx. 5 minutes · Saves you about 36 minutes vs. watching.


The New Product Frontier

Navigating the Moving Technology Base

Transitioning from social media to the AI frontier means trading stable software platforms for a reality where computers gain world-changing capabilities every few months. In a traditional software role, you build on a fixed technology base, but in AI, the “ground” is constantly shifting beneath your feet as new emergent properties appear in the models.

Product development in this environment feels like peering through a thick mist, trying to discern which capabilities are real and which are merely statistical artifacts of the training process. You might start training a model expecting it to excel at task X, only to find that the research team had already unlocked task Y three months earlier without realizing its commercial significance. This stochastic nature forces a radical departure from the rigid, 60-day roadmap planning common in enterprise software, favoring instead a philosophy of discovery and “Zen-like” flexibility.

Success in this space requires a childlike delight in learning combined with the stamina to handle a sleepless, high-stakes iteration cycle.

Flowchart comparing Traditional Software Development (Fixed Tech -> UI Design -> User) vs. AI Product Development (Research -> Emergent Capability -> Evals -> Stochastic Output -> User)

💡 Digging Deeper

Q: How does the Enterprise feedback loop differ from consumer social apps?
A: Enterprise users have a much higher financial incentive to tell you exactly where your product sucks, whereas consumer feedback is often gathered through aggregate data science.

Q: Can you plan a roadmap in this environment?
A: You can “squint” to see the general slope of intelligence, but specific capabilities are often late-breaking discoveries rather than planned milestones.


The Rise of the Eval-Led PM

Bridging the Gap Between Research and Utility

Writing and grading evaluations—or “evals”—has rapidly evolved into the most critical skill for any product manager attempting to build on top of large language models. Internally, the distinction between “research PMs” and “surface PMs” is disappearing, as the quality of a feature is now gated entirely by how well the team can define and measure success.
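To make "define and measure success" concrete, here is a minimal sketch of what an eval harness could look like. It is an illustration, not the tooling either company uses: `EvalCase`, `run_eval`, `exact_match`, and `toy_model` are hypothetical names, and the toy model stands in for a real API call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    grader: Callable[[str, str], bool] = exact_match,
):
    """Return (pass_rate, failures) so the failures can be reviewed by hand."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    failures = []
    for case in cases:
        output = model(case.prompt)
        if not grader(output, case.expected):
            failures.append((case, output))
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


# Hypothetical stand-in for a real model call.
def toy_model(prompt: str) -> str:
    canned = {"2 + 2": "4", "Capital of France": "Paris"}
    return canned.get(prompt, "I don't know")


CASES = [
    EvalCase("2 + 2", "4"),
    EvalCase("Capital of France", "paris"),
    EvalCase("Capital of Peru", "Lima"),
]

pass_rate, failures = run_eval(toy_model, CASES)
print(f"pass rate: {pass_rate:.0%}")
for case, output in failures:
    print(f"FAILED {case.prompt!r}: got {output!r}, wanted {case.expected!r}")
```

Returning the failing cases alongside the pass rate is a deliberate choice: the aggregate number tells you whether a feature is ready, while the raw failures tell you what to fix.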

The industry is currently hitting a “Mendoza Line” of roughly 60% success for many complex agentic tasks. While a 60% success rate sounds low for traditional software, it is often more than enough to create massive economic value, provided the product is designed with a graceful “human-in-the-loop” failure state. For example, GitHub Copilot was built on early models that were far from perfect, yet it saved developers enough typing time to become an essential tool.
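A rough back-of-the-envelope model shows why a 60% success rate can still pay off. The sketch below assumes a human reviews every AI attempt and redoes the task manually when the attempt fails. The function name and all of the numbers (a 30-minute task, a 5-minute review) are illustrative assumptions, not figures from the talk.

```python
def expected_minutes_saved(
    success_rate: float, manual_minutes: float, review_minutes: float
) -> float:
    """Expected time saved per task when a human reviews every AI attempt.

    On success the human only spends review_minutes.
    On failure the human spends review_minutes, then does the task manually.
    """
    with_ai = review_minutes + (1 - success_rate) * manual_minutes
    return manual_minutes - with_ai


# Assumed example: a 30-minute task that takes 5 minutes to review.
for rate in (0.4, 0.6, 0.8):
    saved = expected_minutes_saved(rate, manual_minutes=30, review_minutes=5)
    print(f"{rate:.0%} success: {saved:+.1f} min saved per task")
```

Under these assumptions the saving simplifies to `success_rate * manual_minutes - review_minutes`, so the product breaks even once the success rate exceeds the ratio of review time to manual time. A cheap, graceful failure state lowers that bar; an expensive one raises it.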

Models are frequently not limited by their inherent intelligence, but rather by the quality of the evals used to teach them specific, high-value behaviors.

Bar chart showing "Task Economic Value" vs. "Model Accuracy," highlighting the 'Utility Zone' where human-AI collaboration succeeds even at 60-80% accuracy

💡 Digging Deeper

Q: How do you develop intuition for writing good evals?
A: Use the models themselves to critique your prompts and look deeply at raw data; nothing beats manually reviewing the cases where a model fails.

Q: What happens when models become smarter than the humans grading them?
A: This is a looming challenge; we are moving toward “softer” grading where we judge if a model met a competent expectation rather than just checking if a math answer is right or wrong.


Agents, Reasoning, and Computer Use

Moving From Completion to Co-Thinking

We are currently entering the “GPT-1 phase” of a new form of intelligence scaling known as reasoning-time compute, or System 2 thinking. Instead of giving an instant, “System 1” gut reaction, the model pauses to form hypotheses, refute them, and iterate internally before providing a response. This allows the model to tackle scientific breakthroughs and complex puzzles that require more than just simple text completion.

Beyond just thinking, the next horizon is “computer use,” where models interact with the same UIs that humans do to eliminate digital drudgery. Imagine a model that doesn’t just write a PRD, but opens your browser, navigates to your internal finance tools, and fills out the repetitive forms that usually take thirty clicks to complete. This transition from “chatbot” to “agent” will require users to become comfortable with asynchronous interactions, where you give a command and wait an hour for a comprehensive result.

This shift will fundamentally break the 25-year-old intuition that computers should provide immediate, deterministic outputs for every input.

Architecture diagram showing the flow of a 'Reasoning Model' (Input -> Internal Hypothesis Loop -> Verification -> Final Output) vs a 'Traditional LLM' (Input -> Immediate Token Stream)

💡 Digging Deeper

Q: What is the most surprising use of “computer use” so far?
A: Using the AI to perform UI testing; models are excellent at determining if a button move broke the intended user flow without needing brittle, hand-coded scripts.

Q: How should teams use models like o1 versus GPT-4o?
A: Treat it as an orchestration problem: use faster models for simple tasks and “reasoning” models for workflows that require deep logic and verification.
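The orchestration idea in the last answer can be sketched as a small router. This is a minimal illustration under stated assumptions: `fast_model` and `reasoning_model` are hypothetical stubs rather than real API clients, and the keyword heuristic is a deliberately crude placeholder for whatever classifier a real team would use.

```python
from typing import Callable

# Hypothetical stand-ins for real model clients; swap in actual API calls.
def fast_model(task: str) -> str:
    return f"[fast] {task}"


def reasoning_model(task: str) -> str:
    return f"[reasoning] {task}"


# Assumed heuristic: words that hint a task needs multi-step logic.
REASONING_HINTS = {"prove", "debug", "plan", "verify", "analyze"}


def needs_reasoning(task: str) -> bool:
    """Crude check: does any whole word in the task match a hint?"""
    words = set(task.lower().split())
    return bool(words & REASONING_HINTS)


def route(task: str) -> str:
    """Send the task to the slower reasoning model only when it seems needed."""
    model: Callable[[str], str] = (
        reasoning_model if needs_reasoning(task) else fast_model
    )
    return model(task)


print(route("Summarize this email"))
print(route("Debug this failing migration"))
```

Keeping the routing decision in one small function makes it easy to replace the keyword check later, for example with a cheap model call that classifies the task.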


Key Takeaways

Building in the AI era requires a fundamental shift in how we perceive the “personality” of software. Unlike the static buttons of the mobile era, AI models possess distinct temperaments—some are smarter but distant, others are empathetic but less capable—and managing this “Model Behavior” has become a core product role. As models begin to remember our past interactions and offer proactive help, the relationship between human and computer is becoming increasingly interpersonal.

The speed of user adaptation remains the most shocking variable in the equation. Just as we quickly became bored with the magic of self-driving cars, the “mind-blowing” AI capabilities of today will be considered “garbage” twelve months from now. We are moving toward a future of universal translators and proactive agents that anticipate our needs before we even ask, permanently altering the way the next generation interacts with the world.


Q&A

Q1: How do you handle the non-deterministic nature of AI in product design?
A: You have to get “Zen” about letting go of total control. Instead of fixing every bug, you build feedback loops and guardrails that help the model self-correct when it goes astray.

Q2: What is the most important skill for a PM to learn in 2025?
A: Prototyping with the models themselves and going deeper into the technical stack to understand how post-training and fine-tuning influence the final user experience.

Q3: Is AI intelligence-limited or eval-limited right now?
A: In many cases, it is eval-limited. Models often have the latent intelligence to solve a problem but haven’t been taught how to apply it to a specific task through rigorous evaluation.

Q4: How will “proactivity” change our daily workflows?
A: Models will shift from reactive boxes to proactive assistants that read your email (with permission) and prepare research or drafts for your next meeting before you even open your laptop.

Q5: How are children interacting differently with these models?
A: Kids are “AI-native”; they don’t just consume content, they co-create it, asking models to generate custom stories and images in real-time, treating the AI as a collaborative partner.

Q6: What is the “Star Trek” moment for AI right now?
A: Advanced voice mode acting as a seamless, real-time universal translator, allowing two people with no common language to hold a complex business conversation.
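The "feedback loops and guardrails" from Q1 can be illustrated with a small retry loop. The sketch below is one possible shape, not a description of how either company implements it: `validate_json_reply`, `call_with_guardrail`, and `flaky_model` are hypothetical names, and the requirement that replies be a JSON object with an "answer" key is an assumption chosen to keep the example short.

```python
import json
from typing import Callable, Optional


def validate_json_reply(reply: str) -> Optional[str]:
    """Guardrail: return an error message, or None if the reply is acceptable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        return f"Reply was not valid JSON: {exc.msg}"
    if not isinstance(data, dict) or "answer" not in data:
        return 'Reply must be a JSON object with an "answer" key.'
    return None


def call_with_guardrail(
    model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[dict]:
    """Retry the model, feeding each rejection back into the next prompt."""
    feedback = ""
    for _ in range(max_attempts):
        reply = model(prompt + feedback)
        error = validate_json_reply(reply)
        if error is None:
            return json.loads(reply)
        # Feed the failure back so the next attempt can self-correct.
        feedback = f"\n\nYour previous reply was rejected: {error}"
    return None  # Caller falls back to a human or a default path.


# Hypothetical flaky model: fails once, then complies after seeing feedback.
def flaky_model(prompt: str) -> str:
    if "rejected" in prompt:
        return '{"answer": "42"}'
    return "Sure! The answer is 42."


result = call_with_guardrail(flaky_model, "Reply in JSON. What is 6 * 7?")
print(result)
```

Returning `None` after the final attempt, rather than raising, leaves room for the graceful human-in-the-loop fallback discussed earlier in the article.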
