
📺 Today’s recommended deep-dive video: https://www.youtube.com/watch?v=yXPPcBlcF8U
Models Are What They Eat: The Frontier of Data Curation
While the AI industry obsesses over trillion-parameter architectures and massive GPU clusters, the real performance gains are hiding in the data loader. Ari Morcos, CEO of DatologyAI, argues that the “bitter lesson” of AI is that inductive biases matter less than the quality and diversity of the information we feed our models.
Core Question: How can automated data curation bend the scaling laws to train models that are faster, smaller, and more performant than those trained on raw web scrapes?
Highlights
- The 1M-Example Threshold: Inductive biases (like convolution) help in small-data regimes but become harmful once a model sees more than a million data points.
- Curation over Filtering: Effective data management isn’t just about deleting “junk”; it requires rebalancing concepts, sequencing curricula, and rephrasing for learnability.
- The Failure of Human Intuition: Expert researchers cannot predict which data points a model will find useful better than a coin flip; automated valuation is a necessity.
- Bending the Power Law: By focusing on the “marginal information gain” per token, researchers can move beyond diminishing returns and maintain linear performance gains.
⏱️ Reading time: approx. 8 minutes · Saves you about 71 minutes vs. watching.
From Neuroscience to the Science of Data
The Bitter Lesson Re-Learned
Ari Morcos began his career in neuroscience, teaching mice to count while analyzing neural dynamics. This empirical background shaped his view of deep learning not as a branch of provable computer science, but as an observational science where properties are emergent and often unexpected.
Initially, Morcos focused on building “soft inductive biases”: architecture tweaks that help models learn specific patterns. One notable experiment was the ConViT paper, in which a Vision Transformer was initialized to behave like a Convolutional Neural Network (CNN). The goal was to give the model a head start while allowing it to “unlearn” the bias if necessary.
The results were a wake-up call.
While the bias helped significantly in low-data regimes (under 500,000 points), its benefits decayed as the data scaled. Once the model passed a million data points, the hand-crafted bias actually became a hindrance. At that scale, the data, not the architecture, was what mattered.
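The idea of a bias the model can “unlearn” can be illustrated with a gate that blends a fixed positional prior with learned content attention. The following is a minimal 1D sketch under stated assumptions, not the paper's implementation: the function names, the quadratic locality score, and the single scalar gate are all chosen here for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_row(content_scores, query_pos, target_offset, locality, gate_logit):
    """Attention weights for one query over a 1D sequence (illustrative only).

    content_scores: raw query-key scores, the learned data-driven part.
    target_offset:  relative position this head is biased to attend to.
    locality:       how sharply the positional prior is peaked.
    gate_logit:     learnable scalar; sigmoid(gate_logit) weights the prior.
    """
    n = len(content_scores)
    # Positional prior: highest score at query_pos + target_offset.
    pos_scores = [-locality * ((j - query_pos) - target_offset) ** 2 for j in range(n)]
    content_attn = softmax(content_scores)
    pos_attn = softmax(pos_scores)
    g = sigmoid(gate_logit)
    # Convex blend of two distributions, so the row still sums to 1.
    return [(1.0 - g) * c + g * p for c, p in zip(content_attn, pos_attn)]

content = [0.0, 0.0, 0.0, 0.0, 3.0]  # content alone prefers the last position
conv_like = gated_attention_row(content, 2, -1, 5.0, gate_logit=4.0)   # prior dominates
unlearned = gated_attention_row(content, 2, -1, 5.0, gate_logit=-4.0)  # prior switched off
```

With a high gate value the head behaves like one tap of a convolution kernel (it looks one position to the left); as training pushes the gate down, the same head becomes ordinary content-based attention.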

💡 Digging Deeper
Q: Why is data research underinvested compared to hardware?
A: Historically, data work was viewed as “plumbing” or “grunt work” rather than high-prestige science, leading researchers to focus on architecture tweaks that often have lower marginal impact.
Q: What is the primary assumption of the Chinchilla scaling laws?
A: They assume IID (Independent and Identically Distributed) data, essentially treating every token as equal, which ignores the reality of redundancy and quality variance.
Q: Can a transformer emulate a CNN exactly?
A: Yes, by mapping specific heads to kernel parts and imposing weight tying, a transformer can be mathematically identical to a CNN at initialization.
The Art and Science of Curation
Beyond the Delete Button
Most people mistake data curation for simple filtering—the act of removing “junk” or “Not Safe For Work” content. While filtering is a component, true curation involves complex rebalancing. If you have 10,000 summaries of Hamlet, any single one might be high-quality, but keeping all of them in a training set is actively harmful to efficiency.
The value of a data point is not an intrinsic property; it is a function of how that point relates to every other point in the set.
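The Hamlet example can be made concrete with a toy near-duplicate filter, in which a document's value depends on what has already been kept. This is a sketch under simplifying assumptions: the token-set Jaccard similarity, the `threshold` value, and the function names are illustrative choices, and a production pipeline would more likely compare embeddings or use approximate hashing.

```python
def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def greedy_dedup(docs, threshold=0.6):
    """Keep a document only if it is not too similar to any document already kept.

    Note: this is O(n^2) in the number of documents, so it is only a toy.
    """
    kept, kept_tokens = [], []
    for doc in docs:
        tokens = set(doc.lower().split())
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(doc)
            kept_tokens.append(tokens)
    return kept

docs = [
    "hamlet is a prince who avenges his father",
    "hamlet is a prince who avenges his dead father",  # near-duplicate of the first
    "elephants are large mammals with trunks",
]
unique_docs = greedy_dedup(docs)
```

The second summary is perfectly good in isolation; it is dropped only because the first one already covers it.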
Consider the difference between “elephants” and “dogs.” Elephants are relatively stereotyped—there are only two main types, and they generally look similar. You don’t need much data to master the concept. Dogs, however, vary wildly in size, texture, and breed. A model requires a much higher volume and diversity of dog data to achieve the same level of conceptual mastery.
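One way to picture this kind of rebalancing is to give each concept a share of the training budget in proportion to how varied its examples are. The sketch below is purely illustrative: the 2D feature vectors are made up, and using mean distance from the centroid as a diversity measure is an assumption made here, not a method described in the talk.

```python
import math

def spread(vectors):
    """Mean Euclidean distance of each vector from the concept centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return sum(math.dist(v, centroid) for v in vectors) / len(vectors)

def allocate_budget(concepts, total):
    """Split `total` training examples across concepts in proportion to spread.

    Rounding means the shares may not sum exactly to `total`.
    """
    spreads = {name: spread(vecs) for name, vecs in concepts.items()}
    norm = sum(spreads.values())
    if norm == 0:
        # All concepts are perfectly uniform: fall back to an even split.
        return {name: round(total / len(concepts)) for name in concepts}
    return {name: round(total * s / norm) for name, s in spreads.items()}

# Hypothetical embeddings: elephants cluster tightly, dogs are scattered.
concepts = {
    "elephant": [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1)],
    "dog": [(0.0, 0.0), (3.0, 1.0), (1.0, 4.0)],
}
budget = allocate_budget(concepts, total=1000)
```

Here the widely scattered “dog” examples receive most of the budget, while the tightly clustered “elephant” examples need only a small share.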
The Limits of Human Expertise
A fascinating study in the DCLM (DataComp-LM) project showed that even the world’s top NLP graduate students could not predict which data points a classifier would keep or reject better than random chance. Humans simply cannot hold a trillion-token distribution in their heads to judge “marginal information gain.”
Automation isn’t just a scaling necessity; it is a quality necessity.

Synthetic Data and the “Data Wall”
Rephrasing vs. Distillation
There is a growing fear of “model collapse”: the idea that training AI on AI-generated data leads to a spiral of stupidity. Morcos differentiates between two types of synthetic data to address this. The first is “distillation in disguise,” where a model generates net-new facts. This carries high risk because the generator over-represents the “modes” (its most common outputs) and neglects the “tails” (rare outliers), so the training data loses diversity.
The second, more promising approach is “rephrasing.”
In this paradigm, the knowledge comes from a raw data source, but a model rewrites it to be more learnable. This is essentially “cleaning” the information, making it more accessible to the learner. Because the rephrasing model only needs to know how to write, not necessarily to understand the deep content, a weaker model can often generate data that teaches a stronger model.
Breaking Scaling Laws
The ultimate goal of DatologyAI is to “bend” the power-law scaling curve. In a standard setup, every 10x increase in data yields diminishing returns because the model learns less from each successive token. However, if curation can keep the “marginal information gain” flat, meaning every token seen is as informative as the first, the scaling law becomes linear. This allows for training models that are 10x faster or significantly more capable for the same compute budget.
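The argument can be written out under one simplifying assumption: that loss follows a power law in the number of training tokens. The symbols below (loss L, token count D, irreducible loss E, constants B, β, and c, starting point D_0) are introduced here for illustration and do not come from the talk.

```latex
\begin{align*}
  L(D) &= E + \frac{B}{D^{\beta}}, \qquad B > 0,\ \beta > 0
  && \text{(assumed power law in data)} \\
  -\frac{dL}{dD} &= \frac{\beta B}{D^{\beta + 1}}
  && \text{(gain per extra token shrinks as } D \text{ grows)} \\
  -\frac{dL}{dD} &= c \;\Longrightarrow\; L(D) = L(D_0) - c\,(D - D_0)
  && \text{(constant gain gives a linear curve)}
\end{align*}
```

The last line is an idealization that can hold only until the loss reaches the irreducible floor E; a realistic target is a flatter curve rather than a literally linear one.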

The Shift to the “Cognitive Core”
Small Models, Mile Deep
The future of the enterprise is not a trillion-parameter “god model” that knows everything. Instead, it is a fleet of smaller models, with single-digit billions of parameters, that are “an inch wide and a mile deep.” These models are optimized for specific tasks where 99.999% reliability is required and inference costs must be minimized.
Inference is the silent killer of AI budgets.
If a company spends $50 million a year on inference, and inference cost scales roughly linearly with model size, deploying a model that is twice as large as necessary is a $25 million mistake. Training a smaller, curated model from scratch or via continued pre-training is often a “no-brainer” investment that pays for itself within months.
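The arithmetic behind that claim is easy to check. The sketch below assumes inference cost is proportional to parameter count, and the $5 million training cost is a hypothetical figure used only to show how a payback period would be computed.

```python
def oversizing_cost(annual_inference_spend, size_ratio):
    """Annual spend attributable to serving a model `size_ratio` times larger than needed.

    Assumption: inference cost scales linearly with model size.
    """
    right_sized_spend = annual_inference_spend / size_ratio
    return annual_inference_spend - right_sized_spend

def payback_months(training_cost, annual_savings):
    """Months of savings needed to recover a one-off training cost."""
    return 12 * training_cost / annual_savings

waste = oversizing_cost(50_000_000, size_ratio=2.0)  # figure from the text above
months = payback_months(5_000_000, waste)            # hypothetical $5M training run
print(f"Wasted per year: ${waste:,.0f}; payback in {months:.1f} months")
```

Under these assumptions the smaller model recovers its training cost in well under a year, which is the sense in which the investment is called a “no-brainer.”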
Tool Use over Knowledge Storage
Models waste massive amounts of capacity memorizing facts that could be easily retrieved via a search tool. Morcos envisions a “cognitive core”: a model with minimal world knowledge but maximal reasoning and tool-use capabilities. By stripping out the “dead weight” of memorized facts through curation, we can shrink models even further without sacrificing their utility as agents.

Key Takeaways
Data is the most underinvested area of AI research relative to its impact. For years, the industry relied on “more is better,” but we are hitting a “data wall” where simply crawling more of the web provides zero marginal benefit. The transition from raw scraping to intelligent curation is the next major frontier in machine learning.
The economics of AI are shifting from training costs to inference costs. As organizations move from research to production, the demand for “small but mighty” models will skyrocket. These models can only be built through meticulous data curation, ensuring that every parameter is used for reasoning rather than redundant memorization.
Ultimately, the goal of a data-centric approach is to make high-quality model training accessible. By automating the valuation of data, we can enable organizations to train frontier-level models on their own proprietary data without needing a team of 100 PhDs to babysit the process.
Q&A
Q1: Is the Transformer the “final” architecture?
A: Not necessarily. While it’s a great advance, its success is largely due to its compatibility with self-supervised learning on unlabeled data. There are likely many equivalently good architectures we haven’t explored because we are so focused on this one.
Q2: Does more redundancy help a model generalize?
A: Some redundancy is essential for the model to see a concept in different contexts, but “infinite redundancy” (seeing the exact same data 10,000 times) leads to wasted capacity and overfitting.
Q3: Can we “align” a model during pre-training?
A: Yes. Morcos argues that if you align a model during the pre-training phase by curating the data it sees, that alignment is much harder to “break” or “unlearn” through malicious fine-tuning later.
Q4: Why don’t big labs share their data curation secrets?
A: Because data is the most significant moat they have. While they might share architectural details or weights, the specific “recipe” of how they filtered and mixed their trillions of tokens is their most valuable intellectual property.
Q5: What is the biggest predictor of a good data researcher?
A: A willingness to actually “look at the data.” Many talented researchers treat data as a black box; the best ones are those who spend hours manually inspecting examples to understand why a model is failing.
Q6: Is parameter pruning dead?
A: No, but it is difficult to realize the compute gains on current hardware: GPUs are not well suited to sparse matrix multiplication. Curation-based “smaller training” is a more direct path to inference efficiency today.
Q7: Will we see trillion-parameter models in the future?
A: While they will exist, the focus will likely shift to optimizing the inference of much smaller models, especially as “test-time compute” (reasoning steps) becomes a dominant paradigm.
