[{"@context":"https:\/\/schema.org\/","@type":"BlogPosting","@id":"https:\/\/blog.terabox.com\/insights\/curacion-datos-ia-ari-morcos-datology#BlogPosting","mainEntityOfPage":"https:\/\/blog.terabox.com\/insights\/curacion-datos-ia-ari-morcos-datology","headline":"Data Curation in AI: The Key to Efficient Models","name":"Data Curation in AI: The Key to Efficient Models","description":"\ud83d\udcfa Recommended study video for today: https:\/\/www.youtube.com\/watch?v=yXPPcBlcF8U The science of curation: why data is the ultimate multiplier for AI. From architecture to the \u201cBitter Lesson\u201d. The art of intelligent curation. Smaller models and the future of... ","datePublished":"2026-07-01","dateModified":"2026-07-01","author":{"@type":"Person","@id":"https:\/\/blog.terabox.com\/author\/flextech-admin\/#Person","name":"flextech-admin","url":"https:\/\/blog.terabox.com\/author\/flextech-admin\/","image":{"@type":"ImageObject","@id":"https:\/\/secure.gravatar.com\/avatar\/ad516503a11cd5ca435acc9bb6523536?s=150&#038;d=mm&#038;r=gforcedefault=1","url":"https:\/\/secure.gravatar.com\/avatar\/ad516503a11cd5ca435acc9bb6523536?s=150&#038;d=mm&#038;r=gforcedefault=1","height":96,"width":96}},"publisher":{"@type":"Organization","name":"terabox","logo":{"@type":"ImageObject","@id":"http:\/\/blog.terabox.com\/wp-content\/uploads\/2021\/11\/logo\u4ea7\u54c1\u540d-\u7ad6\u7248.png","url":"http:\/\/blog.terabox.com\/wp-content\/uploads\/2021\/11\/logo\u4ea7\u54c1\u540d-\u7ad6\u7248.png","width":900,"height":900}},"image":{"@type":"ImageObject","@id":"https:\/\/img.youtube.com\/vi\/yXPPcBlcF8U\/maxresdefault.jpg","url":"https:\/\/img.youtube.com\/vi\/yXPPcBlcF8U\/maxresdefault.jpg","height":"","width":""},"url":"https:\/\/blog.terabox.com\/insights\/curacion-datos-ia-ari-morcos-datology","video":{"@context":"http:\/\/schema.org\/","@type":"VideoObject","@id":"https:\/\/www.youtube.com\/watch?v=yXPPcBlc
F8U#VideoObject","contentUrl":"https:\/\/www.youtube.com\/watch?v=yXPPcBlcF8U","name":"Better Data is All You Need \u2014\u00a0Ari Morcos, Datology","description":"Our chat with Ari shows that *data curation is the most impactful and underinvested area in AI.* He argues that the prevailing focus on model architecture and compute scaling overlooks the \"bitter lesson\" that *\"models are what they eat.\"* Effective data curation\u2014a sophisticated process involving filtering, rebalancing, sequencing (curriculum), and synthetic data generation\u2014allows for training models that are simultaneously *faster, better, and smaller.* Ari recounts his personal journey from focusing on model-centric inductive biases to realizing that data quality is the primary lever for breaking the diminishing returns of naive scaling laws. Datology's mission is to automate this complex curation process, making state-of-the-art data accessible to any organization and enabling a new paradigm of AI development where data efficiency, not just raw scale, drives progress.\n\n*Timestamps*\n00:00 Introduction\n00:46 What is Datology? The mission to train models faster, better, and smaller through data curation. \n01:59 Ari's background: From neuroscience to realizing the \"Bitter Lesson\" of AI. \n05:30 Key Insight: Inductive biases from architecture become less important and even harmful as data scale increases. \n08:08 Thesis: Data is the most underinvested area of AI research relative to its impact. \n10:15 Why data work is culturally undervalued in research and industry. \n12:19 How self-supervised learning changed everything, moving from a data-scarce to a data-abundant regime. \n17:05 Why automated curation is superior to human-in-the-loop, citing the DCLM study. \n19:22 The \"Elephants vs. Dogs\" analogy for managing data redundancy and complexity. \n22:46 A brief history and commentary on key datasets (Common Crawl, GitHub, Books3). 
\n26:24 Breaking naive scaling laws by improving data quality to maintain high marginal information gain. \n29:07 Datology's demonstrated impact: Achieving baseline performance 12x faster. \n34:19 The business of data: Datology's moat and its relationship with open-source datasets. \n39:12 Synthetic Data Explained: The difference between risky \"net-new\" creation and powerful \"rephrasing.\" \n49:02 The Resurgence of Curriculum Learning: Why ordering data matters in the underfitting regime. \n52:55 The Future of Training: Optimizing pre-training data to make post-training more effective. \n54:49 Who is training their own models and why (Sovereign AI, large enterprises). \n57:24 \"Train Smaller\": Why inference cost makes smaller, specialized models the ultimate goal for enterprises. \n01:00:19 The problem with model pruning and why data-side solutions are complementary. \n01:03:03 On finding the smallest possible model for a given capability. \n01:06:49 Key learnings from the RC foundation model collaboration, proving that data curation \"stacks.\" \n01:09:46 Lightning Round: What data everyone wants & who should work at Datology. \n01:14:24 Commentary on Meta's superintelligence efforts and Yann LeCun's role.","thumbnailUrl":["https:\/\/i.ytimg.com\/vi\/yXPPcBlcF8U\/default.jpg","https:\/\/i.ytimg.com\/vi\/yXPPcBlcF8U\/mqdefault.jpg","https:\/\/i.ytimg.com\/vi\/yXPPcBlcF8U\/hqdefault.jpg","https:\/\/i.ytimg.com\/vi\/yXPPcBlcF8U\/sddefault.jpg","https:\/\/i.ytimg.com\/vi\/yXPPcBlcF8U\/maxresdefault.jpg"],"uploadDate":"2025-08-29T18:44:51+00:00","duration":"PT1H18M43S","embedUrl":"https:\/\/www.youtube.com\/embed\/yXPPcBlcF8U","publisher":{"@type":"Organization","@id":"https:\/\/www.youtube.com\/channel\/UCxBcwypKK-W3GHd_RZ9FZrQ#Organization","url":"https:\/\/www.youtube.com\/channel\/UCxBcwypKK-W3GHd_RZ9FZrQ","name":"Latent Space","description":"The podcast & newsletter where 170,000+ AI Engineers gather to talk models, tools and ideas. 
Breaking news today you will use at work tomorrow! Full show notes and newsletter at https:\/\/latent.space","logo":{"url":"https:\/\/yt3.ggpht.com\/pSTHcffCXEverYEPdjM0iIRPH-IUT4d2biIMZ_Z7bhyf6sME-laFer9vEfpFbM5tqFYJV-UsLQ=s800-c-k-c0x00ffffff-no-rj","width":800,"height":800,"@type":"ImageObject","@id":"https:\/\/www.youtube.com\/watch?v=yXPPcBlcF8U#VideoObject_publisher_logo_ImageObject"}},"potentialAction":{"@type":"SeekToAction","@id":"https:\/\/www.youtube.com\/watch?v=yXPPcBlcF8U#VideoObject_potentialAction","target":"https:\/\/www.youtube.com\/watch?v=yXPPcBlcF8U&t={seek_to_second_number}","startOffset-input":"required name=seek_to_second_number"},"interactionStatistic":[{"@type":"InteractionCounter","@id":"https:\/\/www.youtube.com\/watch?v=yXPPcBlcF8U#VideoObject_interactionStatistic_WatchAction","interactionType":{"@type":"WatchAction"},"userInteractionCount":7028},{"@type":"InteractionCounter","@id":"https:\/\/www.youtube.com\/watch?v=yXPPcBlcF8U#VideoObject_interactionStatistic_LikeAction","interactionType":{"@type":"LikeAction"},"userInteractionCount":153}]},"about":["Insights","\u300eSpanish\u300f"],"wordCount":1598},{"@context":"https:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"name":"Insights","item":"https:\/\/blog.terabox.com\/insights\/#breadcrumbitem"},{"@type":"ListItem","position":2,"name":"Data Curation in AI: The Key to Efficient Models","item":"https:\/\/blog.terabox.com\/insights\/curacion-datos-ia-ari-morcos-datology#breadcrumbitem"}]}]