[{"@context":"https:\/\/schema.org\/","@type":"BlogPosting","@id":"https:\/\/blog.terabox.com\/insights\/benchmarks-ia-evaluacion-time-horizons-meter#BlogPosting","mainEntityOfPage":"https:\/\/blog.terabox.com\/insights\/benchmarks-ia-evaluacion-time-horizons-meter","headline":"AI Benchmarks: How to Measure Real Capabilities","name":"AI Benchmarks: How to Measure Real Capabilities","description":"\ud83d\udcfa Recommended study video for today: https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE Are we measuring AI wrong? Why \u201cHuman Hours\u201d are the new metric of progress. The collapse of traditional benchmarks. The \u201cHuman-Hours\u201d metric (Time Horizons). Agents, Code and the Mirage of... ","datePublished":"2026-06-24","dateModified":"2026-06-24","author":{"@type":"Person","@id":"https:\/\/blog.terabox.com\/author\/flextech-admin\/#Person","name":"flextech-admin","url":"https:\/\/blog.terabox.com\/author\/flextech-admin\/","image":{"@type":"ImageObject","@id":"https:\/\/secure.gravatar.com\/avatar\/ad516503a11cd5ca435acc9bb6523536?s=150&#038;d=mm&#038;r=gforcedefault=1","url":"https:\/\/secure.gravatar.com\/avatar\/ad516503a11cd5ca435acc9bb6523536?s=150&#038;d=mm&#038;r=gforcedefault=1","height":96,"width":96}},"publisher":{"@type":"Organization","name":"terabox","logo":{"@type":"ImageObject","@id":"http:\/\/blog.terabox.com\/wp-content\/uploads\/2021\/11\/logo\u4ea7\u54c1\u540d-\u7ad6\u7248.png","url":"http:\/\/blog.terabox.com\/wp-content\/uploads\/2021\/11\/logo\u4ea7\u54c1\u540d-\u7ad6\u7248.png","width":900,"height":900}},"image":{"@type":"ImageObject","@id":"https:\/\/img.youtube.com\/vi\/zSAGzfspuDE\/maxresdefault.jpg","url":"https:\/\/img.youtube.com\/vi\/zSAGzfspuDE\/maxresdefault.jpg","height":"","width":""},"url":"https:\/\/blog.terabox.com\/insights\/benchmarks-ia-evaluacion-time-horizons-meter","video":{"@context":"http:\/\/schema.org\/","@type":"VideoObject","@id":
"https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE#VideoObject","contentUrl":"https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE","name":"The AI Progress Chart Everyone Is Misreading \u2014 Beth Barnes & David Rein","description":"Beth Barnes and David Rein on the one graph that ate the AI timelines discourse, and why the two people who built it are the most careful about how you read it.\n\n**SPONSOR**\nProlific - Quality data. From real people. For faster breakthroughs.\nhttps:\/\/www.prolific.com\/?utm_source=mlst\nInterview: https:\/\/youtu.be\/cnxZZTl1tkk\n---\n\nBeth Barnes and David Rein from METR on the one graph that ate the AI timelines discourse, and why the people who built it are the most careful about how it gets read.\n\nBeth founded METR after leaving OpenAI alignment. David is first author on GPQA and co-author on HCAST and the METR Time Horizons paper. Together they built the measurement Daniel Kokotajlo called the single most important piece of evidence on AI timelines: the log-linear line of \"how long a task a frontier model can complete at 50% reliability\" vs release date.\n\nThe conversation opens on reward hacking. Current models can articulate in chat why a behaviour is undesired and then execute it anyway as agents. From there: construct validity, Melanie Mitchell's four-problem taxonomy, and the ARC-AGI 1-to-2 collapse as a worked example of adversarially-selected benchmarks regressing once labs target them. Beth's counter: METR deliberately does not adversarially select. David's: models do not have to do the right thing for the right reasons.\n\nMethodology, then specification \u2014 David's compiler analogy, Beth on four-month tasks as expensive to evaluate rather than unspecifiable. 
Then the SWE-bench reality check, the METR finding that half of passing PRs would not be merged, and Beth's horses-versus-bank-tellers analogy for the labour market.\n\nThe close: monitorability, the coin-spinning boat, two-year recursive self-improvement, and Beth's line that \"overhyped now\" and \"big deal later\" are not correlated claims.\n\n---\nTIMESTAMPS:\n00:00:00 Intro\n00:02:06 Sponsor break: Prolific human-feedback infrastructure\n00:02:33 Welcome and the scalable oversight motivation\n00:06:02 Construct validity, benchmark pathologies and the Chollet worry\n00:15:45 Time Horizons: human time, HCAST tasks and the 50% logistic\n00:24:50 Is human difficulty really one variable?\n00:33:05 Agent harness evolution and the inference-compute dividend\n00:40:00 Scaffolding bells, token budgets and the credit-assignment problem\n00:44:15 Look at the damn graph: regularisation bug and reliability nuance\n00:50:00 Why 50%? Reliability, reward hacking and pizza-party transcripts\n00:55:20 Extrapolation risk and straight lines on graphs\n00:59:25 Software engineering as a specification acquisition problem\n01:07:40 Compilers also made ugly code: vibe-coding quality and Claude on METR Slack\n01:15:15 Strongest defensible claim, Carlini's compiler swarm and AI 2027\n01:23:45 SWE-bench merge rates, the bank-teller analogy and horses\n01:31:45 Scheming, alignment faking and the mentalistic vocabulary problem\n01:40:45 Reward hacking, monitorability and chain-of-thought faithfulness\n01:45:25 Recursive self-improvement, knowledge vs intelligence and closing\n\nSee top comment for 
references!","thumbnailUrl":["https:\/\/i.ytimg.com\/vi\/zSAGzfspuDE\/default.jpg","https:\/\/i.ytimg.com\/vi\/zSAGzfspuDE\/mqdefault.jpg","https:\/\/i.ytimg.com\/vi\/zSAGzfspuDE\/hqdefault.jpg","https:\/\/i.ytimg.com\/vi\/zSAGzfspuDE\/sddefault.jpg","https:\/\/i.ytimg.com\/vi\/zSAGzfspuDE\/maxresdefault.jpg"],"uploadDate":"2026-05-04T11:37:38+00:00","duration":"PT1H53M27S","embedUrl":"https:\/\/www.youtube.com\/embed\/zSAGzfspuDE","publisher":{"@type":"Organization","@id":"https:\/\/www.youtube.com\/channel\/UCMLtBahI5DMrt0NPvDSoIRQ#Organization","url":"https:\/\/www.youtube.com\/channel\/UCMLtBahI5DMrt0NPvDSoIRQ","name":"Machine Learning Street Talk","description":"MLST is the leading highly technical AI podcast. Subscribe now! Welcome! We bring you the latest in advanced AI research, from the best AI experts in the world. Our approach is unrivalled in terms of scope and rigour \u2013 we believe in diversity of ideas (which is to say, not just LLMs!) and we also cover other promising alternative paths to AGI, as well as CogSci, CompSci, Neuro, Mathematics, Philosophy of Mind and Language.  \n\nSupport us on Patreon for early access, exclusive content, private Discord, biweekly calls and much more! \nhttps:\/\/www.patreon.com\/mlst \n\nDonate here: https:\/\/www.paypal.com\/donate\/?hosted_button_id=K2TYRVPBGXVNA\n\nPlease email us to learn about sponsorship packages and deals. 
\ntim at mlst.ai (please put your budget in the subject line)\n\nPodcast booking agencies - *don't contact us* - we wouldn't even interview anyone who needed a booking agent.\nMedia\/influence agencies - *don't contact us* - we only work directly with brands\/sponsors.\n","logo":{"url":"https:\/\/yt3.ggpht.com\/15Akj76BG8IsM5ctgqVwKXArl6IfIVFAbuGa1kOomoioRgJgXHHaLmMAW7iHTMRUoEfyjTtq8lg=s800-c-k-c0x00ffffff-no-rj","width":800,"height":800,"@type":"ImageObject","@id":"https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE#VideoObject_publisher_logo_ImageObject"}},"potentialAction":{"@type":"SeekToAction","@id":"https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE#VideoObject_potentialAction","target":"https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE&t={seek_to_second_number}","startOffset-input":"required name=seek_to_second_number"},"interactionStatistic":[[{"@type":"InteractionCounter","@id":"https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE#VideoObject_interactionStatistic_WatchAction","interactionType":{"@type":"WatchAction"},"userInteractionCount":20630}],{"@type":"InteractionCounter","@id":"https:\/\/www.youtube.com\/watch?v=zSAGzfspuDE#VideoObject_interactionStatistic_LikeAction","interactionType":{"@type":"LikeAction"},"userInteractionCount":405}]},"about":["Insights","\u300eSpanish\u300f"],"wordCount":1623},{"@context":"https:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"name":"Insights","item":"https:\/\/blog.terabox.com\/insights\/#breadcrumbitem"},{"@type":"ListItem","position":2,"name":"AI Benchmarks: How to Measure Real Capabilities","item":"https:\/\/blog.terabox.com\/insights\/benchmarks-ia-evaluacion-time-horizons-meter#breadcrumbitem"}]}]