[{"@context":"https:\/\/schema.org\/","@type":"BlogPosting","@id":"https:\/\/blog.terabox.com\/insights\/ai-inference-llm-performance-batching-scaling-costs#BlogPosting","mainEntityOfPage":"https:\/\/blog.terabox.com\/insights\/ai-inference-llm-performance-batching-scaling-costs","headline":"Unlocking LLM Performance: Batching, Scaling, and Costs","name":"Unlocking LLM Performance: Batching, Scaling, and Costs","description":"\ud83d\udcfa Today\u2019s recommended deep-dive video: https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw Unpacking AI Inference: Batching, Sparsity, and the Memory Wall. The Core Mechanics of AI Inference: Latency and Cost Trade-offs. Scaling Models: Sparsity, Hardware, and Communication Bottlenecks. Pipelining, Memory Tiers, and Advanced Cost Optimization. AI Progress, Scaling Laws, and... ","datePublished":"2026-06-24","dateModified":"2026-06-24","author":{"@type":"Person","@id":"https:\/\/blog.terabox.com\/author\/flextech-admin\/#Person","name":"flextech-admin","url":"https:\/\/blog.terabox.com\/author\/flextech-admin\/","image":{"@type":"ImageObject","@id":"https:\/\/secure.gravatar.com\/avatar\/ad516503a11cd5ca435acc9bb6523536?s=150&d=mm&r=gforcedefault=1","url":"https:\/\/secure.gravatar.com\/avatar\/ad516503a11cd5ca435acc9bb6523536?s=150&d=mm&r=gforcedefault=1","height":96,"width":96}},"publisher":{"@type":"Organization","name":"terabox","logo":{"@type":"ImageObject","@id":"http:\/\/blog.terabox.com\/wp-content\/uploads\/2021\/11\/logo\u4ea7\u54c1\u540d-\u7ad6\u7248.png","url":"http:\/\/blog.terabox.com\/wp-content\/uploads\/2021\/11\/logo\u4ea7\u54c1\u540d-\u7ad6\u7248.png","width":900,"height":900}},"image":{"@type":"ImageObject","@id":"https:\/\/img.youtube.com\/vi\/xmkSf5IS-zw\/maxresdefault.jpg","url":"https:\/\/img.youtube.com\/vi\/xmkSf5IS-zw\/maxresdefault.jpg","height":"","width":""},"url":"https:\/\/blog.terabox.com\/insights\/ai-inference-llm-performance-batching-scaling-costs","video":{"@context":"http:\/\/schema.o
rg\/","@type":"VideoObject","@id":"https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw#VideoObject","contentUrl":"https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw","name":"How GPT, Claude, and Gemini are actually trained and served \u2013 Reiner Pope","description":"Did a very different format with Reiner Pope \u2013 a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It\u2019s a bit technical, but I encourage you to hang in there - it\u2019s really worth it. \n\nThere are less than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him.\n\nReiner is CEO of MatX, a new chip startup (full disclosure - I\u2019m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.\n\n\ud835\udc04\ud835\udc0f\ud835\udc08\ud835\udc12\ud835\udc0e\ud835\udc03\ud835\udc04 \ud835\udc0b\ud835\udc08\ud835\udc0d\ud835\udc0a\ud835\udc12\n* Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too!\nhttps:\/\/reiner-flashcards.vercel.app\/\n\n* Download markdown of transcript here to chat with an LLM:\nhttps:\/\/gist.github.com\/dwarkeshsp\/79100f0fdeed69d76241903bb0604dbe\n\n* Transcript: https:\/\/www.dwarkesh.com\/p\/reiner-pope\n\n\ud835\udc12\ud835\udc0f\ud835\udc0e\ud835\udc0d\ud835\udc12\ud835\udc0e\ud835\udc11\ud835\udc12\n- Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation\u2014which touched on everything from FPGAs to liquid cooling\u2014was extremely helpful as I prepped to interview Reiner. 
You can watch the full discussion and explore Jane Street\u2019s open roles at https:\/\/janestreet.com\/dwarkesh\n\n- Google\u2019s Gemma 4 is the first open model that\u2019s let me shut off the internet and create a fully disconnected \"focus machine\". This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner\u2019s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at https:\/\/goo.gle\/Gemma4\n\n- Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn\u2019t sure the best way to visualize the concept, but Cursor\u2019s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post: https:\/\/www.dwarkesh.com\/p\/what-i-learned-april-15. And if you have something to visualize yourself, go to https:\/\/cursor.com\/dwarkesh\n\n\ud835\udc13\ud835\udc08\ud835\udc0c\ud835\udc04\ud835\udc12\ud835\udc13\ud835\udc00\ud835\udc0c\ud835\udc0f\ud835\udc12\n\n0:00:00 \u2013 How batch size affects token cost and speed\n0:31:59 \u2013 How MoE models are laid out across GPU racks\n0:47:02 \u2013 How pipeline parallelism spreads model layers across racks\n1:03:27 \u2013 Why Ilya said, \u201cAs we now know, pipelining is not wise.\u201d\n1:18:49 \u2013 Because of RL, models may be 100x over-trained beyond Chinchilla-optimal\n1:32:52 \u2013 Deducing long context memory costs from API pricing\n2:03:52 \u2013 Convergent evolution between neural nets and 
cryptography","thumbnailUrl":["https:\/\/i.ytimg.com\/vi\/xmkSf5IS-zw\/default.jpg","https:\/\/i.ytimg.com\/vi\/xmkSf5IS-zw\/mqdefault.jpg","https:\/\/i.ytimg.com\/vi\/xmkSf5IS-zw\/hqdefault.jpg","https:\/\/i.ytimg.com\/vi\/xmkSf5IS-zw\/sddefault.jpg","https:\/\/i.ytimg.com\/vi\/xmkSf5IS-zw\/maxresdefault.jpg"],"uploadDate":"2026-04-29T17:20:27+00:00","duration":"PT2H13M41S","embedUrl":"https:\/\/www.youtube.com\/embed\/xmkSf5IS-zw","publisher":{"@type":"Organization","@id":"https:\/\/www.youtube.com\/channel\/UCXl4i9dYBrFOabk0xGmbkRA#Organization","url":"https:\/\/www.youtube.com\/channel\/UCXl4i9dYBrFOabk0xGmbkRA","name":"Dwarkesh Patel","description":"Deeply researched interviews\n","logo":{"url":"https:\/\/yt3.ggpht.com\/lG-z7sTfhFIW2Ne1oXMHvXMXyZSaA02_I17gUel0GAEj7OypsSHQ7PE91Vp4bTbpm3PTIAWJdko=s800-c-k-c0x00ffffff-no-rj","width":800,"height":800,"@type":"ImageObject","@id":"https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw#VideoObject_publisher_logo_ImageObject"}},"potentialAction":{"@type":"SeekToAction","@id":"https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw#VideoObject_potentialAction","target":"https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw&t={seek_to_second_number}","startOffset-input":"required name=seek_to_second_number"},"interactionStatistic":[[{"@type":"InteractionCounter","@id":"https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw#VideoObject_interactionStatistic_WatchAction","interactionType":{"@type":"WatchAction"},"userInteractionCount":399495}],{"@type":"InteractionCounter","@id":"https:\/\/www.youtube.com\/watch?v=xmkSf5IS-zw#VideoObject_interactionStatistic_LikeAction","interactionType":{"@type":"LikeAction"},"userInteractionCount":9255}]},"about":["Insights","\u300eEnglish\u300f"],"wordCount":4948,"keywords":["unstructured 
data"]}]