
📺 Today’s recommended deep-dive video: https://www.youtube.com/watch?v=oIk3R-sMX5o
Under the Silicon: How AI Chips Actually Compute
Most users see AI as a software miracle, but the real magic is a physical layout of logic gates and metal traces. Reiner Pope, CEO of MatX, breaks down the architecture that allows modern chips to perform trillions of matrix operations per second.
Core Question: How do hardware designers prioritize massive matrix multiplication while overcoming the crushing overhead of data movement and synchronization?
Highlights
- The fundamental primitive of AI hardware is the multiply-accumulate (MAC) operation.
- Data movement via multiplexers (muxes) can consume seven times more circuit area than the logic itself.
- Systolic arrays solve the “communication tax” by storing weight matrices locally where the compute happens.
- The trade-off between clock speed and throughput defines the limits of modern chip efficiency.
⏱️ Reading time: approx. 10 minutes · Saves you about 70 minutes vs. watching.
The Mathematics of Metal
From Logic Gates to Matrix Multiplies
Modern AI chips are physical realizations of basic logic gates like AND and OR, physically etched as metal traces on silicon wafers.
To compute a matrix multiplication, the chip performs a series of “multiply-accumulate” (MAC) operations. This involves taking pairs of numbers, multiplying them, and adding the result to a running total. Because errors accumulate during the summation, designers often use lower precision for multiplication (like FP4) and higher precision for the accumulator (like FP8). This allows for faster processing without sacrificing numerical stability: with too few accumulator bits, rounding errors in a long sum would swamp the small contributions of individual products and degrade the network’s output.
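The effect of accumulator width can be sketched in a few lines of Python. This is a toy model, not a real FP4 or FP8 format: the helper `round_to_bits` and the mantissa widths (8 and 24 bits) are assumptions chosen for illustration, and only the accumulator is rounded while each product is computed exactly.

```python
import math

def round_to_bits(x, mantissa_bits):
    """Round x to a value with `mantissa_bits` significant bits.

    Toy model (assumption): exponent range is unlimited, so this only
    captures the loss of mantissa precision, not overflow or underflow.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

def mac_dot(xs, ws, acc_bits):
    """Dot product as a chain of multiply-accumulate steps.

    After every step the running total is rounded to `acc_bits`
    significant bits, imitating a narrow hardware accumulator.
    """
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = round_to_bits(acc + x * w, acc_bits)
    return acc

ones = [1.0] * 4096
narrow = mac_dot(ones, ones, acc_bits=8)    # stalls once the total outgrows 8 bits
wide = mac_dot(ones, ones, acc_bits=24)     # keeps every contribution
print(narrow, wide)
```

With the narrow accumulator the running total stops growing once adding 1.0 no longer changes the rounded value, which is the kind of lost signal a wider accumulator is meant to prevent.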
The physical implementation of these operations relies on components like the Dadda multiplier and full adders. A full adder, or a 3-to-2 compressor, is a clever circuit that takes three input bits and produces two output bits to represent their binary sum. By layering these adders, the chip can collapse a grid of partial products into a single result. Interestingly, the area required for these circuits scales quadratically with bit width, making low-precision arithmetic the single most effective way to boost performance in the modern era of LLMs.
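As a rough illustration of how layered full adders collapse partial products, here is a minimal Python sketch. It groups rows in threes (a simplified Wallace-style schedule, not the exact Dadda ordering), and the function names and the 8-bit default width are assumptions made for the example.

```python
def full_adder(a, b, c):
    """One-bit full adder: three input bits in, (sum bit, carry bit) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def compress_3_to_2(x, y, z):
    """Apply a full adder at every bit position of three integers at once.

    Returns two rows whose ordinary sum equals x + y + z. The carry row
    is shifted left because each carry belongs to the next bit position.
    """
    sum_row = x ^ y ^ z
    carry_row = ((x & y) | (x & z) | (y & z)) << 1
    return sum_row, carry_row

def multiply(a, b, width=8):
    """Multiply two unsigned `width`-bit integers via partial products."""
    # One partial-product row per bit of b: a shifted copy of a, or zero.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    # Repeatedly replace each group of three rows with two rows.
    while len(rows) > 2:
        reduced = []
        i = 0
        while i + 3 <= len(rows):
            reduced.extend(compress_3_to_2(rows[i], rows[i + 1], rows[i + 2]))
            i += 3
        reduced.extend(rows[i:])      # leftover rows pass through unchanged
        rows = reduced
    # The last two rows go through one conventional carry-propagate add.
    return sum(rows)

print(multiply(13, 11))
```

Only the final addition has to propagate carries across the full width; every compression layer before it works on all bit positions independently.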

💡 Digging Deeper
Q: Why is the accumulator higher precision than the inputs?
A: Summing thousands of numbers leads to rounding errors that accumulate quickly. Using more bits for the addition step preserves the signal.
Q: What is a 3-to-2 compressor?
A: It is a “full adder” that takes three bits of the same position and outputs a sum bit and a carry bit, effectively reducing three rows of data into two.
Q: How does bit-width affect chip area?
A: It scales quadratically. Halving the bit precision can sometimes lead to a 4x increase in speed or area efficiency, though practical constraints often limit this to 2x or 3x.
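The quadratic scaling follows from counting partial-product bits: an n-bit by n-bit multiply produces n rows of n bits. A small sketch, where multiplier area is assumed to be proportional to that count and the 16-bit baseline is an arbitrary choice:

```python
def partial_product_bits(width):
    """Number of single-bit partial products in a width x width multiply."""
    return width * width

def relative_multiplier_area(width, baseline=16):
    """Area relative to a baseline width, assuming area tracks the bit count."""
    return partial_product_bits(width) / partial_product_bits(baseline)

for width in (4, 8, 16, 32):
    print(width, partial_product_bits(width), relative_multiplier_area(width))
```

Each halving of the width cuts the count by a factor of four, which is the idealized 4x; wiring, registers, and the accumulator do not shrink as quickly, consistent with the 2x to 3x seen in practice.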
The Communication Tax
Why Moving Data Costs More Than Computing It
In traditional architectures like CPUs, the logic unit (ALU) sits next to a register file, which acts as a small, fast memory.
Whenever the chip needs to perform a calculation, it must “select” the correct data from the registers using a multiplexer, or “mux.” This selection process is surprisingly expensive in terms of silicon real estate, often requiring significantly more gates than the multiplication itself. To a software developer, “selecting an index” feels free; to a chip designer, it is an expensive physical routing problem that consumes power and space.
When you quantify the gates, the data movement overhead becomes staggering. For a small register file, the muxes might consume three times as many gates as the actual multiply-accumulate circuit. This means that prior to specialized AI cores, we were spending the vast majority of our silicon area on “taxes”—the act of moving bits—rather than the work of intelligence. This bottleneck is what eventually forced the industry to move toward the systolic array architecture found in modern Tensor Cores.
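A back-of-the-envelope gate count shows why selection gets expensive. Everything numeric here is an assumption for illustration: 3 gates per two-input mux, 5 gates per full adder, roughly one full adder per partial-product bit, and two read ports. It is not meant to reproduce the exact ratios quoted in the video.

```python
GATES_PER_MUX2 = 3        # assumption: 2 AND + 1 OR, select inverter ignored
GATES_PER_FULL_ADDER = 5  # assumption: 2 XOR + 2 AND + 1 OR

def mux_gates(entries, width, read_ports=2):
    """Gates needed to select `read_ports` operands from a register file.

    Picking 1 of `entries` values takes (entries - 1) two-input muxes
    per bit, repeated for every bit of the word and every read port.
    """
    return read_ports * width * (entries - 1) * GATES_PER_MUX2

def mac_gates(width):
    """Rough gate count for a width x width multiply-accumulate.

    Assumes width*width AND gates to form the partial products and about
    width*width full adders to sum them into the running total.
    """
    return width * width * (1 + GATES_PER_FULL_ADDER)

def mux_to_mac_ratio(entries, width, read_ports=2):
    return mux_gates(entries, width, read_ports) / mac_gates(width)

for entries in (8, 16, 32, 64):
    print(entries, mux_gates(entries, 8), mac_gates(8), mux_to_mac_ratio(entries, 8))
```

Under these assumptions the mux cost grows linearly with the number of registers while the MAC cost stays fixed, so even a modest register file ends up spending more gates on selection than on arithmetic.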

The Systolic Solution
Tensor Cores and Local Storage
To solve the communication tax, designers moved up a level in the matrix multiplication loop by baking the entire loop into the physical hardware.
This is the birth of the systolic array, the technology behind Google’s TPU and Nvidia’s Tensor Cores. Instead of fetching every number from a central register file every single cycle, the systolic array stores the “weight” matrix locally within the logic units themselves. By “trickle-feeding” weights slowly and keeping them fixed in place, the chip can perform quadratic amounts of compute while only paying a linear price for the wiring.
This spatial arrangement allows data to flow through the chip like blood through a heart, hence the name “systolic.”
By tiling these units, chips can process massive batches of data with extreme efficiency. The trade-off is a loss of flexibility. A large systolic array is unparalleled at dense matrix math, but it becomes inefficient if the workload requires non-deterministic control, such as complex branching or sparse data. This is why chips are increasingly a hybrid of general-purpose “vector units” and specialized “matrix units.”
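The weight-stationary idea can be simulated cycle by cycle. The sketch below is an assumed textbook-style layout, not the design of any particular TPU or Tensor Core: each cell holds one fixed weight, activations move one cell to the right per cycle, and partial sums move one cell down per cycle.

```python
def systolic_matmul(A, W):
    """Compute A @ W on a simulated K x N weight-stationary systolic array.

    A is M x K (one activation row per batch item), W is K x N.
    Cell (i, j) permanently stores W[i][j] and talks only to its
    left and upper neighbours.
    """
    M, K, N = len(A), len(W), len(W[0])
    x_reg = [[0] * N for _ in range(K)]   # activation held by each cell
    p_reg = [[0] * N for _ in range(K)]   # partial sum held by each cell
    out = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):        # cycles needed to drain the array
        new_x = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for i in range(K):
            for j in range(N):
                if j == 0:
                    # Row i receives batch item m = t - i, skewed by i cycles.
                    m = t - i
                    x_in = A[m][i] if 0 <= m < M else 0
                else:
                    x_in = x_reg[i][j - 1]        # from the left neighbour
                p_in = p_reg[i - 1][j] if i > 0 else 0   # from the cell above
                new_x[i][j] = x_in
                new_p[i][j] = p_in + x_in * W[i][j]      # one MAC per cell
        x_reg, p_reg = new_x, new_p
        # Finished sums leave the bottom row, one batch item per cycle.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                out[m][j] = p_reg[K - 1][j]
    return out

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each cycle performs up to K x N multiply-accumulates, yet only K new activations enter at the left edge and N results leave at the bottom, which is the quadratic-compute-for-linear-wiring trade described above.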

The Pulse of the Machine
Clock Cycles and Deterministic Latency
Roughly once every nanosecond on a chip clocked near 1 GHz, every register on the chip latches its input at the same instant, synchronizing all of its parallel units.
This is the global clock cycle. It ensures that if one part of the chip is calculating a sum and another is fetching a value, they don’t get out of sync. If a signal from one logic path arrives too late for the next cycle, the entire computation can fail. To avoid this, designers must carefully manage the “delay” through the logic clouds between registers to ensure every bit arrives exactly on time for the next beat.
Inserting “pipeline registers” allows a chip to run at a higher frequency by splitting long logic paths into smaller, faster steps.
However, this is a pure trade-off between clock speed and area efficiency. If you make the clock too fast, you spend all your silicon on these “stop-and-go” registers rather than actual compute logic. It’s a hardware version of the batch size problem: a faster clock (lower latency) can actually decrease total throughput because the overhead of synchronization eventually swallows the gains in speed. Predictable timing is also valuable in its own right, which is why high-frequency trading favors FPGAs: they offer deterministic latency.
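The trade-off can be made concrete with a toy model. All numbers are assumptions in arbitrary units: a block of logic with a fixed total delay and area is cut into equal pipeline stages, and each stage adds one register's worth of delay and area.

```python
def throughput_per_area(stages, logic_delay=16.0, reg_delay=1.0,
                        logic_area=16.0, reg_area=1.0):
    """Results per unit time per unit area for a pipeline with `stages` stages.

    Assumed model: the clock period is one stage of logic plus one
    register delay, and every stage costs one register's worth of area.
    """
    cycle_time = logic_delay / stages + reg_delay
    clock = 1.0 / cycle_time                  # beats per unit time
    area = logic_area + stages * reg_area     # silicon spent
    return clock / area                       # one result per beat

def best_stage_count(max_stages=64):
    """Stage count with the highest throughput per area in this model."""
    return max(range(1, max_stages + 1), key=throughput_per_area)

for stages in (1, 4, 16, 64):
    print(stages, throughput_per_area(stages))
print("best:", best_stage_count())
```

In this model the clock keeps rising as stages are added, but throughput per area peaks at a moderate depth and then falls as registers crowd out compute logic.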

Key Takeaways
The history of AI chip design is a constant battle against the physical cost of communication. While early processors treated data movement as a minor background task, modern accelerators are designed entirely around the realization that moving a bit costs more than calculating it. By freezing weights in place and using systolic arrays, chips like the H100 or TPUv5 can achieve the massive FLOP counts required for the current AI boom.
Furthermore, the transition from general-purpose CPUs to specialized matrix-movers highlights the shift toward deterministic hardware. By removing features like branch predictors and complex caches—which make CPUs fast but unpredictable—AI chip designers can pack thousands more “workers” onto the same die area. This focus on structured, parallel, and low-precision math is the fundamental engine driving the scale of modern intelligence.
Q&A
Q1: What is the main difference between a CPU and a GPU at the hardware level?
A: A CPU uses a large portion of its area for “branch predictors” and large caches to handle unpredictable code. A GPU strips these out to pack in more ALUs (Arithmetic Logic Units) for parallel tasks.
Q2: Why do AI chips use FP4 or FP8 instead of the standard FP32?
A: Lower precision requires far less silicon area and power, because multiplier area shrinks roughly quadratically with bit width. Since neural nets are naturally resilient to small errors, this allows for a massive increase in throughput for the same cost.
Q3: What makes an FPGA unique?
A: An FPGA (Field-Programmable Gate Array) uses “lookup tables” (LUTs) that can be reprogrammed to act as any type of logic gate. This allows developers to change the chip’s physical logic after it’s been built.
Q4: Why are FPGAs roughly 10x more expensive/slower than ASICs?
A: Because they use LUTs and muxes to simulate gates. A simple 4-input AND gate that takes 3 gates in an ASIC might take 32 gates in an FPGA to maintain that programmability.
Q5: What is a “scratchpad” and how does it differ from a cache?
A: A cache is managed by hardware and is non-deterministic (you don’t know if you’ll hit or miss). A scratchpad is managed by software, giving the programmer exact control over when data moves.
Q6: How does clock speed affect throughput?
A: Throughput is the product of area efficiency (how much you do per beat) and clock speed (how many beats per second). If the clock is too fast, the area efficiency drops because you need too many registers.
Q7: Why are TPUs structured differently than GPUs?
A: TPUs use very large, coarse-grained matrix units to maximize the efficiency of huge matrix multiplies. GPUs use many smaller, tiled units (SMs) to remain more flexible for different types of graphics and compute workloads.
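To make the lookup-table idea from Q3 and Q4 concrete, here is a minimal sketch of a 4-input LUT in Python. The class name `Lut4` and the helper `program` are hypothetical names for this example, not part of any FPGA toolchain.

```python
class Lut4:
    """A 4-input lookup table: 16 stored bits, one per input combination."""

    def __init__(self, truth_table):
        if len(truth_table) != 16:
            raise ValueError("a 4-input LUT needs exactly 16 configuration bits")
        self.table = list(truth_table)

    def __call__(self, a, b, c, d):
        # The four inputs form an address; in hardware this selection is
        # a 16-to-1 mux, which is a tree of 15 two-input muxes.
        index = a | (b << 1) | (c << 2) | (d << 3)
        return self.table[index]

def program(fn):
    """Fill a LUT by evaluating `fn` on all 16 input combinations."""
    bits = [fn(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1)
            for i in range(16)]
    return Lut4(bits)

# The same "hardware" becomes a different gate just by changing stored bits.
and4 = program(lambda a, b, c, d: a & b & c & d)
xor4 = program(lambda a, b, c, d: a ^ b ^ c ^ d)
print(and4(1, 1, 1, 1), and4(1, 0, 1, 1), xor4(1, 1, 1, 0))
```

The 16 configuration bits and the selection mux are present no matter how simple the programmed function is, which is the overhead an ASIC avoids by wiring the gate directly.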
