
📺 Today’s recommended deep-dive video: https://www.youtube.com/watch?v=oIk3R-sMX5o
Inside the Silicon: How AI Chips Compute at Scale
While large language models feel like magic, they are fundamentally powered by billions of tiny switches and metal traces. Reiner Pope, CEO of MatX, breaks down the architecture of the modern AI accelerator from the perspective of logic gates, data movement, and the physical constraints of light and electricity.
Core Question: How do we optimize the physical layout of silicon to maximize matrix multiplication while minimizing the massive “tax” of data movement?
Highlights
- Multiply-accumulate (MAC) is the fundamental primitive of AI, and the circuit area of its multiplier scales roughly quadratically with bit width.
- Moving data from a register to a logic unit is often more expensive than the calculation itself.
- Systolic arrays solve the communication bottleneck by “baking” matrix loops directly into the hardware.
- Clock cycles represent a trade-off between the speed of operations and the area wasted on synchronization registers.
⏱️ Reading time: approx. 12 minutes · Saves you about 68 minutes vs. watching.
The Atomic Unit of AI: The Multiply-Accumulate
From Logic Gates to Matrix Math
At the very bottom level of a chip, the primitives are logic gates: AND, OR, and NOT. These are connected by metal traces that must be physically laid out on silicon to perform specific functions. For AI, that function is almost exclusively matrix multiplication.
The fundamental primitive within a matrix multiply is the “multiply-accumulate” (MAC) operation. This involves multiplying pairs of numbers and adding them to a running sum. Because errors accumulate quickly during summation, AI chips often use mixed precision, such as multiplying 4-bit numbers but performing the addition at 8-bit precision to maintain accuracy.
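As a rough illustration of why the running sum is kept at a higher precision than the inputs, here is a minimal Python sketch. It assumes a toy number format in which the accumulator is rounded to a fixed number of significant bits after every step; the helper names `round_to_bits` and `dot_mac`, the bit counts, and the input values are all hypothetical and are not taken from the video or from any real chip format.

```python
import math

def round_to_bits(x, mantissa_bits):
    """Round x to a toy float format keeping `mantissa_bits` significant bits.
    (Assumption: the exponent range is unlimited, unlike real FP4/FP8.)"""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

def dot_mac(a, b, acc_bits):
    """Dot product as repeated multiply-accumulates.
    The running sum is rounded to `acc_bits` after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to_bits(acc + x * y, acc_bits)   # one MAC
    return acc

a = [0.1] * 1000
b = [1.0] * 1000
exact = math.fsum(x * y for x, y in zip(a, b))
narrow = dot_mac(a, b, acc_bits=4)    # accumulator as coarse as the inputs
wide = dot_mac(a, b, acc_bits=12)     # wider accumulator
print(exact, narrow, wide)
```

In this toy model the narrow accumulator stops growing once each new product is smaller than half its rounding step, so it ends far below the exact sum of about 100, while the wider accumulator lands much closer.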
To visualize this by hand, consider long multiplication in binary. You multiply one number by every bit of the other, shift each result into position, and then sum them all together. In hardware, this is often handled by a “Dadda multiplier,” which uses full adders (also known as 3:2 compressors) to reduce three bits of input into two bits of output. This circuit captures what humans naturally do when summing columns and carrying values, but it does so with thousands of gates operating in parallel.
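The long-multiplication procedure above can be sketched in Python. This is a simplified illustration, not a Dadda multiplier: it adds the shifted rows one after another with a ripple of full adders, whereas a Dadda design reduces all rows in a tree. The function names are hypothetical.

```python
def full_adder(a, b, c):
    """3:2 compressor: three input bits in, (sum bit, carry bit) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def partial_products(x, y, bits):
    """One shifted copy of x for every set bit of y.
    In hardware each bit of each row is an AND of one x bit and one y bit."""
    return [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]

def partial_product_bits(bits):
    """Number of AND terms in the partial-product grid: bits x bits."""
    return bits * bits

def add_with_full_adders(p, q, width):
    """Add two unsigned numbers bit by bit using only full_adder."""
    result, carry = 0, 0
    for i in range(width):
        s, carry = full_adder((p >> i) & 1, (q >> i) & 1, carry)
        result |= s << i
    return result

def multiply(x, y, bits):
    """Unsigned multiply of two `bits`-wide numbers by summing shifted rows."""
    total = 0
    for row in partial_products(x, y, bits):
        total = add_with_full_adders(total, row, 2 * bits)
    return total

print(multiply(13, 11, 4))
print(partial_product_bits(8), partial_product_bits(4))
```

The grid of partial-product bits is `bits` by `bits`, which is one way to see why multiplier area grows roughly quadratically with bit width.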

💡 Digging Deeper
Q: Why is the MAC the natural primitive for AI?
A: Matrix multiplication is essentially a nested for-loop where a multiply-accumulate happens at every single step. By baking this specific math into a single circuit, chips become incredibly efficient at the one thing they do most.
Q: Why does bit-width scaling matter so much?
A: Multiplier area scales roughly quadratically with bit width. If you halve the precision (e.g., from FP8 to FP4), you don’t just get a linear benefit; you potentially save enough space to perform four times as much compute in the same area.
Q: What is a 3:2 compressor?
A: It is a full adder that takes three single-bit inputs and produces a two-bit binary output (the sum and the carry), effectively “compressing” the data footprint during large additions.
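The nested-loop view from the first answer above can be written out directly. This is a plain Python sketch for illustration only; the function name `matmul_mac` and the MAC counter are assumptions added here to make the loop structure visible.

```python
def matmul_mac(A, B):
    """Multiply matrices A (rows x inner) and B (inner x cols) with explicit loops.
    Returns the product and the number of multiply-accumulates performed."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    mac_count = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc += A[i][k] * B[k][j]   # the multiply-accumulate step
                mac_count += 1
            C[i][j] = acc
    return C, mac_count

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, mac_count = matmul_mac(A, B)
print(C, mac_count)
```

Every pass through the innermost loop is one MAC, so the total count is rows × cols × inner.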
The Hidden Tax: Data Movement and Muxes
The Cost of Reading Memory
In traditional CPU or early GPU architectures, the logic unit—the part that actually does the math—is surprisingly small. The real cost is the data movement. Before a multiply-accumulate can happen, the chip must fetch three inputs from a register file and, after the math is done, write the result back.
To select a specific piece of data from a register file, the chip uses a “mux” (multiplexer). This circuit is built by ANDing every register with a mask (1 if you want the data, 0 if you don’t) and then ORing the results together. This means that to simply read a 4-bit number from an 8-entry register file, you need 32 AND gates (8 entries × 4 bits), before counting the OR gates that merge the results.
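The mask-and-OR construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mux_read` is hypothetical, the select index is turned into a one-hot mask for free, and only the AND gates on the data bits are counted.

```python
def mux_read(registers, select, bits):
    """Read registers[select] using only AND and OR on individual bits.
    Returns the value read and the number of AND gates exercised."""
    and_gates = 0
    out = 0
    for index, value in enumerate(registers):
        mask_bit = 1 if index == select else 0       # one-hot select line
        for b in range(bits):
            gated = ((value >> b) & 1) & mask_bit    # one AND gate per stored bit
            and_gates += 1
            out |= gated << b                        # OR the gated bits together
    return out, and_gates

regs = [3, 7, 1, 15, 0, 9, 4, 12]    # an 8-entry file of 4-bit values
value, and_gates = mux_read(regs, select=5, bits=4)
print(value, and_gates)
```

In this model the AND count grows as entries × bits for every read port. Decoding the select index into the mask and merging the results with OR gates cost extra, which is part of why the selection logic can outweigh the arithmetic.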
When you compare the gate count, the logic doing the work is often dwarfed by the logic just moving the data around. In many processors, moving data from storage to the logic unit is several times more expensive than the multiplication itself. This inefficiency is what motivated the industry to move toward more specialized structures like systolic arrays.

💡 Digging Deeper
Q: Is a mux “invisible” to software?
A: Yes. When a programmer writes a line of code to access a variable, the hardware physically activates thousands of gates to “select” that specific trace of electricity from the soup of the register file.
Q: Why are register files getting larger?
A: Larger register files provide more flexibility for complex applications, but they increase the “tax” of the muxes. It is a constant sizing battle for chip designers.
Systolic Arrays: Baking Loops into Metal
Amortizing the Communication Cost
A systolic array, like the Tensor Cores found in NVIDIA GPUs or the Matrix Units in TPUs, solves the data movement problem by baking the matrix multiply loop into the physical layout. Instead of fetching every single number from a central register file for every single multiply, the chip “parks” one matrix (the weights) locally within the logic units.
Data then flows through the array like blood through a heart, hence the name “systolic.” One vector enters from the left, weights stay fixed in the cells, and the partial sums flow downward. For an array x cells on a side, roughly x values cross the edge each cycle while roughly x² multiply-accumulates run inside, so “x” amount of communication supports “x squared” amount of compute.
This optimization is among the most important factors in the speed of modern AI accelerators. By keeping the weight matrix stationary and “trickle-feeding” new weights into the array slowly, the designer minimizes the expensive wiring needed to connect logic to memory.
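A minimal Python sketch of the weight-stationary dataflow described above. It assumes a simple cycle-by-cycle model: each cell holds one weight, activations move one cell to the right per cycle, and partial sums move one cell down per cycle. The function name `systolic_matmul`, the register names, and the one-cycle skew between rows are illustrative assumptions, not details of any specific chip.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated weight-stationary systolic array.
    X is T x K (one input vector per row), W is K x N (one weight per cell)."""
    T, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]   # activations, moving left to right
    p_reg = [[0] * N for _ in range(K)]   # partial sums, moving top to bottom
    Y = [[0] * N for _ in range(T)]
    for cycle in range(T + K + N - 2):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    t = cycle - k                     # row k enters k cycles late
                    a_in = X[t][k] if 0 <= t < T else 0
                else:
                    a_in = a_reg[k][n - 1]            # from the left neighbour
                p_in = p_reg[k - 1][n] if k > 0 else 0  # from the cell above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * W[k][n]   # one MAC per cell per cycle
        a_reg, p_reg = new_a, new_p
        for n in range(N):
            t = cycle - (K - 1) - n                   # which input vector exits now
            if 0 <= t < T:
                Y[t][n] = p_reg[K - 1][n]
    return Y

X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
Y = systolic_matmul(X, W)
print(Y)
```

Only the left edge of the grid receives new data each cycle (K values), yet all K × N cells perform a multiply-accumulate, which is the communication-versus-compute ratio the section describes.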

💡 Digging Deeper
Q: How do weights get into the array?
A: They are “daisy-chained.” A number is fed into the top row, and on each clock cycle, it shifts down to the next row until the entire grid is populated.
Q: What is the trade-off of a larger systolic array?
A: Larger arrays are more efficient at big matrix multiplies but lose “utilization” if the math you need to do is smaller than the grid you built.
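The two answers above can be illustrated with a small Python sketch. Both functions are hypothetical models written for this summary: `load_weights_daisy_chain` shifts one row of weights down per cycle, and `utilization` uses a deliberately simple assumption that a single matrix tile is mapped onto the array and the unused cells sit idle.

```python
def load_weights_daisy_chain(W):
    """Fill a grid by feeding one row per cycle into the top and shifting down."""
    rows, cols = len(W), len(W[0])
    grid = [[0] * cols for _ in range(rows)]
    # Feed the bottom row first so it has time to travel the full height.
    for incoming in reversed(W):
        # Every row hands its weights to the row below it...
        for r in range(rows - 1, 0, -1):
            grid[r] = grid[r - 1]
        # ...and the top row latches the newly fed values.
        grid[0] = list(incoming)
    return grid

def utilization(m_rows, m_cols, array_rows, array_cols):
    """Fraction of cells doing useful work for one tile (simplified model)."""
    used = min(m_rows, array_rows) * min(m_cols, array_cols)
    return used / (array_rows * array_cols)

W = [[1, 2], [3, 4], [5, 6]]
grid = load_weights_daisy_chain(W)
print(grid)
print(utilization(32, 32, 128, 128))
```

Loading takes one cycle per row of the grid, and under this simple model a matrix much smaller than the array leaves most cells idle.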
Timing, Throughput, and Determinism
The Global Heartbeat
On every clock cycle, roughly once per nanosecond at a clock near 1 GHz, the entire chip synchronizes. Registers act as “fences” that hold a bit of data until the clock strikes, at which point the data is released into a cloud of logic to find its next destination.
Chip designers must ensure that the “delay” through a cloud of logic is shorter than the clock cycle. If you want a faster clock (more gigahertz), you must use smaller clouds of logic. This often requires “pipeline register insertion,” where you split a complex calculation into two steps by putting a register in the middle.
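The timing rule above can be put into numbers with a short Python sketch. The delay values and the fixed per-register overhead are made-up illustrative assumptions, and the function names are hypothetical; real timing analysis involves many more effects.

```python
def max_clock_ghz(stage_delays_ps, register_overhead_ps):
    """Fastest clock: the period must cover the slowest logic cloud
    plus the time spent getting through a register (assumed constant)."""
    period_ps = max(stage_delays_ps) + register_overhead_ps
    return 1000.0 / period_ps

def register_share(stage_delays_ps, register_overhead_ps):
    """Fraction of each clock period spent in registers rather than logic."""
    period_ps = max(stage_delays_ps) + register_overhead_ps
    return register_overhead_ps / period_ps

# One 900 ps cloud of logic versus the same logic split by a pipeline register.
one_stage = max_clock_ghz([900], 100)
two_stage = max_clock_ghz([450, 450], 100)
print(one_stage, two_stage)
print(register_share([900], 100), register_share([450, 450], 100))
```

With these assumed numbers, splitting the logic raises the achievable clock, but a larger share of every cycle goes to registers rather than to computation. The extra registers also take up area, which this sketch does not model.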
While this increases the clock speed, it also increases the area spent on synchronization registers. Area overhead of a different kind explains why FPGAs (Field-Programmable Gate Arrays) are roughly an order of magnitude less efficient than ASICs: FPGAs use “Lookup Tables” (LUTs) to emulate any gate, and those LUTs are massive compared to the hard-wired gates of a dedicated AI chip.

💡 Digging Deeper
Q: Why do FPGAs have 10x overhead?
A: A LUT is essentially a memory table that acts like a gate. To do the work of a single AND gate, an FPGA might use 32 gates worth of area to maintain the ability to be re-programmed later.
Q: What makes a CPU non-deterministic?
A: Features like branch predictors and caches. A cache might “hit” or “miss” based on what other programs are doing, making it impossible to know exactly how many nanoseconds an operation will take.
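The idea of a LUT as “a memory table that acts like a gate,” from the first answer above, can be sketched in Python. The 4-input size and the function names `make_lut` and `lut_eval` are assumptions chosen for illustration; the sketch shows the behavior only and says nothing about actual area.

```python
def make_lut(func, num_inputs):
    """Program a LUT: store func's output bit for every input combination."""
    table = []
    for index in range(2 ** num_inputs):
        bits = [(index >> i) & 1 for i in range(num_inputs)]
        table.append(func(*bits) & 1)
    return table

def lut_eval(table, *inputs):
    """Evaluate: the inputs form an address, and the stored bit is the output."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return table[index]

# The same structure becomes an AND or an XOR depending only on the stored table.
and_lut = make_lut(lambda a, b, c, d: a & b & c & d, 4)
xor_lut = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
print(len(and_lut), lut_eval(and_lut, 1, 1, 1, 1), lut_eval(xor_lut, 1, 0, 0, 0))
```

A 4-input table stores 16 bits and needs selection logic to pick one of them, even when the function being emulated is a single simple gate. That reprogrammability is what the hard-wired gate does not have to pay for.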
Key Takeaways
The history of AI chip design is essentially a war against the cost of moving information. As we move from logic gates to systolic arrays, the goal is always to maximize the amount of “work” (multiplications) performed for every “trip” data takes to and from memory.
Modern accelerators are shifting away from the “von Neumann” model of serial processing toward massive parallelism. While CPUs spend vast amounts of area on branch predictors and caches to guess what a single thread might do next, AI chips discard that complexity in favor of thousands of simple units that do exactly the same thing in lockstep.
The physical scaling of these chips is reaching a fascinating inflection point. We are now optimizing at the level of picoseconds and microns, balancing the quadratic costs of precision against the linear limits of bandwidth. The next generation of chips will likely push this even further, blurring the lines between where memory ends and logic begins.
Q&A
Q1: Why is rounding error a concern in 4-bit multiplication?
When you sum thousands of low-precision numbers, small rounding errors in each step add up. By performing the accumulation at a higher precision (like 8-bit or 16-bit), you preserve the signal through the layers of the neural network.
Q2: What is the “Weight Stationary” approach?
It is a design choice where the model’s weights are loaded into the compute units and stay there while different inputs flow through. This minimizes the energy spent moving the largest part of the model.
Q3: How does a branch predictor work in a CPU?
Because instructions take time to process, the CPU “guesses” which way a branch (like an if statement) will go five cycles before it happens. If it guesses wrong, it has to throw away all the work it did, which is a major efficiency hit.
Q4: What is the difference between a cache and a scratchpad?
A cache is managed by hardware and is non-deterministic; you don’t know if your data is there until you check. A scratchpad (like in a TPU) is managed entirely by software, giving the programmer total control and deterministic timing.
Q5: Why aren’t all chips run at 10 GHz?
Running a clock that fast would require putting registers between almost every single gate. The chip would be mostly “fences” and very little “math,” leading to high latency and poor total throughput.
Q6: What is a “gate equivalent”?
It is a unit of measurement for chip area, usually based on the size of a standard 2-input NAND gate. It helps designers compare the “cost” of different components like muxes versus adders.
Q7: Can you “funge” FP4 and FP8 circuits?
Generally, no. Because the math for each precision is physically mapped in silicon, you usually have to decide how much of the chip area to dedicate to each specific format during the design phase.
