§ 06Hardware landscape
A taxonomy of accelerators.
Four families dominate the conversation. They differ on three axes: how flexible they are, how fast they go, and how much power they draw to get there.
ClassBest whenWhyFlexibilityThroughputPower
GPU
Graphics processing unitTraining. Mixed workloads. When you don't know yet what you'll run.
Thousands of general-purpose parallel cores plus dedicated tensor units. Mature software, broad framework support.
TPU
Tensor processing unitLarge neural-network training and inference at hyperscale.
A systolic array that does one thing, matrix multiply, at extraordinary throughput per watt. Less flexible, more efficient.
NPU
Neural processing unitOn-device inference: phones, cars, laptops, cameras.
Quantized integer math at single-digit watts. Designed to run a fixed model fast without touching the cloud.
FPGA
Field-programmable gate arrayBespoke pipelines: networking, finance, signal processing, prototypes for ASICs.
You reconfigure logic blocks to fit your algorithm. Very low latency. Programming model is harder.