GPU vs TPU vs NPU vs FPGA

§ 06

Hardware landscape

A taxonomy of accelerators.

Four families dominate the conversation. They differ on three axes: how flexible they are, how fast they go, and how much power they draw to get there.

Class | Best when | Why | Flexibility | Throughput | Power

GPU

Graphics processing unit

Training. Mixed workloads. When you don't know yet what you'll run.

Thousands of general-purpose parallel cores plus dedicated tensor units. Mature software, broad framework support.

TPU

Tensor processing unit

Large neural-network training and inference at hyperscale.

A systolic array built for one job, matrix multiply, at very high throughput per watt. Less flexible than a GPU, more efficient on the workloads it fits.

NPU

Neural processing unit

On-device inference: phones, cars, laptops, cameras.

Quantized integer math at single-digit watts. Designed to run a fixed model fast without touching the cloud.

FPGA

Field-programmable gate array

Bespoke pipelines: networking, finance, signal processing, prototypes for ASICs.

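You reconfigure logic blocks to fit your algorithm. Very low latency. The programming model is harder: you describe hardware, typically in an HDL such as Verilog or VHDL, rather than write software.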
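To make the TPU row's "systolic array" concrete, here is a minimal sketch in plain Python. It models only the schedule of an output-stationary array, in which each processing element (PE) owns one output cell and operands arrive skewed by one cycle per row and column. The function name `systolic_matmul` and the output-stationary layout are assumptions chosen for illustration; the sketch does not describe how any particular chip is wired.

```python
def systolic_matmul(a, b):
    """Simulate the schedule of an output-stationary systolic array for a @ b.

    PE (i, j) owns output cell (i, j). Rows of `a` enter from the left and
    columns of `b` enter from the top, each delayed by its row or column
    index, so PE (i, j) sees the operand pair with index k = t - i - j
    at cycle t.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    assert all(len(row) == k_dim for row in a), "inner dimensions must match"

    acc = [[0] * n for _ in range(m)]   # one accumulator per PE
    cycles = m + n + k_dim - 2          # cycles until the last PE finishes

    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                k = t - i - j           # which operand pair reaches this PE now
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

Every PE does the same multiply-accumulate on data handed to it by its neighbours, so the hardware needs no instruction fetch or general-purpose scheduling per operation. That is one way to read the efficiency-for-flexibility trade described above.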
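The "quantized integer math" in the NPU row can be sketched the same way. The example below assumes symmetric 8-bit quantization with one scale per tensor, which is one common scheme among several; the helper names `quantize` and `int_dot` are hypothetical and not taken from any NPU toolkit.

```python
def quantize(values, bits=8):
    """Map floats to signed integers using a single symmetric scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(qa, scale_a, qb, scale_b):
    """Dot product done in integers, rescaled to a float once at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))        # integer multiply-accumulate
    return acc * scale_a * scale_b


weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]

q_w, s_w = quantize(weights)
q_a, s_a = quantize(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int_dot(q_w, s_w, q_a, s_a)
print(exact, approx)
```

The inner loop touches only small integers, and the floating-point scales are applied once per output. Integer multipliers are smaller and cheaper in energy than floating-point ones, which is part of how an NPU fits inference into a single-digit-watt budget, at the cost of a small rounding error.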