Tenstorrent ships Galaxy Blackhole, its 32-chip AI server

Each 6U Galaxy node packs 32 Blackhole accelerators, 1 TB of GDDR6, and 23 PFLOPS of FP8, and lists at $110,000 per system.

What's new

Tenstorrent announced general availability of the Galaxy Blackhole platform on April 28, 2026. Each Galaxy node is a 6U system holding 32 Blackhole accelerators arranged in a 4×8 mesh, delivering 23 petaFLOPS of dense FP8 compute, 1 TB of GDDR6 memory at 16 TB/s of bandwidth, and 6.2 GB of on-chip SRAM at 2.9 PB/s. The chips are tied together by a dense Ethernet fabric carrying 100 Tbps of aggregate intra-node bandwidth, and each node exposes up to 56 ports of 800G Ethernet for scale-out, totaling 11.2 TB/s of out-of-node bandwidth when both directions are counted. A single Galaxy node lists at $110,000. A four-node Galaxy Supercluster starts at $440,000, and Tenstorrent says the architecture is rated to scale to 144 nodes and more than four thousand chips.
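
The node-level totals are easier to compare against single-accelerator spec sheets once they are divided out per chip. The Python sketch below does that arithmetic using only the figures quoted above. The constant and function names are illustrative, not from Tenstorrent, and it assumes 1 TB means 1024 GB and that 800G means 800 Gb/s per port in each direction. On those assumptions the scale-out ports work out to 5.6 TB/s per direction, or 11.2 TB/s counting both directions.

```python
# Node-level figures as quoted in the announcement.
CHIPS_PER_NODE = 32          # 4 x 8 mesh
FP8_PFLOPS = 23.0            # dense FP8, whole node
GDDR6_GB = 1024              # assumption: 1 TB taken as 1024 GB
GDDR6_TB_PER_S = 16.0        # aggregate GDDR6 bandwidth
SRAM_GB = 6.2                # aggregate on-chip SRAM
SCALE_OUT_PORTS = 56         # "up to" 56 ports
PORT_GBIT_PER_S = 800        # assumption: 800 Gb/s per port, per direction
MAX_NODES = 144


def per_chip(node_total, chips=CHIPS_PER_NODE):
    """Divide a node-level total evenly across the chips in the node."""
    return node_total / chips


def scale_out_tbytes_per_s(ports=SCALE_OUT_PORTS,
                           gbit_per_s=PORT_GBIT_PER_S,
                           bidirectional=False):
    """Aggregate scale-out bandwidth in TB/s (8 bits per byte, decimal prefixes)."""
    tbytes = ports * gbit_per_s / 8 / 1000
    return 2 * tbytes if bidirectional else tbytes


def max_chips(nodes=MAX_NODES, chips=CHIPS_PER_NODE):
    """Chip count at the rated maximum cluster size."""
    return nodes * chips


if __name__ == "__main__":
    print(f"FP8 per chip:        {per_chip(FP8_PFLOPS) * 1000:.2f} TFLOPS")
    print(f"GDDR6 per chip:      {per_chip(GDDR6_GB):.0f} GB")
    print(f"GDDR6 BW per chip:   {per_chip(GDDR6_TB_PER_S):.2f} TB/s")
    print(f"SRAM per chip:       {per_chip(SRAM_GB) * 1000:.2f} MB")
    print(f"Scale-out, one way:  {scale_out_tbytes_per_s():.1f} TB/s")
    print(f"Scale-out, two way:  {scale_out_tbytes_per_s(bidirectional=True):.1f} TB/s")
    print(f"Chips at 144 nodes:  {max_chips()}")
```

These are simple averages of the published totals and say nothing about how the resources are actually partitioned on the die.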

Alongside availability, Tenstorrent published a set of vendor benchmarks. In "Blitz Mode," the company claims 350+ tokens per second per user and sub-4-second time-to-first-token on DeepSeek-R1-0528 671B, which it positions ahead of comparable Groq and Cerebras systems. A collaboration with Prodia generated a 720p, 81-frame video in 2.4 seconds, which Tenstorrent describes as ten times faster than leading GPU systems on the same workload.

Why it matters

Galaxy is Tenstorrent's first standalone, networked AI system. Where the prior Wormhole generation shipped as a PCIe accelerator that lived inside a host, Blackhole runs Linux on its on-die SiFive X280 cores and uses Ethernet as both the chip-to-chip and node-to-node interconnect. That collapses the NVLink-plus-InfiniBand split that NVIDIA systems carry, and it lets the same fabric scale from a single node to a multi-thousand-chip cluster. The pricing also targets a different part of the market than high-end GPU systems: at $110,000 for 23 PFLOPS of FP8 alongside a terabyte of memory in a 6U box, Galaxy is aimed at inference and long-context reasoning workloads, where memory capacity per dollar tends to become the constraint before peak throughput does.
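
To make the price positioning concrete, the list price can be reduced to cost per unit of compute and per unit of memory. The sketch below is a rough calculation from the quoted list prices only. The names are illustrative, 1 TB is assumed to mean 1024 GB, and it ignores power, networking, support, and any negotiated discounts.

```python
# List-price figures as quoted in the announcement.
NODE_PRICE_USD = 110_000
NODE_FP8_PFLOPS = 23.0
NODE_MEMORY_GB = 1024            # assumption: 1 TB taken as 1024 GB
SUPERCLUSTER_NODES = 4
SUPERCLUSTER_PRICE_USD = 440_000


def usd_per_pflops(price_usd, pflops):
    """List price per petaFLOPS of dense FP8 compute."""
    return price_usd / pflops


def usd_per_gb(price_usd, memory_gb):
    """List price per gigabyte of GDDR6 capacity."""
    return price_usd / memory_gb


if __name__ == "__main__":
    print(f"USD per FP8 PFLOPS: {usd_per_pflops(NODE_PRICE_USD, NODE_FP8_PFLOPS):,.0f}")
    print(f"USD per GB GDDR6:   {usd_per_gb(NODE_PRICE_USD, NODE_MEMORY_GB):,.2f}")
    # The Supercluster starting price is exactly four single-node list prices.
    print(f"USD per node in a Supercluster: "
          f"{SUPERCLUSTER_PRICE_USD / SUPERCLUSTER_NODES:,.0f}")
```

The same two functions can be pointed at any competing system's list price and spec sheet to get a like-for-like ratio, provided the FLOPS figures use the same precision and sparsity assumptions.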

Caveats

The 350+ tokens-per-second and ten-times-faster video figures come from Tenstorrent and have not been independently reproduced. They are described as "Blitz Mode" peak numbers, not sustained throughput, and Tenstorrent has not published model-FLOPS-utilization data under realistic load. GDDR6 also offers less bandwidth than the HBM3E used in NVIDIA Blackwell and AMD's Instinct MI355X, which will matter for memory-bandwidth-bound workloads even with the on-die SRAM. Software maturity remains a question at this scale; tt-metalium is open source, but most production inference stacks are still NVIDIA-first.

Source: Tenstorrent, "Tenstorrent Enables AI At Scale with Industry-Leading Performance Deployed on Novel Networked AI Architecture," April 28, 2026.