Hellas

V. Closing

What this lets you do

So let's count our wins. We have an open-source driver which can:

  • Pool bandwidth across multiple cables, and across both lanes of a single cable
  • Implement libibverbs, so we can support vLLM, RCCL, MPI: anything that takes a libibverbs HCA
  • Demonstrate actual, real-world value for certain use-cases (finetuning, TP=2 inference)
  • Caveat: Mac support is minimal. We demonstrated that our Soft-DMA transport works, along with JACCL integration and simple compute, but ran no real workloads, and there is currently no reason for anyone to do this.

The three headline numbers, one chart each:

Raw interconnect — ~95 Gb/s bidi (~48 Gb/s per direction) at 1 MiB, 4-HCA aggregate, IOMMU off:

perftest 4-HCA aggregate at 1 MiB — per-HCA write/send bandwidth, fwd vs rev, with 4-rail totals annotated
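As a rough sanity check on that headline figure, the arithmetic can be sketched as below. The only inputs taken from the text are the ~95 Gb/s bidirectional aggregate and the 4-HCA count; the symmetric forward/reverse split and the even split across HCAs are simplifying assumptions for illustration, not measured per-rail values, and the function names are hypothetical.

```python
# Back-of-the-envelope breakdown of the quoted aggregate figure.
# Inputs are the approximate numbers from the text; the even split
# across directions and HCAs is an ASSUMPTION, not a measurement.

BIDI_GBPS = 95.0  # ~95 Gb/s bidirectional, 4-HCA aggregate at 1 MiB
NUM_HCAS = 4      # number of HCAs (rails) being aggregated


def per_direction_gbps(bidi_gbps: float) -> float:
    """Bandwidth in one direction, assuming symmetric traffic."""
    return bidi_gbps / 2


def per_hca_per_direction_gbps(bidi_gbps: float, num_hcas: int) -> float:
    """Average one-way bandwidth per HCA, assuming an even split."""
    return per_direction_gbps(bidi_gbps) / num_hcas


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to (decimal) gigabytes per second."""
    return gbps / 8


if __name__ == "__main__":
    one_way = per_direction_gbps(BIDI_GBPS)
    per_hca = per_hca_per_direction_gbps(BIDI_GBPS, NUM_HCAS)
    print(f"one-way aggregate: {one_way:.1f} Gb/s "
          f"({gbps_to_gigabytes_per_s(one_way):.2f} GB/s)")
    print(f"average per HCA, one way: {per_hca:.2f} Gb/s")
```

Halving 95 gives 47.5 Gb/s, which the text rounds to ~48 Gb/s per direction.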

vLLM MiniMax-M2.7 (AWQ, ~230 B MoE / ~10 B active) TP=2 across both Strix boxes — a model that doesn't fit on a single Strix at all. RDMA wins ~30% over TCP-over-Thunderbolt at batch 1 and the gap narrows to ~5% by batch 8, as compute starts to amortize the interconnect cost:

vLLM MiniMax-M2.7 TP=2 on 2x Strix Halo — throughput vs batch (max_num_seqs), TBnet vs native 4-HCA RDMA

Gemma 3 27B LoRA FSDP — 1359 s on Ethernet → 126 s on 4-HCA RDMA, a 10.8× speedup:

Gemma 3 27B LoRA FSDP — 1-step wall time on 2× Strix Halo, by transport
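The speedup quoted for the Gemma run follows directly from the two wall times. A minimal sketch of that calculation, using only the 1359 s and 126 s figures from the text (the function names are hypothetical):

```python
# Speedup and time saved for the Gemma 3 27B LoRA FSDP step,
# using the two wall times quoted in the text.

ETHERNET_S = 1359.0  # 1-step wall time over Ethernet, seconds
RDMA_S = 126.0       # 1-step wall time over 4-HCA RDMA, seconds


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new transport is than the baseline."""
    return baseline_s / new_s


def fraction_saved(baseline_s: float, new_s: float) -> float:
    """Fraction of the baseline wall time that is eliminated."""
    return 1.0 - new_s / baseline_s


if __name__ == "__main__":
    print(f"speedup: {speedup(ETHERNET_S, RDMA_S):.1f}x")
    print(f"wall time saved: {fraction_saved(ETHERNET_S, RDMA_S):.1%}")
```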

What's this got to do with Hellas, anyway?

The Hellas mission:

run whatever compute you want, wherever you want

There are various provisos and elaborations to this[0][1], but fundamentally we want people to be able to run frontier-competitive AI models at home. Being able to scale whatever commodity compute they have access to is critical to making this accessible to the whole planet: not just those who can afford sparkly gold or brushed aluminum computers, but those with plastic ones too.

0: more strictly- get the answer to whatever computation you want, without running it
1: even more strictly- bind two parties in a trustless/atomic 'economic value for results' transaction

What's next / Q&A

Q: So, we have this custom driver and patches that somewhat work, but that's a long way from a supported software stack like our closed-source competitors. Is Hellas going to be supporting this and maintaining it going forwards?

A: Frankly, no. It's a lot of work to do all this, and even the past few weeks have been a huge distraction for me. I hope that by 'putting this out there', we can create a nucleation point for the open-source community to build something better. I'll probably keep poking Codex with 'yes, continue', but nothing else.

Q: Okay well if you're not supporting it, can I at least use this with Catgrad to do TP inference today?

A: No. You have to use existing 3rd-party engines like vLLM and llama.cpp. We'll have news to share on the new Catgrad compiler and runtime soon, but a spoiler for now: it will initially be single-node only, no RDMA at all.

Q: Why did you spend so much time on this if it doesn't work with Catgrad and Hellas yet?

A: We'll need it eventually. If we can ship it early and help the broader non-Hellas ecosystem, why not?

Try it

  • Code: https://github.com/hellas-ai/thunderbolt-ibverbs
  • Built on Nix — nix develop from the project's flake.nix gives you the patched kernel, the out-of-tree module, perftest, and a vLLM/RCCL environment in one shot.
  • Who should care: anyone running models on Strix Halo (or any AMD platform with USB4 ports), anyone building multi-box at-home inference, and kernel folks who care about the Thunderbolt subsystem.
  • Discord: https://discord.gg/qHqZyuAa3u
  • Find me on X: @0xBaltar