Hellas

thunderbolt-ibverbs: We have InfiniBand at home

I spent the past few weeks working on this project, and I thought it might be interesting to write up a technical report on it: the motivation, the process, what I learned, and so on.

DISCLAIMER: all of the code in this repo (github.com/hellas-ai/thunderbolt-ibverbs) is AI-generated (mostly Codex 5.5 and Opus 4.7). While I made an effort to understand enough of it to keep it on track, I almost certainly failed in many instances, and I'm sure the code contains many false assumptions, hallucinations and plain stupidity. No warranty or guarantee offered, for research use only, not for human consumption.

TL;DR. We write a Linux kernel module and a userspace shim to pretend our generic USB4 connection is a low-latency, high-performance InfiniBand device, and use it to perform distributed inference across two 128 GB Strix Halo mini PCs. Basic interop with Apple's native protocol is functional.

Two Strix Halo mini-PCs (strix-1, strix-2) connected by USB4

  • ~48 Gb/s per direction (~95 Gb/s bidirectional total) sustained ib_write_bw, 4-HCA aggregate at 1 MiB / 8 QPs with IOMMU off, vs ~2.3 Gb/s over the onboard 2.5 GbE and ~9 Gb/s for soft-RoCE on top of thunderbolt-net at the per-rail level.
  • ~7 µs one-way ib_write_lat at 64 B, single QP, vs ~28 µs over RXE/2.5 GbE and ~65 µs over RXE/TBnet.
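
To put the headline numbers in proportion, the improvement factors can be worked out directly. This is only a minimal sketch: the figures are the approximate values quoted in the bullets above, the dictionary keys and the `ratio_table` helper are made up for illustration, and the bandwidth comparison follows the bullets in setting a 4-HCA aggregate against a soft-RoCE figure described there as per-rail.

```python
# Approximate figures copied from the bullets above (not new measurements).
OURS_BW_GBPS = 48.0  # ib_write_bw per direction, 4-HCA aggregate
OURS_LAT_US = 7.0    # ib_write_lat one-way, 64 B, single QP

baseline_bw_gbps = {"2.5 GbE": 2.3, "RXE/TBnet": 9.0}
baseline_lat_us = {"RXE/2.5 GbE": 28.0, "RXE/TBnet": 65.0}


def ratio_table(ours, baselines, higher_is_better):
    """Return {baseline name: improvement factor} (hypothetical helper)."""
    if higher_is_better:
        # Bandwidth: our value divided by the baseline value.
        return {name: ours / value for name, value in baselines.items()}
    # Latency: baseline value divided by our value.
    return {name: value / ours for name, value in baselines.items()}


bw_gain = ratio_table(OURS_BW_GBPS, baseline_bw_gbps, higher_is_better=True)
lat_gain = ratio_table(OURS_LAT_US, baseline_lat_us, higher_is_better=False)

for name, factor in bw_gain.items():
    print(f"bandwidth vs {name}: ~{factor:.1f}x")
for name, factor in lat_gain.items():
    print(f"latency vs {name}: ~{factor:.1f}x")
```

With the quoted figures this comes out to roughly 21x and 5x on bandwidth, and roughly 4x and 9x on latency.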

ib_write_bw between strix-1 and strix-2, by transport and QPs