Deterministic transcendental functions

2026-06-08T00:00:00+00:00

Background

We want our inference process to be deterministic: for a given model checkpoint, the same input should produce the same output, bit for bit, on every platform. This requires careful control over which optimizations the CPU/Metal/CUDA/HIP compilers are allowed to make, which platform APIs and primitives can be relied on, and how operations and reductions are ordered during execution. Ordering is a notorious source of nondeterminism because floating point rounding makes FP addition non-associative, and it matters even in pure low precision integer arithmetic because of saturation and clamping.

Since all current realistic workloads contain floating point operations (even quantized models keep a few sensitive layers in high precision), we will not discuss integer arithmetic here.

For deterministic execution all the component building blocks of the computation graph must themselves be deterministic. While in a modern transformer model the majority of these are matmuls, there are a few blocks using transcendental functions too:

- softmax relies on exp and is present in most self-attention and output layers
- positional embeddings use sin and cos
- the softplus function in Mamba-like layers is defined as ln(1+e^x)
The problem is that these functions are not guaranteed to produce bitwise identical results on different platforms. This snippet will likely print False for all four functions.

```python
import torch

# Create tensor on CPU
a = torch.rand(200)

# Compare each function's CPU result with its GPU result (requires a CUDA device)
for f in torch.sin, torch.cos, torch.log, torch.exp:
    print(torch.equal(f(a), f(a.to('cuda:0')).to('cpu')))
```

This post will look at how to make these transcendental functions have reproducible outputs.

Floating point notions

IEEE 754 - The IEEE Standard for Floating-Point Arithmetic, originally published in 1985 and most recently updated in 2019. It defines the floating point formats, operations, rounding modes and exceptions that a conforming implementation should support.

Floating point formats - the bit representations of floating point numbers at different precisions. IEEE 754 describes binary16, binary32, binary64, binary128 and binary256 along with a few decimal formats. binary32 and binary64 were called single and double in the original text and correspond to the float and double C types. For deep learning only binary16 and binary32 are relevant.

Rounding modes - the standard describes five rounding modes: toward −∞, toward +∞, toward 0, round to nearest ties away from 0, and round to nearest ties to even. The last is the most commonly used in deep learning; for determinism one should pick it and stick to it.

Correctly rounded - a floating point operation that behaves as if it computed the exact real result and then rounded it according to the chosen rounding mode. The result is uniquely determined, which makes the operation deterministic.

Faithfully rounded - a floating point operation whose result is one of the two floating-point numbers neighbouring the exact result. Either neighbour is acceptable, so two faithful implementations may return different bits for the same input. This makes the result nondeterministic across implementations, but a faithful implementation is usually faster than a correctly rounded one.

Transcendental functions - functions that are not algebraic, meaning they cannot be expressed as the solution of a polynomial equation. They include the trigonometric, exponential and hyperbolic functions and their inverses, for example sin, cos, exp, ln, tanh.
Elementary functions - in the floating point and approximation theory literature these are the transcendental functions plus some algebraic functions like sqrt(1-x^2): basic functions that are not primitive arithmetic operations but are commonly used in numerical computations.

Required operations in IEEE 754: +, -, *, /, sqrt and, starting with the 2008 update, FMA (fused multiply-add). The standard requires these to be correctly rounded.

Recommended operations in IEEE 754: most of the elementary functions are recommended but not required to be part of an IEEE 754 compliant implementation.

Implementing transcendental functions

Since these are not required operations in IEEE 754, most libraries do not provide correctly rounded implementations of them. If they did, determinism would be a solved problem for transcendentals.

These functions are usually provided by the platform libm or its equivalents. These are independent implementations shipped as part of glibc on Linux, LLVM libc, and the CUDA and HIP runtime libraries, each tailored to its platform with different trade-offs between performance and accuracy. Their outputs are not bitwise identical because they differ in approximation method, rounding choices and other implementation details. The CUDA API even has two variants: the very fast intrinsics like __sinf and __cosf, which translate to hardware instructions run on the GPU's Special Function Unit, and the slower but more accurate APIs building on these intrinsics.

We need reproducible, reasonably performant and reasonably accurate implementations of transcendental functions for deep learning workloads. The reproducibility requirement rules out relying on platform APIs.
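To quantify how far two platform results are apart, it helps to count the representable binary32 values between them (the distance in ULPs). The helper below is a minimal sketch using only the standard library; the names f32_bits and ulp_distance_f32 are made up for illustration. Two correctly rounded implementations would always be at distance 0, while two faithfully rounded ones may be 1 apart.

```python
import struct

def f32_bits(x):
    """Return the binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f32_ordered(x):
    """Map a finite binary32 value to an integer that preserves numeric order."""
    bits = f32_bits(x)
    magnitude = bits & 0x7FFFFFFF
    # Negative values count downwards from zero, so +0.0 and -0.0 both map to 0
    return -magnitude if bits & 0x80000000 else magnitude

def ulp_distance_f32(a, b):
    """Number of binary32 steps between a and b (both assumed finite)."""
    return abs(f32_ordered(a) - f32_ordered(b))

print(ulp_distance_f32(1.0, 1.0 + 2**-23))
```

Applied element by element to the CPU and GPU outputs of the earlier snippet, a helper like this would show whether the mismatches are single-ULP rounding differences or something larger.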
Starting from an existing correctly rounded or cross-platform implementation from projects like Core-Math, RLibm or SLEEF is one possibility, but these are more generic and complex than necessary for deep learning, they also target high precision scientific computing, and more often than not they need porting to non-CPU platforms. The other option is to implement the functions from scratch, specifically for our use case.

Polynomial approximation

Transcendental functions do not have exact formulas in terms of primitive operations, so they are usually implemented using approximation methods like polynomial expansions and/or lookup tables. According to the Weierstrass approximation theorem, any continuous function on a closed interval can be approximated to arbitrary precision by a polynomial; in general, the higher the degree, the better the achievable approximation.

Briefly, to approximate a given function on a given input range, a function-specific polynomial is picked according to the desired accuracy, then each call to the function evaluates this fixed polynomial on the input.
For example, a straightforward and naive implementation of sin and cos would be the sum of the first few terms of their respective Taylor series (also known as Maclaurin series in this particular case of expanding around 0):

\[ \sin(x) = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!} \]

and

\[ \cos(x) = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!} \]

```python
import math
import torch

def sin_taylor(x):
    return x - x**3/math.factorial(3) + x**5/math.factorial(5) - x**7/math.factorial(7)

def cos_taylor(x):
    return 1 - x**2/math.factorial(2) + x**4/math.factorial(4) - x**6/math.factorial(6)

inputs = torch.linspace(-math.pi/4, math.pi/4, 100)
print(torch.allclose(inputs.sin(), sin_taylor(inputs)))
print(torch.allclose(inputs.cos(), cos_taylor(inputs)))
```

```
True
True
```

If the input interval is small and close to zero, such an approximation can be acceptable, but changing the interval in the script above to [-π/2, π/2] makes the checks fail, and extra terms from the Taylor expansion are needed to keep the errors under control. Production implementations that have to deal with large input ranges need another approach, one that involves better polynomials and mapping the inputs to a smaller range.

Minimax polynomial

Because the Taylor polynomial is increasingly inaccurate as the input moves away from zero, even as more terms of the expansion are used, a so-called minimax polynomial is the usual choice. At the expense of slightly larger error close to zero, it provides uniform accuracy over the entire input range. It is called minimax because it minimizes the maximum error over the input range. Where the Taylor approximation error shoots up as we move away from zero, the minimax error is a small, bounded, sine-like oscillation of uniform amplitude.
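The growth of the Taylor error away from zero can be checked numerically. The sketch below uses only the standard library and measures the largest absolute error of the four-term sine polynomial on two intervals; the function names and the sample count are arbitrary choices made for illustration.

```python
import math

def sin_taylor7(x):
    # First four terms of the Maclaurin series for sin
    return x - x**3 / math.factorial(3) + x**5 / math.factorial(5) - x**7 / math.factorial(7)

def max_abs_error(half_width, samples=1001):
    """Largest |sin_taylor7(x) - sin(x)| over evenly spaced x in [-half_width, half_width]."""
    worst = 0.0
    for i in range(samples):
        x = -half_width + 2.0 * half_width * i / (samples - 1)
        worst = max(worst, abs(sin_taylor7(x) - math.sin(x)))
    return worst

err_quarter = max_abs_error(math.pi / 4)
err_half = max_abs_error(math.pi / 2)
print(f"max error on [-pi/4, pi/4]: {err_quarter:.2e}")
print(f"max error on [-pi/2, pi/2]: {err_half:.2e}")
```

Doubling the interval width makes the worst-case error several hundred times larger, because the first omitted term grows like x^9.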
The theory behind computing such a polynomial involves Chebyshev nodes, Lagrange interpolation and the Remez algorithm, which is the standard iterative method of producing the coefficients. Depending on the input range and the accuracy requirements, there are multiple possible minimax polynomials for a given function.

The coefficients can be lifted from existing production libraries or computed from scratch using the Remez algorithm. One popular implementation is in the Sollya project, which is both a library and a small scripting language for safe floating-point code development. Here are invocations of Sollya from the command line to compute the minimax polynomials for sin(x) and cos(x), each followed by its output:

```
echo "fpminimax(sin(x), [|1,3,5,7|], [|SG...|], [-pi/4, pi/4], absolute);" | sollya
x * (1 + x^2 * (-0.166666507720947265625 + x^2 * (8.331983350217342376708984375e-3 + x^2 * (-1.94961365195922553539276123046875e-4))))

echo "fpminimax(cos(x), [|0,2,4,6|], [|SG...|], [-pi/4, pi/4], absolute);" | sollya
1 + x^2 * (-0.49999892711639404296875 + x^2 * (4.16561998426914215087890625e-2 + x^2 * (-1.35968066751956939697265625e-3)))
```

We passed the [-π/4, π/4] range on which to approximate, and the powers of x the polynomials may use. Since sin is an odd function (f(-x) = -f(x)) and cos is an even function (f(-x) = f(x)), their minimax polynomials have non-zero coefficients only for odd and even powers of x respectively, but the coefficients are slightly different from the corresponding Taylor coefficients.

The output of Sollya is an expression that can be used to evaluate the polynomial for any input value on the given interval. It is of the form

\[ C_0 + x \cdot (C_1 + x \cdot (C_2 + x \cdot (\ldots))) \]

so that the polynomial can be evaluated with fewer multiplications than the naive Python Taylor expressions above. This is known as Horner's scheme.
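As a minimal sketch of Horner's scheme for an arbitrary coefficient list, the helper below takes coefficients ordered from the lowest power to the highest. The names horner and sin_minimax are made up for illustration; the sine coefficients are the ones printed by Sollya above, applied to x^2 because only odd powers are present.

```python
import math

def horner(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... with one multiply and one add per coefficient."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# Coefficients of x^1, x^3, x^5, x^7 from the Sollya output for sin
SIN_COEFFS = [
    1.0,
    -0.166666507720947265625,
    8.331983350217342376708984375e-3,
    -1.94961365195922553539276123046875e-4,
]

def sin_minimax(x):
    # sin(x) ~ x * P(x^2), where P is a cubic in x^2
    return x * horner(SIN_COEFFS, x * x)

print(horner([1.0, 2.0, 3.0], 2.0))      # 1 + 2*2 + 3*4
print(sin_minimax(0.5), math.sin(0.5))
```

The loop form makes the cost explicit: a degree-n polynomial needs n multiplications, whereas spelling out every power separately needs many more.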
There's a parallelized version known as Estrin's scheme, but for the low degree polynomials used in most transcendental function implementations, with only 4-5 terms, it is rarely justified. A more frequent optimization is using FMA (fused multiply-add) instead of separate multiplication and addition, because the fused operation rounds once instead of twice, yielding better precision. FMA is present on most modern CPUs and GPUs, so we recommend using it consistently in a deterministic implementation.

Here are examples of sin and cos approximations using the minimax polynomials computed above, evaluated using Horner's scheme, one using explicit multiplication and addition and the other using FMA:

```python
import math

def sin_approx(x):
    """
    Approximates sin(x) using a minimax polynomial of degree 7
    on the reduced interval [-π/4, π/4].
    """
    C1 = 1
    C3 = -0.166666507720947265625
    C5 = 8.331983350217342376708984375e-3
    C7 = -1.94961365195922553539276123046875e-4
    # Horner's scheme for sin
    return x * (C1 + x*x*(C3 + x*x*(C5 + x*x*C7)))

def cos_approx(x):
    """
    Approximates cos(x) using a minimax polynomial of degree 6
    on the reduced interval [-π/4, π/4].
    """
    C0 = 1
    C2 = -0.49999892711639404296875
    C4 = 4.16561998426914215087890625e-2
    C6 = -1.35968066751956939697265625e-3
    # Horner's scheme for cos using FMA (math.fma requires Python 3.13+)
    # return C0 + x*x*(C2 + x*x*(C4 + x*x*C6))
    x2 = x * x
    c = math.fma(x2, C6, C4)
    c = math.fma(x2, c, C2)
    c = math.fma(x2, c, C0)
    return c
```

Range reduction

Even with minimax polynomials it is impractical to approximate a function over a large input range. To maintain accuracy the degree of the polynomial needs to increase as the range increases, slowing down computation and causing representation issues if coefficients become too large or too small.
One way around this is piecewise approximation, where the input range is divided into smaller intervals and a different polynomial is used for each one, but this makes the code more complicated. The standard approach is range reduction:

- find a small fixed range of input values on which a low degree polynomial approximates the function well enough
- find an algebraic relation that expresses the function at any input in terms of the function evaluated only on this small fixed range
- implement the function by translating the input to the small range, approximating the output on that range, and computing the final result from this reduced approximation by applying the inverse translation

Since they are related periodic functions, sin(x) and cos(x) can be expressed as ±sin(xr) or ±cos(xr) where xr is in the [-π/4, π/4] range.

Let \(x = q\frac{\pi}{2} + r\) where \(r \in \left[-\frac{\pi}{4}, \frac{\pi}{4}\right]\) and \(q \in \mathbb{Z}\). Then, depending on \(q \bmod 4\):

\[\sin(x) = \begin{cases} \sin(r) & q \equiv 0 \pmod{4} \\ \cos(r) & q \equiv 1 \pmod{4} \\ -\sin(r) & q \equiv 2 \pmod{4} \\ -\cos(r) & q \equiv 3 \pmod{4} \end{cases}\]

```python
def reduced_sincos(x):
    # Nearest integer multiple of π/2
    q = round(x / (math.pi/2))
    # direct subtraction causes catastrophic cancellation
    # xr = x - q * (math.pi/2)
    # use Cody-Waite subtraction instead
    xr = cody_waite_subtract(x, q)
    sin = sin_approx(xr)
    cos = cos_approx(xr)
    # Only the quadrant matters; Python's % is non-negative for negative q too
    match q % 4:
        case 0:
            return sin, cos
        case 1:
            return cos, -sin
        case 2:
            return -sin, -cos
        case 3:
            return -cos, sin
```

The naive reduction xr = x - q * (math.pi/2) can suffer from catastrophic cancellation, so most implementations use the Cody-Waite reduction: π/2 is expressed as a sum of constants of decreasing magnitude, where the leading constant has few enough significant bits that its product with q is exact, and instead of a single subtraction, these constants are subtracted one at a time.
```python
def cody_waite_subtract(x, q):
    # P0 + P1 ≈ π/2, with P0 holding only the leading bits
    P0 = float.fromhex("0x1.92p+0")
    P1 = float.fromhex("0x1.fb54442d18p-12")
    return (x - q*P0) - q*P1
```

It is better to express constants as hexadecimal or binary literals to avoid any ambiguity in how decimal literals are parsed and represented. This is a generic approach, used regardless of the reduction interval: for example [-π/2, π/2] is used for approximating tan. When working with double precision there are variants that express the sum using 3 or 4 constants instead of just the 2 here.

Approximating exponential and logarithm

The same principles apply to log and exp as to the trigonometric functions: find a reduction interval and a formula that maps arbitrarily large inputs to that interval, then use polynomial approximation on it. Unlike the periodic trigonometric functions, where the approximated values can be used almost directly, here we need a reconstruction step to map the values on the small interval back to the full range.

Exponential

For exp we reduce to the interval [0, log(2)] and use a minimax polynomial whose coefficients are computed using Sollya. Any exp value can be expressed as

\[ e^x = 2^k e^{x - k \log 2} \]

where \(k = \lfloor x / \log 2 \rfloor\), so that the reduced argument \(x - k \log 2\) lies in [0, log(2)).
```
echo "fpminimax(exp(x), [|0,1,2,3|], [|SG...|], [0, log(2)], absolute);" | sollya
0.9998929500579833984375 + x * (1.0047757625579833984375 + x * (0.4669305980205535888671875 + x * 0.23783318698406219482421875))
```

```python
import math

def reduced_exp(x):
    # floor keeps the reduced argument in [0, log(2)), the interval the polynomial was fitted on
    k = math.floor(x / math.log(2))
    # Split log(2) in two to avoid catastrophic cancellation
    LN2_HI = float.fromhex('0x1.62e4000000000p-1')
    LN2_LO = float.fromhex('0x1.7f7d1cf780000p-20')
    xr = x - k*LN2_HI - k*LN2_LO
    C0 = 0.9998929500579833984375
    C1 = 1.0047757625579833984375
    C2 = 0.4669305980205535888671875
    C3 = 0.23783318698406219482421875
    p = C0 + xr * (C1 + xr * (C2 + xr * C3))
    # Reconstruction: e^x = 2^k * e^xr
    return p * (2**k)
```

Logarithm

The natural logarithm of a number \(x = m \cdot 2^e\) can be expressed using the mantissa m and exponent e of its float representation.

\[ \ln(x) = \ln(m) + e \ln(2) \]

If we use frexp, the mantissa is normalized to [0.5, 1), so that is the reduction range we look for minimax coefficients on:

```
echo "fpminimax(log(x), [|0,1,2,3|], [|SG...|], [0.5, 1], absolute);" | sollya
-2.1859228610992431640625 + x * (4.22526264190673828125 + x * (-2.9164140224456787109375 + x * 0.877515852451324462890625))
```

```python
import math

def reduced_log(x):
    assert x > 0
    # x = m * 2**e with m in [0.5, 1)
    m, e = math.frexp(x)
    C0 = -2.1859228610992431640625
    C1 = 4.22526264190673828125
    C2 = -2.9164140224456787109375
    C3 = 0.877515852451324462890625
    rl = C0 + m * (C1 + m * (C2 + m * C3))
    # Reconstruction: ln(x) = ln(m) + e * ln(2)
    return rl + e*math.log(2)
```

Zeros, NaNs and infinities

These edge values need explicit handling because NaN representation can vary across platforms. We should pick one valid NaN bit pattern of the several available and use it consistently. The checks for NaN and infinity should come first, before any other computation, so that range reduction and approximation work on valid inputs only. The above Python snippets do not include these checks.
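Picking one NaN bit pattern and using it consistently can be sketched at the bit level. The choice of 0x7FC00000 (positive quiet NaN with an empty payload) and the function names below are assumptions made for illustration, not something the post prescribes.

```python
import math
import struct

# Assumed canonical choice: positive quiet NaN with zero payload
CANONICAL_NAN_BITS = 0x7FC00000

def f32_from_bits(bits):
    """Reinterpret a 32-bit unsigned integer as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def canonicalize_f32_bits(bits):
    """Replace any binary32 NaN bit pattern with the canonical one; leave other values alone."""
    exponent_all_ones = (bits & 0x7F800000) == 0x7F800000
    mantissa_nonzero = (bits & 0x007FFFFF) != 0
    if exponent_all_ones and mantissa_nonzero:
        return CANONICAL_NAN_BITS
    return bits

# A negative NaN with a payload, +Inf and 1.0
for bits in (0xFFC00001, 0x7F800000, 0x3F800000):
    print(hex(bits), "->", hex(canonicalize_f32_bits(bits)))
```

Working on the integer bit pattern rather than on a Python float sidesteps any question of whether a conversion preserves the NaN payload.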
- For sin and cos, an Inf input should be treated like NaN and return NaN.
- For sin, if the input is ±0 return 0 with the same sign.
- For cos and exp, if the input is ±0 return 1 directly.
- For exp, if the input is +Inf return +Inf, and if the input is -Inf return +0.

Deep learning specific considerations

It makes sense to also implement a function that computes both sine and cosine at the same time. They share the reduction stage, and often both values are needed for the same input anyway, as in the case of positional embeddings.

Plain table lookup without polynomial approximation is a good option when the set of possible inputs is known to be small and fixed, such as in FP8 or a subset of BF16, although these formats are not really used for positional embeddings due to loss of accuracy at longer contexts.

The Cody-Waite method loses accuracy when reducing very large inputs (> 2^20) for sin and cos, and in those cases Payne-Hanek reduction is used instead, but for positional embeddings we're fine with the simpler method.

We can get away with interval reduction and a four-term Taylor series expansion for sin, cos and exp when running LLMs, but since minimax polynomials can get the same or better accuracy with fewer terms, we prefer to use them instead.

Conclusion

For determinism we must pick a rounding mode, decide whether or not to use FMA, and choose the range reduction method, the polynomial approximation method and its coefficients, and the evaluation method, then implement the algorithm in the same way on all target platforms. These choices should depend on the input range and the accuracy requirements, and ideally on benchmarks of the various options for speed and compliance. There is a wide range of options for each of them, but it is safe to just use FMA, round to nearest ties to even, a minimax polynomial generated by Sollya and Horner evaluation.
References

- Nvidia article on floating point
- Correctly Rounded Evaluation of a Function: Why, How, and at What Cost?
- Elementary Functions: Algorithms and Implementation, a book by Jean-Michel Muller

Inductive Types in Lean

2026-04-02T00:00:00+00:00

Inductive types in Lean allow for conservative extensions of the core theory (the calculus of inductive constructions) by adding new freely generated types. Here "conservative" is a term of art: it means that our additions do not "increase the power" of the underlying theory by adding new axioms; instead, they extend the language in a way designed to preserve consistency.

But how does this work internally? What exactly is added to the logical theory when we write inductive ...? This post answers that question using natural numbers as the running example. Suppose we declare a Nat type as follows:

```lean
namespace MyNat

  inductive Nat : Type
  | zero : Nat
  | succ : Nat → Nat

end MyNat
```

When we do this, Lean adds constructors, or "introduction rules":

```lean
Nat.zero : Nat
Nat.succ : Nat → Nat
```

These let us construct (or 'introduce') Nat values.
For example, the value for 1 is introduced as Nat.succ Nat.zero : Nat.

But we also need to deconstruct (or 'eliminate') a Nat value.
Lean adds a "recursor" Nat.rec for this, whose type can be printed using
#check Nat.rec:
```lean
MyNat.Nat.rec.{u}
    {motive : Nat → Sort u}
    (zero : motive Nat.zero)
    (succ : (a : Nat) → motive a → motive a.succ)
    (t : Nat) : motive t
```

Spelling this out, we have:

1. A dependent function motive assigning to each Nat value n some type denoted motive n.
2. A base case - a value zero of type motive Nat.zero
3. An inductive case - a function succ mapping each a : Nat and a value of type motive a to a value of type motive (Nat.succ a)

With these three arguments applied, we're left with a function of type
(t : Nat) → motive t.
This is a dependent function type mapping each Nat value to a result whose
type depends on the value.
A motivating proof-shaped use is ∀ n, n = n.
In Lean, proving this means constructing a term of that type.
Nat.rec is exactly the induction/recursion principle that lets us build such
terms by giving:

1. a value at zero, and
2. a way to extend a value at a to one at succ a.
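
As a small sketch of this proof-shaped use, the statement ∀ n, n = n can be proved by applying the recursor directly. The theorem name eq_self is made up for illustration, and the snippet assumes the MyNat.Nat declaration from the start of the post is in scope.

```lean
namespace MyNat

-- Hypothetical name. The motive states what is proved for each k.
theorem Nat.eq_self (n : Nat) : n = n :=
  Nat.rec (motive := fun k => k = k)
    rfl                -- base case: Nat.zero = Nat.zero
    (fun _ _ => rfl)   -- inductive case: succ a = succ a, the hypothesis is not needed here
    n

end MyNat
```

The motive is given explicitly here to make the three arguments of Nat.rec visible; in everyday Lean the induction tactic fills it in.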

Together, the constructors (introduction rules) and recursor (elimination rule)
form the core logical/computational content of an inductive declaration.
However, Lean also automatically derives a couple of useful utilities.

Helpers: casesOn and noConfusion

In addition to rec, Lean adds a special case: Nat.casesOn, which lets us do
a 'shallow match' of cases, without recursing.

```lean
-- casesOn
#check Nat.casesOn
-- MyNat.Nat.casesOn.{u} {motive : Nat → Sort u} (t : Nat) (zero : motive Nat.zero) (succ : (a : Nat) → motive a.succ) :
--   motive t
```
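
To illustrate the 'shallow match', here is a sketch of a predecessor function written with casesOn. The name myPred and the convention that the predecessor of zero is zero are assumptions made for this example, and the MyNat.Nat declaration from earlier is assumed to be in scope.

```lean
namespace MyNat

-- Hypothetical helper: predecessor, with myPred zero = zero by convention.
def myPred (n : Nat) : Nat :=
  Nat.casesOn (motive := fun _ => Nat) n
    Nat.zero        -- zero case
    (fun a => a)    -- succ a case: only a is available, no recursive result

example : myPred (Nat.succ Nat.zero) = Nat.zero := rfl

end MyNat
```

Compared with Nat.rec, the succ branch receives only the argument a and not a value of motive a, which is exactly what "without recursing" means.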

As a quick aside, note that if you try to add a case to Nat with the same name
as one of these automatically added functions, you will get an error! E.g., adding

```lean
inductive Nat : Type
| zero : Nat
| casesOn : Nat -- will cause an error
| succ : Nat → Nat
```

... will cause an error like this:

```
error: (kernel) constant has already been declared 'MyNat.Nat.casesOn'
```

A more important helper is Nat.noConfusion.
This is a theorem that says if t = t', then:

1. Matching constructors imply equal arguments
2. Different constructors are impossible

In table form:

| t | t' | Consequence of t = t' |
|---|---|---|
| zero | zero | Trivial - no args to compare |
| zero | succ a | Impossible - different constructors |
| succ a | zero | Impossible - different constructors |
| succ a | succ a₁ | a = a₁ |

Note that noConfusion is purely a helper: we could write it by hand, but it's both tedious
and mechanically derivable, so Lean gives it to us automatically.
How does this work? Let's examine Nat.noConfusion by #check-ing it first.

```lean
#check Nat.noConfusion
-- MyNat.Nat.noConfusion.{u} {P : Sort u} {t t' : Nat} (eq : t = t') : Nat.noConfusionType P t t'
```

This isn't particularly helpful until we examine the definition of noConfusionType.
Informally, it unpacks as the following case analysis:

| t | t' | Nat.noConfusionType P t t' |
|---|---|---|
| zero | zero | P → P |
| zero | succ a | P |
| succ a | zero | P |
| succ a | succ a₁ | (a = a₁ → P) → P |

More precisely, when we #print Nat.noConfusionType, we get this nested case
analysis which first unpacks t, then t' within each branch.
I've indented the #print output for easier reading:

```lean
@[reducible] protected def MyNat.Nat.noConfusionType.{u} : Sort u → Nat → Nat → Sort u :=
    fun P t t' =>
        Nat.casesOn t
            (Nat.casesOn t'
                (P → P)
                fun a => P
            )
            (fun a => Nat.casesOn t'
                        P
                        (fun a_1 => (a = a_1 → P) → P)
            )
```

Lean will also add an injectivity helper Nat.succ.inj as a useful special
case: a proof that succ a = succ a' → a = a'. This generalises to
other inductive types, with one such helper for each constructor.
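
Both rows of the table can be exercised directly. The sketch below assumes the MyNat.Nat declaration from the start of the post; the anonymous examples are illustrative only.

```lean
namespace MyNat

-- Different constructors: with P := False, noConfusion refutes the equation.
example (a : Nat) : Nat.zero ≠ Nat.succ a :=
  fun h => Nat.noConfusion h

-- Same constructor: the injectivity helper extracts equality of the arguments.
example (a b : Nat) (h : Nat.succ a = Nat.succ b) : a = b :=
  Nat.succ.inj h

end MyNat
```

In the first example the type Nat.noConfusionType False Nat.zero (Nat.succ a) reduces to False, which is why the term closes the goal.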


Will AI do to Software Engineering what Offshoring did to Manufacturing?
2025-10-31T00:00:00+00:00
AI is replacing work traditionally given to junior software engineers^1.
The thesis^2 ^3 is that repetitive, boilerplate work can now be automated,
freeing senior engineers to apply the tacit skills of programming which AI is not yet able to automate.

But this tacit knowledge is developed through years of practice:
the design and management of large codebases, managing complexity, knowing when
to incur tech debt, when to pay it off, and so on.

So if there are no longer incentives to train junior software engineers,
what will happen to the industry?
We can make a prediction by observing how the same pattern unfolded in another
industry: manufacturing.

US manufacturing employment peaked in 1979^4, after which offshoring began to
erode the labor force.
Consequently, there was less demand for trainees, resulting in a
"labor population pyramid" skewed toward older (senior) workers.
Now, as those seniors retire, they take their tacit knowledge with them, and it
becomes harder and harder to train new juniors^5.
Ultimately, offshoring caused a vicious cycle which led to a skills gap around
30 years later^6.

Note that the opposite of this effect happens as well. To quote Pisano and
Shih:

"Once an industrial commons has taken root in a region, a powerful virtuous
cycle feeds its growth. Experts flock there because that’s where the jobs and
knowledge networks are. Firms do the same to tap the talent pool, stay
abreast of advances, and be near suppliers and potential partners."

Now let's translate this to software engineering to make a prediction.
We have the same initial conditions, except instead of offshoring replacing
juniors, it's AI.
If the same pattern unfolds, then in ~30 years' time we'd expect to see much of
the tacit knowledge of programming disappear from the workforce, and a similar
skills gap.

Will this actually happen? This time there are some differences:

1. AI may progress enough to replace seniors too (exacerbating the problem?)
2. Software skills become outdated faster (we'll be training more juniors anyway)
3. Software has a low barrier to entry for learning, and AI can help you learn
4. Junior work is not offshored (if the AI is US-based), so is the "industrial commons" really eroded?
5. The "industrial commons" of open source software is distributed and accessible from anywhere

I'd bet against the erosion of software engineering purely based on (3) and
(5): it's easy enough to train oneself without significant monetary investment,
and the "industrial commons" of software exists at least partly in open source
projects where new developers can learn from seniors by contributing their
labor for free.

However, I would predict that new software engineers will have to front more
of the cost of learning that could previously be done on the job, and that this
will mean fewer people will select software engineering as a career for purely
economic reasons over "love of the game".

Time will tell!

^1 Demand for junior developers softens as AI takes over
^2 https://x.com/yacineMTB/status/1984161544570335676
^3 Impact of AI on the 2025 Software Engineering Job Market
^4 Forty years of falling manufacturing employment
^5 The Manufacturing Skills Gap
^6 The skills gap in U.S. manufacturing


Three Solutions to Nondeterminism in AI
2025-09-29T00:00:00+00:00
I want to convince you that verification is necessary for you to
safely offload cognitive work onto AI while still trusting the results.
Then I want to show you three ways to do this:

1. Deal with floating-point nonassociativity
2. Remove floating-point calculations from models entirely
3. Verify tasks directly

I'll frame the discussion in terms of LLMs, but it applies to a much
broader class of models.

(Side note: You may have seen a
recent blog post by Horace He
at Thinking Machines
which discusses how non-determinism in tensor compute "turns on-policy RL into off-policy RL".
It's a great blog post which you should read, but consequences for RL are not
what I want to talk about here.)

The Problem: Closed Source is not Verifiable

Assume you have no access to model weights, code, or inputs (e.g., the system prompt).
This is the state of affairs when you use a closed model provider like OpenAI or Anthropic.
In this case, the model provider is able to undetectably behave in ways which
are not aligned with your interests, e.g.:

- Routing to lower-quality models at peak usage times
- Quantizing or degrading model quality to save money
- Injecting hidden messages into your context
- Censoring tokens, words, or full requests and responses outright
- Inserting overt or covert advertisements into responses

This is really just the AI flavour of
enshittification,
where closed-source
software vendors extract value from you by degrading your experience.

... but even open code and open weights are not enough

Now suppose you have an open model where code, weights, and inputs are all known to you,
and that you trust it to behave roughly in line with your interests.
Assume that you don't have the hardware to run this model, and instead you ask
a third party compute provider to run it for you.
Ideally, you could verify your model's outputs as genuine and detect any
naughty behaviour.

However, nondeterminism means this is still not possible in practice.
Simply put, nondeterminism means model outputs are not uniquely determined by inputs,
so you cannot distinguish between:

1. Honest behaviour (A genuine model output which is different to the one you calculate)
2. Dishonest behaviour (A faked model output which has been manipulated on purpose)

Model nondeterminism fundamentally arises from non-associativity of floating point arithmetic:
the "Original Sin" as Horace He puts it.
A simple example is the sum below:

```python
>>> 0.1 + (0.2 + 0.3)
0.6
>>> (0.1 + 0.2) + 0.3
0.6000000000000001
```

Even small differences like this can cause large output changes when a model
contains nonlinearities:

```python
import numpy as np

def foo(w, x, y):
    return (sum(w) > x) * y

# returns either 0 or 3.14e10, depending on chosen reduction bracketing.
foo(np.array([0.1, 0.2, 0.3]), 0.6, 3.14e10)
```
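
To make the dependence on bracketing explicit, the sketch below fixes the two reduction orders by hand and feeds each total through the same threshold. The helpers sum_left, sum_right and foo_with_total are made up for illustration and use plain Python floats instead of NumPy.

```python
from functools import reduce

def sum_left(ws):
    """((w0 + w1) + w2) + ...: fold from the left."""
    return reduce(lambda acc, w: acc + w, ws)

def sum_right(ws):
    """w0 + (w1 + (w2 + ...)): fold from the right."""
    return reduce(lambda acc, w: w + acc, reversed(ws))

def foo_with_total(total, x, y):
    # Same nonlinearity as foo above, with the reduction already performed
    return (total > x) * y

w = [0.1, 0.2, 0.3]
print(foo_with_total(sum_left(w), 0.6, 3.14e10))
print(foo_with_total(sum_right(w), 0.6, 3.14e10))
```

The left fold reproduces (0.1 + 0.2) + 0.3 and passes the threshold, while the right fold reproduces 0.1 + (0.2 + 0.3) and does not, so a one-ULP difference in the sum flips the output between 3.14e10 and 0.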
This example is a little contrived, but the problem does</em> occur in real
models. For example, in
Nondeterminism and instability in neural network optimization</a>
the authors claim:</p>

(..) that even one-bit changes in initial parameters result in models
converging to vastly different values.</p>
</blockquote>
So how do we solve this?</p>
Solution 1: Say what you mean</h1>
The essence of the problem is that a model is not a single function,
but instead a set</em> of functions considered equal
up to associativity of floating point arithmetic</em>.
For example, the two "bracketings" x + (y + z)</code> and (x + y) + z</code> represent
two different functions (because addition is not associative).
But when we write sum([x, y, z])</code>, it is ambiguous which function is meant!</p>
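To make the ambiguity concrete, here is a minimal Python sketch of two bracketings of the same sum. The helper names `sum_left` and `sum_pairwise` are made up for illustration and are not part of catgrad. Both are legitimate readings of `sum([x, y, z])`, yet they disagree in the last bit:

```python
from functools import reduce


def sum_left(xs):
    """Strict left-to-right bracketing: ((x0 + x1) + x2) + ..."""
    return reduce(lambda a, b: a + b, xs)


def sum_pairwise(xs):
    """Balanced-tree bracketing: split the list in half and add the two halves."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])


w = [0.1, 0.2, 0.3]
left = sum_left(w)      # (0.1 + 0.2) + 0.3
pair = sum_pairwise(w)  # 0.1 + (0.2 + 0.3)

print(left)          # 0.6000000000000001
print(pair)          # 0.6
print(left == pair)  # False
```

Each function on its own is perfectly deterministic; the nondeterminism only appears when we do not know which of the two was run.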
In fact, this ambiguity is useful</em>: if we avoid specifying a bracketing, our compiler can
choose the function which will perform best on the target hardware,
typically by using specialised instructions.</p>
The first solution to nondeterminism is then to simply not throw away the
information</em> about which bracketing was chosen by the compiler.
Concretely:</p>

Choose a model (m), representing a set of functions equal up to FP non-associativity</li>
The compiler chooses a (deterministic) function (f ∈ m) optimized for the chosen hardware</li>
The user runs (y = f(x)) on the untrusted compute provider</li>
The result (y) is now deterministic and verifiable given inputs (x).</li>
</ol>
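The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `f` stands in for the compiler-chosen function (here a hypothetical strict left-to-right sum), and `verify` is a made-up helper that re-runs `f` and compares raw bits, not an API of catgrad or of any provider.

```python
import struct


def f(xs):
    """Hypothetical compiler-chosen bracketing: strict left-to-right accumulation."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def bits(y):
    """Raw IEEE 754 binary64 encoding, so the comparison is bitwise, not approximate."""
    return struct.pack("<d", y)


def verify(f, x, claimed):
    """Re-run the pinned function and accept only a bit-identical result."""
    return bits(f(x)) == bits(claimed)


x = [0.1, 0.2, 0.3]
honest = f(x)              # what a provider running the pinned f returns
other = 0.1 + (0.2 + 0.3)  # a different bracketing of the same model

print(verify(f, x, honest))  # True
print(verify(f, x, other))   # False
```

Once the bracketing is pinned, any deviation is detectable, whether it comes from a different reduction order or from deliberate tampering.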
Incidentally, supporting this flow is a major design goal of catgrad</a>.</p>
Note that there are still a couple of practical challenges:</p>

Intrinsics like exp</code> can vary across devices (even the same device can have
multiple versions; see CUDA's exp</code> vs
__expf</code></a>).</li>
Interleaving compute with other requests for efficiency requires batch
invariance</em>; read more about this latter point in the Thinking Machines
post</a>.</li>
</ul>
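One way to sidestep varying intrinsics is to build the function from basic operations only, since IEEE 754 requires `+`, `-`, `*` and `/` to be correctly rounded. The sketch below is an assumption-laden illustration, not how catgrad or CUDA implement `exp`: `det_exp` is a hypothetical name, the degree-13 Taylor polynomial is a simple choice rather than a tuned one, and the result is neither correctly rounded nor safe for very large inputs. What it does have is a fixed sequence of operations.

```python
import math

# Decimal literal for ln(2); using a literal avoids calling a platform log().
LN2 = 0.6931471805599453


def det_exp(x):
    """Approximate exp(x) using a fixed order of +, -, *, / and a power-of-two scale."""
    k = round(x / LN2)   # range reduction: x = k*ln2 + r
    r = x - k * LN2      # r lies roughly in [-ln2/2, ln2/2]
    # Horner evaluation of the degree-13 Taylor polynomial of exp(r).
    p = 1.0
    for n in range(13, 0, -1):
        p = 1.0 + r * p / n
    return math.ldexp(p, k)  # scale by 2**k


for x in (-20.0, -1.5, 0.0, 0.3, 1.0, 10.0):
    print(x, det_exp(x), math.exp(x))
```

The printed values should agree with `math.exp` to many digits, but the point is reproducibility rather than accuracy: the same operations in the same order give the same bits wherever the basic operations follow IEEE 754 and the compiler does not reorder or fuse them.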
Solution 2: Avoid floating-point</h1>
The second solution is simple to state but harder to do:
don't use floating point arithmetic in your model</strong>.
If we want to verify AI tensor compute in general, this means floats have to be
eliminated at both inference and</em> train time.</p>
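As a small illustration of why this helps, here is a sketch of a dot product between two vectors of ±1 values done entirely with integers. The helpers `pack` and `binary_dot` are hypothetical names for this post, not taken from any of the papers below. Because integer addition is associative, every reduction order gives the same answer.

```python
import random


def pack(signs):
    """Encode a list of +1/-1 values as an integer bitmask (bit i set means +1)."""
    return sum(1 << i for i, s in enumerate(signs) if s == 1)


def binary_dot(a_bits, b_bits, n):
    """Dot product of two +/-1 vectors of length n using XOR and a popcount."""
    disagreements = bin(a_bits ^ b_bits).count("1")
    # agreements contribute +1, disagreements contribute -1
    return n - 2 * disagreements


random.seed(0)
n = 64
a = [random.choice((-1, 1)) for _ in range(n)]
b = [random.choice((-1, 1)) for _ in range(n)]

reference = sum(x * y for x, y in zip(a, b))                    # left-to-right
reversed_order = sum(x * y for x, y in zip(a[::-1], b[::-1]))   # right-to-left
fast = binary_dot(pack(a), pack(b), n)

print(reference == reversed_order == fast)  # True
```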
There are several approaches in this direction
but to my knowledge, none achieves the gold standard of completely
eliminating floating point at inference and</em> train time (including in the
optimizer) while also providing convergence guarantees.</p>
Some approaches which get close are below:</p>

BitNet</a> replaces floating-point weights in linear layers with 1-bit weights, but does not fully eliminate the use of floating point weights at inference time.</li>
1-bit LLMs</a> binarize all</em> weights, but still require floats at train time.</li>
RDA</a> defines a backprop procedure for training boolean circuits entirely without floats, but provides no convergence guarantees.</li>
BOLD</a> proposes a method for directly training models with boolean weights and provides a convergence proof, but requires floating point values at train time.</li>
</ul>
... and here's how they fare at eliminating floats where ✅= no floats, ❓= floats in some layers, and ❌= floats required:</p>
Paper</th> Inference</th> Training</th> Optimizer</th> Convergence</th></tr></thead>

BitNet</a></td> ❓</td> ❌</td> ❌</td> ✅ (empirical)</td></tr>
1-bit LLMs</a></td> ✅</td> ❌</td> ❌</td> ✅ (empirical)</td></tr>
RDA</a></td> ✅</td> ✅</td> ✅</td> ❌</td></tr>
BOLD</a></td> ✅</td> ❌</td> ❌</td> ✅</td></tr>
</tbody></table>
My opinion (and that of others</a>) is that this
research direction is still underexplored: not only should we seek to replace
the underlying arithmetic upon which models are built, but also reconsider the
architectures and optimization procedures used.</p>
My own contribution in this direction is a paper to appear at OPT 2025 at NeurIPS
(blog post and arxiv link to follow)
in which we show that convergence guarantees can be obtained even when parameter updates are fully discrete.
We give an example based on multinomial sampling which still requires full precision gradients,
but in principle our method allows for fully discrete inference, training and
optimization with convergence guarantees as long as some standard assumptions are
satisfied.</p>
Paper</th> Inference</th> Training</th> Optimizer</th> Convergence</th></tr></thead>

Multinomial</td> ✅ None</td> ❌ Yes</td> ✅ Yes</td> ✅ Yes</td></tr>
</tbody></table>
At Hellas</a> we continue to investigate this direction,
in particular towards finding an optimizer which satisfies the "gold standard".</p>
Solution 3: Verify a task</em>, not compute</h1>
Another solution is to embrace model nondeterminism and verify tasks</em> instead.
This does not work for every task; only those which are mechanically verifiable</em>.
Nevertheless, there are several interesting/useful examples:</p>

Mathematical: e.g., prove a theorem</li>
Algorithmic/search: e.g., find a negative-weight cycle in this graph of exchange rates</li>
Software engineering: e.g., write a function which passes these unit tests</li>
Cryptographic: e.g., obtain a signature for message (m) from the owner of public key (k)</li>
</ul>
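To show what "mechanically verifiable" can look like, here is a minimal sketch of a checker for the negative-weight cycle task. The function name, the edge-dictionary format and the use of exact `Fraction` weights are assumptions made for this illustration. The checker never needs to know how the cycle was found: it only confirms that every edge exists and that the weights sum to less than zero.

```python
from fractions import Fraction


def is_negative_cycle(weights, cycle):
    """Check a claimed cycle against a graph.

    weights: dict mapping (u, v) edges to exact Fraction weights.
    cycle:   list of nodes; the last node is joined back to the first.
    """
    if not cycle:
        return False
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))
    if any(edge not in weights for edge in edges):
        return False  # the claimed cycle uses an edge the graph does not have
    return sum(weights[edge] for edge in edges) < 0


graph = {
    ("a", "b"): Fraction(1),
    ("b", "c"): Fraction(-3),
    ("c", "a"): Fraction(1),
}

print(is_negative_cycle(graph, ["a", "b", "c"]))  # True
print(is_negative_cycle(graph, ["a", "b"]))       # False
```

Using exact rationals keeps the check itself free of floating point, so two verifiers can never disagree about the verdict.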
This last point hints at a more "agentic" ecosystem: imagine, for example, that
(k) is the public key identifier of an online shop, and the message (m) is a
proof that a user purchased an item.
This would allow an agent to show that the task has been completed "up to
trust of real-world delivery".</p>
The advantage of direct task verification is that it also unlocks "intelligence markets":
you no longer have to care about which model was run (or even if it was a model at all!).
Instead, you can directly verify that the goal you wanted was achieved.
This means that instead of having compute providers</em> compete on cost to run a particular model,
we can have intelligence providers</em> which compete on solutions to tasks</em>:
a true, efficient market for cognitive work.</p>
Conclusion</h1>
To sum up, we talked about three ways to offload cognitive work while still trusting the results:</p>

Make models deterministic again</li>
Throw away your floating point</li>
Verify tasks directly</li>
</ol>
We're continuing to work on all three of these, so if you find this
interesting, I'd love to hear from you on the Hellas
discord</a>!</p>


Visualising LLMs with Open Hypergraphs and Catgrad
2025-06-10T00:00:00+00:00
Let's visualise some LLM architectures!
In this blog post, I'll show you how to generate these diagrams
using
open hypergraphs</a>
and
catgrad</a>.
We'll produce three examples, including the attention diagram above and a
huge SVG of every single op in the Qwen architecture.</p>
Open Hypergraphs</h2>
Let's start with
a simple example -- a residual connection around a linear layer.
In catgrad, layers and architectures are represented as Open Hypergraphs</em>:
a datastructure for representing "circuit-like" syntax.
Here's the code defining our layer:</p>
pub fn residual(x: Var<NdArrayType, Operation>) -> Var<NdArrayType, Operation> {
    linear_layer("linear", x.clone()) + x
}</code></pre>
Under the hood, this constructs an open hypergraph
which we can visualise with the
open-hypergraphs-dot</a> library:</p>
</p>
Let's break this down:</p>

Nodes are depicted as black circles ● labeled with an array shape</em> (e.g., [8, 8]</code>).</li>
Hyperedges are depicted as boxes with multiple inputs and outputs. They correspond to operations</em> like MatrixMultiply</code>.</li>
Some nodes are designated as inputs and outputs: these are depicted as dashed, open-ended lines.</li>
</ol>
Point (3) is why these are open</strong> hypergraphs, and not just hypergraphs.</p>
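For readers who prefer code to pictures, the three points above can be sketched as a tiny Python data structure. This is a hypothetical illustration only: the class names, fields and the explicit weight node `w` are assumptions made here, and the real `open-hypergraphs` and catgrad types are Rust and differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class Hyperedge:
    op: str        # operation label, e.g. "MatrixMultiply"
    sources: list  # indices of input nodes
    targets: list  # indices of output nodes


@dataclass
class OpenHypergraph:
    nodes: list = field(default_factory=list)    # node labels (array shapes)
    edges: list = field(default_factory=list)    # hyperedges (operations)
    inputs: list = field(default_factory=list)   # nodes exposed as inputs
    outputs: list = field(default_factory=list)  # nodes exposed as outputs

    def add_node(self, shape):
        self.nodes.append(tuple(shape))
        return len(self.nodes) - 1

    def add_op(self, op, sources, out_shape):
        target = self.add_node(out_shape)
        self.edges.append(Hyperedge(op, list(sources), [target]))
        return target


def residual():
    g = OpenHypergraph()
    x = g.add_node((8, 8))
    w = g.add_node((8, 8))  # assumed weight of the linear layer
    g.inputs = [x, w]
    h = g.add_op("MatrixMultiply", [x, w], (8, 8))
    y = g.add_op("Add", [h, x], (8, 8))
    g.outputs = [y]
    return g


g = residual()
# Node 0 is x; it feeds two hyperedges, which is the explicit copy.
uses_of_x = sum(edge.sources.count(0) for edge in g.edges)
print(uses_of_x)  # 2
```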
Importantly, copying is explicit in the hypergraph structure</strong>.
See the multiple outgoing edges of the top-right node ● which encode the reuse of
the x</code> variable inside the residual</code> function.
Aside from being a useful way to visualise variable sharing, representing
copying explicitly is important to how our
ahead-of-time autodiff algorithm</a> works,
enabling decentralised training</em> on the Hellas Network.1</a></sup></p>
Attention Please</h2>
Now let's do a complete, self-contained example: an Attention layer. Here's the code:</p>
use catgrad::core::nn::layers::*;
use catgrad::core::{Dtype, NdArrayType, Operation, Shape};
use open_hypergraphs::lax::{OpenHypergraph, functor::*, var, var::Var};

use std::cell::RefCell;
use std::rc::Rc;

// 1. Create an OpenHypergraph for Gemma's attention layer,
// 2. Turn explicit copy operations into *nodes* in the hypergraph
// 3. Save as an SVG.
pub fn main() -> std::io::Result<()> {
    let arrow = attention_arrow();
    let arrow = var::forget::Forget.map_arrow(&arrow);
    save_svg(&arrow, "images/attention.svg")
}

pub fn attention(
    builder: &Rc<RefCell<OpenHypergraph<NdArrayType, Operation>>>,
    dim: usize,
    name: &str,
    x: Var<NdArrayType, Operation>,
) -> Var<NdArrayType, Operation> {
    let num_heads = 4;
    let head_dim = dim / num_heads;
    let b = x.label.shape.0[0];
    let s = x.label.shape.0[1];

    let k = linear(builder, dim, dim, &format!("{name}.key"), x.clone());
    let q = linear(builder, dim, dim, &format!("{name}.query"), x.clone());
    let v = linear(builder, dim, dim, &format!("{name}.value"), x);

    let q = reshape(builder, Shape(vec![b, s, num_heads, head_dim]), q);
    let k = reshape(builder, Shape(vec![b, s, num_heads, head_dim]), k);
    let v = reshape(builder, Shape(vec![b, s, num_heads, head_dim]), v);

    let q = transpose(builder, 1, 2, q);
    let k = transpose(builder, 1, 2, k);
    let v = transpose(builder, 1, 2, v);

    let tk = transpose(builder, 2, 3, k);
    let attn = mat_mul(builder, q, tk);
    let denom = constant(builder, attn.label.clone(), f32::sqrt(head_dim as f32));
    let attn = attn / denom;
    let attn = softmax(builder, attn);
    let attn = mat_mul(builder, attn, v);
    let x = transpose(builder, 1, 2, attn);
    let x = reshape(builder, Shape(vec![b, s, dim]), x);
    linear(builder, dim, dim, &format!("{name}.proj"), x)
}

// Build the open hypergraph by creating a Var and calling the attention function
fn attention_arrow() -> OpenHypergraph<NdArrayType, Operation> {
    let dim = 8;
    let name = "attention";
    var::build(|state| {
        let x = Var::new(
            state.clone(),
            NdArrayType::new(Shape(vec![1, 1, 8]), Dtype::F32),
        );
        let y = attention(&state, dim, name, x.clone());
        (vec![x], vec![y])
    })
    .unwrap()
}

use graphviz_rust::cmd::{CommandArg, Format};

// Render an OpenHypergraph to an SVG using `open-hypergraphs-dot`
fn save_svg(arrow: &OpenHypergraph<NdArrayType, Operation>, filename: &str) -> std::io::Result<()> {
    let dot_graph = open_hypergraphs_dot::generate_dot(arrow);
    let svg_bytes = graphviz_rust::exec(
        dot_graph,
        &mut graphviz_rust::printer::PrinterContext::default(),
        vec![CommandArg::Format(Format::Svg)],
    )?;
    std::fs::write(filename, svg_bytes)?;

    Ok(())
}</code></pre>
This produces the following diagram:</p>
</p>
See here</a> for a github repo with both examples.</p>
Bonus: Qwen3-0.6B</h2>
To conclude, you can download the whole
Qwen3-0.6B</a> architecture as a diagram
here</a>.
I haven't included it on this page because it's 7MB!</p>
The qwen code is more complex, spread across multiple functions in catgrad</a>,
so to reproduce this diagram, see the code on this branch</a>,
and run:</p>
cargo run --release --example llm -- -m Qwen/Qwen3-0.6B -p "Catgrad is " -s 10 --model-svg qwen.svg</code></pre>
Finally, I want to emphasize that open hypergraphs are a general</em>
datastructure for syntax, not just for neural networks and catgrad.
Any kind of "circuit-like" term is a natural fit: from actual circuits to the
kind of boxes-and-wires visual languages used in game engines.</p>
If you have an idea for how you could use open hypergraphs and you want some help with the library,
hop in our discord</a> and let us know!</p>
^1</sup>For further reading about open hypergraphs for (differentiable) syntax,
you should know that the diagrams produced here are called
string diagrams</a>,
a formal graphical syntax for
Symmetric Monoidal Categories</a> (SMCs).
"Open Hypergraphs" (aka cospans of hypergraphs) formally</em> correspond to arrows
of SMCs, and our autodiff algorithm</a>
is built on this correspondence to implement Reverse
Derivatives</a> for AoT autodiff.</p>
</div>


Announcing Hellas Gate
2025-05-19T00:00:00+00:00
Today we're launching Hellas Gate</a>.
Right now, it's an LLM gateway similar to
OpenRouter</a>,
LiteLLM</a>, and others, allowing you
to access models from many different providers through a central API.</p>
But this is table stakes.</p>
We're aiming for something a little different: empowering individual devs, not
enterprises.
Here's a quick taste of our roadmap, and how we're planning to do that.</p>
Local Compute</h1>
First up, we're making your local LLM accessible from anywhere.
Think "tailscale for your home GPU".</p>
You will be able to:</p>

Access your own vLLM/Ollama API from anywhere</li>
Pool shared compute resources with friends</li>
Inspect, debug, and modify the prompts and chats made by your local tools</li>
</ul>
... and more to come.</p>
Virtual Models and Smart Routing</h1>
Next up: virtual models and smart routing.
Our aim here is to save you money, and level up your tools so they use the best
models for any given task.</p>
Writing some SQL? Use a cost-effective SQL fine-tune.</p>
Paying Anthropic $10 for RAG with claude-cli?
Use your local Qwen instance instead.</p>
Here's how it works.</p>
Model Aliases</h2>

Create a model alias like myusername/coding-rag</code>.</li>
Configure routing for this alias in the Hellas dashboard. For example, have it always use QwQ-32B</code> on your local GPU.</li>
Configure your local tools to use the model myusername/coding-rag</code> for RAG</li>
</ul>
Want to change things later?
It's managed in one central place: edit your alias in the dashboard.
No fiddling with all your different tool configs!</p>
Smart Routing</h2>
Model aliases aren't just for picking one model.
You can set rules to route to different models dynamically based on criteria.
For example, let's say we want to use DeepSeek, but only during off-peak times, when the cost is lower.</p>
Achieve this by configuring the myusername/coding-rag</code> alias with price limits</em>.</p>
Other filters and options in smart routing:</p>

Geography (for data privacy)</li>
Providers (e.g., "OpenAI models only")</li>
Attributes (e.g., "Best model for coding")</li>
</ul>
... and more to come.</p>
Conclusion</h1>
We're constantly improving Gate.
If you have questions, feedback, or just want to hang out, come talk to us on
discord</a>.</p>


Hello, World!
2025-05-01T00:00:00+00:00
Hello, World!</p>
Welcome to the blog for Hellas: a decentralised network for AI.</p>
We're building Hellas to guarantee an open-source future where the power of
artificial intelligence concentrates in the hands of individuals, and not</em> in
a few big companies.</p>
Keep an eye on the blog for research, product updates, and community announcements,
and make sure to join our discord</a>!</p>

References</h3>
Nvidia article on floating point</a></p>
Correctly Rounded Evaluation of a Function: Why, How, and at What Cost?</a></p>
Elementary Functions: Algorithms and Implementation</a>, a book by Jean-Michel Muller</p>
