Three Solutions to Nondeterminism in AI
I want to convince you that verification is necessary for you to safely offload cognitive work onto AI while still trusting the results. Then I want to show you three ways to do this:
- Deal with floating-point nonassociativity
- Remove floating-point calculations from models entirely
- Verify tasks directly
I'll frame the discussion in terms of LLMs, but it applies to a much broader class of models.
(Side note: You may have seen a recent blog post by Horace He at Thinking Machines which discusses how non-determinism in tensor compute "turns on-policy RL into off-policy RL". It's a great blog post which you should read, but consequences for RL are not what I want to talk about here.)
The Problem: Closed Source is not Verifiable
Assume you have no access to model weights, code, or inputs (e.g., the system prompt). This is the state of affairs when you use a closed model provider like OpenAI or Anthropic. In this case, the model provider is able to undetectably behave in ways which are not aligned with your interests, e.g.:
- Routing to lower-quality models at peak usage times
- Quantizing or degrading model quality to save money
- Injecting hidden messages into your context
- Censoring tokens, words, or full requests and responses outright
- Inserting overt or covert advertisements into responses
This is really just the AI flavour of enshittification, where closed-source software vendors extract value from you by degrading your experience.
... but even open code and open weights are not enough
Now suppose you have an open model where code, weights, and inputs are all known to you, and that you trust it to behave roughly in line with your interests. Assume that you don't have the hardware to run this model, and instead you ask a third-party compute provider to run it for you. In principle, you could then verify your model's outputs as genuine and detect any naughty behaviour.
However, nondeterminism means this is still not possible in practice. Simply put, nondeterminism means model outputs are not uniquely determined by inputs, so you cannot distinguish between:
- Honest behaviour (A genuine model output which is different to the one you calculate)
- Dishonest behaviour (A faked model output which has been manipulated on purpose)
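To see why you cannot simply re-run the computation and compare results bit-for-bit, here is a minimal NumPy sketch (the setup and names are my own, purely illustrative) in which a provider and a verifier both compute the same reduction honestly, but in different orders:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000).astype(np.float32)

# "Provider": library-chosen reduction order (NumPy typically uses pairwise summation here).
provider_out = np.sum(w)

# "Verifier": strict left-to-right accumulation in float32.
verifier_out = np.float32(0.0)
for v in w:
    verifier_out += v

# Both parties were honest, yet the results typically differ in the low bits,
# so a bitwise equality check cannot distinguish honesty from tampering.
print(provider_out == verifier_out)
print(float(provider_out), float(verifier_out))
```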
Model nondeterminism fundamentally arises from the non-associativity of floating point arithmetic: the "Original Sin", as Horace He puts it. A simple example is the sum below:
```
>>> 0.1 + (0.2 + 0.3)
0.6
>>> (0.1 + 0.2) + 0.3
0.6000000000000001
```
Even small differences like this can cause large output changes when a model contains nonlinearities:
```python
import numpy as np

def foo(w, x, y):
    return (sum(w) > x) * y

# returns either 0 or 3.14e10, depending on the chosen reduction bracketing
foo(np.array([0.1, 0.2, 0.3]), 0.6, 3.14e10)
```
This example is a little contrived, but the problem does occur in real models. For example, in "Nondeterminism and instability in neural network optimization", the authors claim:
(...) that even one-bit changes in initial parameters result in models converging to vastly different values.
So how do we solve this?
Solution 1: Say what you mean
The essence of the problem is that a model is not a single function, but instead a set of functions considered equal up to associativity of floating point arithmetic. For example, the two "bracketings" `x + (y + z)` and `(x + y) + z` represent two different functions (because addition is not associative). But when we write `sum([x, y, z])`, it is ambiguous which function is meant!
In fact, this ambiguity is useful: if we avoid specifying a bracketing, the compiler can choose whichever function performs best on the target hardware, typically by using specialised instructions.
The first solution to nondeterminism is then to simply not throw away the information about which bracketing was chosen by the compiler. Concretely:
- Choose a model $m$, representing a set of functions equal up to FP nonassociativity
- Compiler chooses a (deterministic) function $f \in m$ optimized for chosen hardware
- The user runs $y = f(x)$ on the untrusted compute provider
- The result $y$ is now deterministic and verifiable given inputs $x$.
Incidentally, supporting this flow is a major design goal of catgrad.
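To make the flow concrete, here is a minimal sketch (the names `f` and `tag` are my own illustrative choices, not necessarily catgrad's API): once the bracketing is pinned down, the provider's output is a deterministic function of the input and can be checked bit-for-bit, e.g. by hashing it.

```python
import hashlib
import numpy as np

def f(x):
    """One fixed representative of the model: a strict left-to-right float32 sum.
    Because the bracketing is pinned down, f(x) is uniquely determined by x."""
    acc = np.float32(0.0)
    for v in x:
        acc += np.float32(v)
    return acc

def tag(y):
    """Commitment to the exact output bytes; any honest re-run must reproduce it."""
    return hashlib.sha256(np.float32(y).tobytes()).hexdigest()

x = [0.1, 0.2, 0.3]
y_provider = f(x)                        # computed by the untrusted provider
y_check = f(x)                           # recomputed (or spot-checked) by the user
assert tag(y_provider) == tag(y_check)   # now a meaningful verification step
```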
Note that there are still a couple of practical challenges:
- Intrinsics like `exp` can vary across devices (even the same device can have multiple versions; see CUDA's `exp` vs `__expf`).
- Interleaving compute with other requests for efficiency requires batch invariance; read more about this latter point in the Thinking Machines post.
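On the second point, batch invariance is at least straightforward to test for a given kernel. Here is a minimal sketch (the checker `is_batch_invariant` is my own, illustrative only): run the same row alone and inside a larger batch, and demand bit-identical results.

```python
import numpy as np

def is_batch_invariant(f, x, filler):
    """Bit-exact check: does f give the same result for row x whether it is
    processed alone or batched together with unrelated rows?"""
    alone = f(x[None, :])[0]
    batched = f(np.concatenate([x[None, :], filler], axis=0))[0]
    return np.array_equal(alone, batched)

def rowwise_softmax(batch):
    # A purely row-wise kernel; its reductions never cross the batch dimension.
    z = batch - batch.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal(8).astype(np.float32)
filler = rng.standard_normal((4, 8)).astype(np.float32)
print(is_batch_invariant(rowwise_softmax, x, filler))  # True for this kernel
```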
Solution 2: Avoid floating-point
The second solution is simple to state but harder to do: don't use floating point arithmetic in your model. If we want to verify AI tensor compute in general, this means floats have to be eliminated at both inference and train time.
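As a toy illustration of what float-free inference can look like (this sketch is mine, not taken from any of the papers below): a linear layer whose weights are just signs and whose arithmetic is entirely integer is exact and bit-reproducible on any hardware.

```python
import numpy as np

def binary_linear(x_int8, w_sign, bias_int32):
    """y = Wx + b with W in {-1, +1} and all arithmetic in integers.
    Integer dot products are exact, so there is no bracketing ambiguity."""
    acc = w_sign.astype(np.int32) @ x_int8.astype(np.int32)
    return acc + bias_int32

x = np.array([3, -7, 2, 5], dtype=np.int8)
w = np.array([[1, -1, 1, 1],
              [-1, 1, 1, -1]], dtype=np.int8)   # sign weights only
b = np.array([0, 10], dtype=np.int32)
print(binary_linear(x, w, b))                   # deterministic integer outputs
```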
There are several approaches in this direction, but to my knowledge none achieves the gold standard of completely eliminating floating point at inference and train time (including in the optimizer) while also providing convergence guarantees.
Some approaches which get close are below:
- BitNet replaces floating-point weights in linear layers with 1-bit weights, but does not fully eliminate the use of floating point weights at inference time.
- 1-bit LLMs binarize all weights, but still require floats at train time
- RDA defines a backprop procedure for training boolean circuits entirely without floats, but provides no convergence guarantees.
- BOLD proposes a method for directly training models with boolean weights and provides a convergence proof, but requires floating point values at train time.
... and here's how they fare at eliminating floats, where ✅ = no floats, ❓ = floats in some layers, and ❌ = floats required:

| Paper | Inference | Training | Optimizer | Convergence |
|---|---|---|---|---|
| BitNet | ❓ | ❌ | ❌ | ✅ (empirical) |
| 1-bit LLMs | ✅ | ❌ | ❌ | ✅ (empirical) |
| RDA | ✅ | ✅ | ✅ | ❌ |
| BOLD | ✅ | ❌ | ❌ | ✅ |
My (and others') opinion is that this research direction is still underexplored: we should not only seek to replace the underlying arithmetic upon which models are built, but also reconsider the architectures and optimization procedures we use.
My own contribution in this direction is a paper to appear at OPT 2025 at NeurIPS (blog post and arxiv link to follow) in which we show that convergence guarantees can be obtained even when parameter updates are fully discrete. We give an example based on multinomial sampling which still requires full precision gradients, but in principle our method allows for fully discrete inference, training and optimization with convergence guarantees as long as some standard assumptions are satisfied.
| Paper | Inference | Training | Optimizer | Convergence |
|---|---|---|---|---|
| Multinomial | ✅ None | ❌ Yes | ✅ Yes | ✅ Yes |
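The details are in the paper, but to give a flavour of what a fully discrete parameter update can look like, here is a deliberately simplified caricature (not the paper's actual algorithm or its guarantees): a full-precision gradient defines a multinomial distribution over coordinates, and the update samples which integer-valued parameters to move by a fixed step.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_update(theta, grad, step=1, n_moves=1):
    """Caricature only: sample coordinates with probability proportional to |grad|
    and move each by a fixed integer step against the gradient's sign."""
    p = np.abs(grad) / np.abs(grad).sum()
    idx = rng.choice(len(theta), size=n_moves, p=p)
    theta = theta.copy()
    theta[idx] -= step * np.sign(grad[idx]).astype(theta.dtype)
    return theta

theta = np.zeros(5, dtype=np.int64)            # fully discrete parameters
grad = np.array([0.3, -2.0, 0.1, 0.05, 1.1])   # full-precision gradient, as in the text
print(discrete_update(theta, grad))
```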
At Hellas we continue to investigate this direction, in particular towards finding an optimizer which satisfies the "gold standard".
Solution 3: Verify a task, not compute
Another solution is to embrace model nondeterminism and verify tasks directly instead. This does not work for every task; only those which are mechanically verifiable. Nevertheless, there are several interesting/useful examples (verification sketches for two of them follow below):
- Mathematical: e.g., prove a theorem
- Algorithmic/search: e.g., find a negative-weight cycle in this graph of exchange rates
- Software engineering: e.g. write a function which passes these unit tests
- Cryptographic: e.g., obtain a signature for message $m$ from the owner of public key $k$
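Taking the algorithmic example above: checking a claimed answer is mechanical and cheap, regardless of how (or by what) it was produced. A minimal sketch (the checker `verify_arbitrage_cycle` and the toy rates are made up for illustration):

```python
import math

def verify_arbitrage_cycle(rates, cycle):
    """Accept a claimed cycle only if the product of exchange rates around it
    exceeds 1, i.e. its total log-weight is negative."""
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))
    if any(e not in rates for e in edges):
        return False
    return sum(-math.log(rates[e]) for e in edges) < 0

rates = {("USD", "EUR"): 0.9, ("EUR", "GBP"): 0.9, ("GBP", "USD"): 1.3}
print(verify_arbitrage_cycle(rates, ["USD", "EUR", "GBP"]))  # True: 0.9 * 0.9 * 1.3 > 1
```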
This last point hints at a more "agentic" ecosystem: imagine, for example, that $k$ is the public key identifier of an online shop and the message $m$ is a proof that a user purchased an item. This would allow an agent to show that the task has been completed "up to trust of real-world delivery".
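For the cryptographic example, verification is a single signature check against the shop's public key $k$. A sketch assuming the pyca/cryptography package (the key setup and message format here are hypothetical):

```python
from cryptography.hazmat.primitives.asymmetric import ed25519

# The shop holds the private key; its public key k is known to everyone.
shop_private = ed25519.Ed25519PrivateKey.generate()
k = shop_private.public_key()

m = b"receipt: user alice purchased item #42"   # hypothetical message format
signature = shop_private.sign(m)                # produced while completing the task

# Anyone can verify the agent's output against k; raises InvalidSignature on failure.
k.verify(signature, m)
print("signature checks out")
```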
The advantage of direct task verification is that it also unlocks "intelligence markets": you no longer have to care about which model was run (or even whether it was a model at all!). Instead, you can directly verify that the result achieves the goal you wanted. This means that instead of compute providers competing on the cost of running a particular model, we can have intelligence providers which compete on solutions to tasks: a true, efficient market for cognitive work.
Conclusion
To sum up, we talked about three ways to offload cognitive work while still trusting the results:
- Make models deterministic again
- Throw away your floating point
- Verify tasks directly
We're continuing to work on all three of these, so if you find this interesting, I'd love to hear from you on the Hellas discord!