The second law of thermodynamics, roughly speaking, implies that the Gibbs free energy $G = H - TS$ of a system held at constant temperature and pressure cannot increase, where $H$ is the enthalpy, $T$ the temperature, and $S$ the entropy. This gives a few useful ideas: for example, lowering the entropy of the system increases the $-TS$ term, and thus requires inputting some energy.
Classically, we define $S$ as $k_B$ times the information-theoretic entropy of the distribution over microstates (or, in the case where the energies of all the microstates are the same, $k_B$ times the log of the number of microstates), where $k_B$ is Boltzmann's constant. However, if we try to think of this entropy just as literal bits of information, then weird things start to happen. For instance, one can think of performing computation as decreasing uncertainty.
Consider a logic gate computing $y := f(x)$, where $f$ is an arbitrary binary function. Before applying $f$, the value of $y$ can be either 0 or 1, but after applying $f$ it collapses to one value. Naively, we can therefore model the cost of each transistor operation as erasing one bit of entropy, which should require at least $k_B T \ln 2$ of energy.
A natural question to ask now is: Suppose the true physical limit for a single transistor operation is $k_B T \ln 2 \approx 2.9 \times 10^{-21}$ J at room temperature. Currently, how much energy do these operations actually cost? Conversely, is there an energy lower bound on running, say, GPT 5? How far are we from this limit?
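As a quick sanity check, here is that number as a two-line computation (a sketch in Python; the only assumption is room temperature, $T = 300$ K):

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant, J/K
T = 300.0           # assumed room temperature, K

# Landauer limit: minimum energy to erase one bit of information.
landauer_j = K_B * T * math.log(2)
print(f"Landauer limit at {T:.0f} K: {landauer_j:.3e} J per bit")
# -> ~2.9e-21 J
```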
One naive thing to do is to estimate the theoretical minimum cost of running a single H100 GPU. The H100 contains around 80 billion transistors, and (for the PCIe version) runs at around 1.7 GHz. It uses 350 W of power (and thus 350 J per second), and operates at a temperature on the order of 300 K. Thus, the energy required for its operations should be at least about $8 \times 10^{10} \times 1.7 \times 10^{9} \times k_B T \ln 2 \approx 0.4$ J per second.
So we're only about 3 orders of magnitude off. Not bad!
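Here is the same estimate as a script, so the figures are easy to tweak (a sketch using the approximate H100 PCIe numbers quoted above, and the generous assumption that every transistor switches once per clock cycle):

```python
import math

K_B = 1.380649e-23      # Boltzmann's constant, J/K
T = 300.0               # assumed operating temperature, K
LANDAUER = K_B * T * math.log(2)   # ~2.9e-21 J per bit erased

N_TRANSISTORS = 80e9    # approximate H100 transistor count
CLOCK_HZ = 1.7e9        # approximate H100 PCIe boost clock
POWER_W = 350.0         # H100 PCIe TDP

# Naive assumption: every transistor switches (erases one bit) each cycle.
switches_per_s = N_TRANSISTORS * CLOCK_HZ
floor_w = switches_per_s * LANDAUER

print(f"Landauer floor: {floor_w:.2f} W")           # ~0.4 W
print(f"Gap to actual:  {POWER_W / floor_w:.0f}x")  # ~900x, i.e. ~3 orders of magnitude
```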
Now, this analysis has two issues:
- First, in practice, not all transistors fire every clock cycle, so the H100 is in fact more than 3 orders of magnitude away from this bound.
- More generally, returning to the "what is the minimum energy required to inference GPT 5.4 (for some fixed choice of batch size, context length, etc.)?" problem, simply thinking about things in terms of transistors ignores, for example, potential algorithmic improvements.
To illustrate the relationship between these issues, we can consider the energy required to perform a single FLOP.
For a given operation x := y * z (on, say, 16-bit floats), in principle we are only lower bounded by the 16 bits we have to write to the register holding x. Thus, information theory says we need at least $16 \cdot k_B T \ln 2 \approx 4.6 \times 10^{-20}$ J of energy.
In practice, though, the H100 does around $10^{15}$ FLOPs per second, which gives a lower bound of only about $5 \times 10^{-5}$ W. This is a wildly different lower bound from before, and it suggests that we are, in fact, much further from the theoretical optimum.
If we accept the FLOP-based bound as closer to the "true" information processing rate, we are actually around 7 orders of magnitude off, not 3. Thus, the optimality gap depends on what actual computation is being performed. Crucially, we know that a FLOP takes much more than 16 transistor operations; in particular, the computational lower bound is much stronger than the information theoretic lower bound (for instance, the information theoretic lower bound allows for weird non-circuit computations that may take less energy).
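For concreteness, here is the FLOP-based bound as a script (same caveats as before; the $10^{15}$ FLOP/s figure is an assumed rough dense-FP16 throughput, and the 16 bits per FLOP is just the width of the output register):

```python
import math

K_B = 1.380649e-23
T = 300.0
LANDAUER = K_B * T * math.log(2)   # J per bit erased

POWER_W = 350.0       # H100 PCIe TDP
FLOPS = 1e15          # rough dense-FP16 throughput, FLOP/s (assumption)
BITS_PER_FLOP = 16    # bits overwritten in the output register

# Information-theoretic floor: each FLOP must erase at least 16 bits.
floor_w = FLOPS * BITS_PER_FLOP * LANDAUER

print(f"Per-FLOP energy floor:  {BITS_PER_FLOP * LANDAUER:.1e} J")  # ~4.6e-20 J
print(f"FLOP-based power floor: {floor_w:.1e} W")                   # ~5e-5 W
print(f"Gap to actual:          {POWER_W / floor_w:.1e}x")          # ~10^7
```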
We can factorize our gap to the theoretical optimum as follows: there is some fundamental transistor-flipping cost in the chip, and then there is some algorithmic overhead on top of it. In the limit, a single decode step of a transformer model only produces a single token (around 17 bits of information), and so there is, at least theoretically speaking, a very large overhead.
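To put a number on that floor, here is the Landauer cost of emitting a single token (a sketch; the vocabulary size of $10^5$ is an assumed, illustrative figure):

```python
import math

K_B = 1.380649e-23
T = 300.0
LANDAUER = K_B * T * math.log(2)   # J per bit erased

VOCAB_SIZE = 100_000   # assumed vocabulary size, order of magnitude only

# A single sampled token carries at most log2(|V|) bits of information.
bits_per_token = math.log2(VOCAB_SIZE)          # ~16.6 bits
floor_j_per_token = bits_per_token * LANDAUER   # ~5e-20 J

print(f"Bits per token:       {bits_per_token:.1f}")
print(f"Landauer floor/token: {floor_j_per_token:.1e} J")
# Dividing any measured joules-per-token figure by this floor gives the
# total (transistor x algorithmic) overhead factor for decoding.
```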
In particular, the transistor overhead and the algorithmic overhead are multiplicative, which implies that progress on either one improves energy usage. Crucially, transistor overhead is mostly a function of the underlying hardware (and thus the chip foundry's methods), while algorithmic overhead is a function of the designer. Consequently, it is possible to make progress on both fronts: both at the level of pure silicon and better physics, and via more optimal chip layouts.
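As a sanity check on this multiplicative picture, we can decompose the FLOP-level gap into the two factors, reusing the same hedged figures as above (every transistor switching once per cycle, roughly $10^{15}$ FLOP/s):

```python
import math

K_B = 1.380649e-23
T = 300.0
LANDAUER = K_B * T * math.log(2)

POWER_W = 350.0                  # actual power draw
SWITCHES_PER_S = 80e9 * 1.7e9    # transistors x clock (naive upper bound)
FLOPS = 1e15                     # assumed FP16 throughput

# Transistor overhead: actual energy per switch vs. the Landauer limit.
transistor_overhead = (POWER_W / SWITCHES_PER_S) / LANDAUER   # ~1e3

# Circuit/algorithmic overhead: bits erased per FLOP vs. the 16 useful
# output bits a 16-bit FLOP actually produces.
circuit_overhead = (SWITCHES_PER_S / FLOPS) / 16              # ~1e4

print(f"transistor overhead: ~{transistor_overhead:.0e}x")
print(f"circuit overhead:    ~{circuit_overhead:.0e}x")
print(f"product:             ~{transistor_overhead * circuit_overhead:.0e}x  (the ~10^7 gap above)")
```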
The astute reader will observe that this calculation seems to imply that a 16-bit FLOP requires, on average, over a hundred thousand transistor switches, which seems high. Indeed, in practice only a fraction of the transistors are active at any given time (e.g. because they implement memory or control logic, or are simply not part of the relevant arithmetic circuit).