The second law of thermodynamics, roughly speaking, states that the Gibbs free energy $G = H - TS$ of a system held at constant temperature and pressure cannot increase, where $H$ is the enthalpy, $T$ is the temperature, and $S$ is the entropy. This implies a few useful ideas: for example, lowering the entropy of the system increases the $-TS$ term, and thus requires inputting some energy.

Classically, we define $S$ as $k$ times the information-theoretic entropy of the distribution over microstates (or, in the case where all the microstates have the same energy, the log of the number of microstates), where $k$ is Boltzmann's constant. However, if we try to think of this entropy as literal bits of information, then weird things start to happen. For instance, one can think of performing computation as decreasing uncertainty.
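
To make the definition concrete, here is a minimal Python sketch of the classical formula $S = -k \sum_i p_i \ln p_i$, which reduces to $k \ln N$ for $N$ equally likely microstates (the toy distribution below is purely illustrative):

```python
import math

k = 1.380649e-23  # Boltzmann's constant, J/K

def thermo_entropy(probs):
    """Thermodynamic entropy: k times the Shannon entropy (in nats) of the microstate distribution."""
    return -k * sum(p * math.log(p) for p in probs if p > 0)

# A uniform distribution over N equally likely microstates reduces to k * ln(N).
N = 1_000_000
print(thermo_entropy([1 / N] * N))  # ≈ 1.9e-22 J/K
print(k * math.log(N))              # same value, k * ln(N)
```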

Consider a logic gate $C = f(A, B)$, where $f$ is an arbitrary binary function. Before applying $f$, the value of $C$ can be either $0$ or $1$, but after applying it $C$ collapses to one value. Naively, we can therefore model the cost of each transistor operation as erasing $k \ln 2$ of entropy, which should require at least $k T \ln 2$ of energy.
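
As a sanity check on the numbers used below, here is a quick back-of-the-envelope sketch of the Landauer limit $k T \ln 2$ at room temperature and at a hot-running chip's temperature (the two temperatures are just illustrative):

```python
import math

k = 1.380649e-23  # Boltzmann's constant, J/K

for T in (300.0, 370.0):  # room temperature vs. a hot-running chip
    print(f"T = {T:.0f} K: kT ln 2 ≈ {k * T * math.log(2):.2e} J per bit erased")
# ≈ 2.9e-21 J at 300 K and ≈ 3.5e-21 J at 370 K
```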

A natural question to ask now is: Suppose the true physical limit for a single transistor operation is $k T \ln 2$. How much energy do transistor operations currently cost in practice? Relatedly, is there an energy lower bound on running, say, GPT 5? How far are we from this limit?

One naive thing to do is to estimate the theoretical cost of running a single H100 GPU, which contains around $8 \cdot 10^{10}$ transistors and (for the PCIe version) runs at around $2 \cdot 10^9\ \mathrm{Hz}$. It uses $700\ \mathrm{W}$ of power (and thus $700\ \mathrm{J}$ per second), and can operate at $100\,^\circ\mathrm{C} \approx 370\ \mathrm{K}$. Thus, if every transistor switched once per clock cycle (about $16 \cdot 10^{19}$ switches per second), the minimum energy required per second of operation would be

$$16 \cdot 10^{19}\, k T \ln 2 \approx \left(16 \cdot 10^{19}\right)\left(1.4 \cdot 10^{-23}\ \mathrm{J/K}\right)(370\ \mathrm{K})(0.7) \approx 0.6\ \mathrm{J}.$$

So we're only about 3 orders of magnitude off. Not bad!
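Here is the same estimate as a small Python sketch, using the rough figures quoted above and comparing the Landauer floor to the H100's actual 700 W draw:

```python
import math

k, ln2 = 1.380649e-23, math.log(2)

transistors  = 8e10    # rough H100 transistor count
clock_hz     = 2e9     # rough clock rate
temp_kelvin  = 370.0   # ~100 °C operating temperature
power_watts  = 700.0   # actual power draw

switches_per_sec = transistors * clock_hz                    # ~1.6e20, if every transistor switched each cycle
landauer_watts   = switches_per_sec * k * temp_kelvin * ln2  # minimum power if each switch erases one bit
print(landauer_watts)                                        # ≈ 0.57 J/s
print(math.log10(power_watts / landauer_watts))              # ≈ 3.1 orders of magnitude
```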

Now, this analysis has two issues:

  • First, in practice, not all transistors switch every clock cycle, so the true Landauer floor is lower than 0.6 J per second and the H100 is actually more than 3 orders of magnitude away from it.
  • More generally, returning to the "what is the minimum energy required to inference GPT 5 (for some fixed choice of batch size, context length, etc.)?" problem, simply thinking about things in terms of transistors ignores, for example, potential algorithmic improvements.

To illustrate the relationship between these issues, we can consider the energy required to perform a FLOP. For a given operation x := y * z (on, say, 16-bit floats), in principle we are only lower bounded by the bits written to the register for x. Thus, information theory says we need $16\, k T \ln 2$ of energy per FLOP. In practice, though, the H100 does around $2 \cdot 10^{15}$ FLOPs per second, which gives a lower bound of $32 \cdot 10^{15}\, k T \ln 2$ per second. This is a wildly different lower bound from before¹, and it suggests that we are, in fact, much further from the theoretical optimum.

If we accept the FLOP-based bound as closer to the "true" information processing rate, we are actually around 7 orders of magnitude off, not 3. Thus, the optimality gap depends on what actual computation is being performed. Crucially, a FLOP as implemented in a real circuit takes far more than 16 transistor operations; in particular, the circuit-level lower bound is much stronger than the information-theoretic lower bound (which, for instance, still allows for weird non-circuit computations that may take less energy).
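
The FLOP-based version of the estimate, again as a rough Python sketch with the figures quoted above ($2 \cdot 10^{15}$ FLOPs per second, 16 bits per result, 370 K, 700 W):

```python
import math

k, ln2 = 1.380649e-23, math.log(2)

flops_per_sec  = 2e15    # rough H100 throughput
bits_per_flop  = 16      # bits written to the output register of x := y * z
temp_kelvin    = 370.0
power_watts    = 700.0

landauer_watts = flops_per_sec * bits_per_flop * k * temp_kelvin * ln2
print(landauer_watts)                            # ≈ 1.1e-4 J/s
print(math.log10(power_watts / landauer_watts))  # ≈ 6.8 orders of magnitude
```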

We can factorize our gap to the theoretical optimum as follows: there is some fundamental transistor-flipping cost in the chip, and there is some algorithmic overhead. In the limit, a single decode step of a transformer model only produces a single token (around 17 bits of information), so there is, at least theoretically speaking, a very large overhead.
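
To put a number on the per-token floor: at 370 K, 17 bits of output cost at least $17\, k T \ln 2 \approx 6 \cdot 10^{-20}\ \mathrm{J}$. The sketch below also compares this against a purely hypothetical energy cost per decoded token (the 1 J figure is an assumption for illustration, not a measured value):

```python
import math

k, ln2 = 1.380649e-23, math.log(2)
temp_kelvin = 370.0

bits_per_token     = 17                                   # ~log2(vocabulary size)
landauer_per_token = bits_per_token * k * temp_kelvin * ln2
print(landauer_per_token)                                 # ≈ 6e-20 J per decoded token

# Hypothetical actual cost per token, purely for illustration.
assumed_joules_per_token = 1.0
print(math.log10(assumed_joules_per_token / landauer_per_token))  # ≈ 19 orders of magnitude
```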

In particular, the transistor overhead and the algorithmic overhead are multiplicative, which implies that progress on either can improve energy usage. Crucially, transistor overhead is mostly a function of the underlying hardware (and thus the chip foundry's methods), while algorithmic overhead is a function of the designer. Consequently, it is possible to make progress on both fronts: both on the level of pure silicon and better physics, and via more optimal chip layouts and algorithms.


  1. The astute reader will observe that this calculation seems to imply that a 16-bit FLOP corresponds, on average, to around 80,000 transistor switches (about 5,000 per bit written), which seems high. Indeed, in practice only a fraction of the transistors are active at any given time (e.g. due to memory, or just not being a part of the relevant arithmetic circuit).