Entropy in thermodynamics and information theory
In classical thermodynamics, the change in entropy is defined as the heat transferred (in a reversible process) divided by the temperature at which the transfer happens. Add more heat, and you increase entropy more. Add the same amount of heat to a hotter object, and you increase it less.
The second law of thermodynamics states that heat naturally disperses and spreads out until temperatures equalize, which means that entropy always increases (or in some very special cases, not encountered in the real world, it can remain the same). This assumes an isolated system, where no heat enters or leaves, no work is done on it or by it, and no particles cross the boundary - it is completely “cut off” from the environment.
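As a worked illustration of the two paragraphs above, take an amount of heat Q flowing from a hot body at temperature T_h to a cold body at T_c. The numbers are purely illustrative, and both temperatures are assumed to stay constant during the transfer:
\[\Delta S = \frac{Q}{T_c} - \frac{Q}{T_h} = \frac{1000\ \mathrm{J}}{300\ \mathrm{K}} - \frac{1000\ \mathrm{J}}{400\ \mathrm{K}} \approx 0.83\ \mathrm{J/K} > 0\]
The cold body gains more entropy than the hot body loses, so the total goes up. The same heat flowing from cold to hot would make the total negative, which is exactly what the second law rules out.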
Moving away from the classical definition, we can use a more fundamental one, the so-called Boltzmann definition:
Thermodynamic entropy is a measure of the number of microscopic configurations compatible with a system’s macroscopic state.
Entropy increases because systems naturally evolve towards more probable configurations. There are only so many ways a system can remain perfectly ordered, but many more ways in which it can be disordered.
Speaking of “evolve”, there is an argument that the theory of evolution is incompatible with the second law of thermodynamics. Of course, the argument is fundamentally flawed, starting with the fact that life on Earth is not an isolated system - it receives low-entropy energy from the Sun and emits high-entropy radiation to space. Local emergence of complex structures (such as living organisms) is therefore possible, because such processes contribute to the overall entropy production.
Now, let’s consider an isolated gas in a box. We can observe its macrostate - e.g. pressure, volume, overall arrangement of molecules. There are only so many microstates that produce “all molecules in one straight line”, but astronomically more microstates that produce “molecules evenly distributed across the box”. In the latter case, the macrostate can be achieved with a much higher number of different microstates. And that is what entropy measures - the probability of some macrostate, derived from how many possible microstates can create it. An isolated system tends to move from a low-probability macrostate (low entropy) to a high-probability macrostate (high entropy). The reverse is extremely unlikely, although theoretically still possible, which leads to thought experiments like the Boltzmann brain.
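The counting argument above can be sketched with a toy model. This is a simplification, not the real physics: it assumes N distinguishable molecules, each sitting in either the left or the right half of the box, and treats “how many molecules are on the left” as the macrostate. The function names are made up for illustration.

```python
from math import comb, log

def microstates(n_molecules, n_left):
    """Number of ways to choose which n_left of n_molecules sit in the left half."""
    return comb(n_molecules, n_left)

def boltzmann_entropy_over_k(n_molecules, n_left):
    """Boltzmann entropy S = k ln W, reported in units of k (that is, ln W)."""
    return log(microstates(n_molecules, n_left))

N = 100  # illustrative molecule count, far below any real gas
for n_left in (0, 10, 25, 50):
    w = microstates(N, n_left)
    s = boltzmann_entropy_over_k(N, n_left)
    print(f"{n_left:>3} molecules on the left: W = {w:.3e}, S/k = {s:.2f}")
```

Even with only 100 molecules, the evenly split macrostate is backed by roughly 10^29 microstates, against exactly one for “everything on the right”. For a real gas the imbalance is incomparably larger.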
But what I particularly like about this definition is that it is so fundamental that it can be applied to things completely unrelated to heat and temperature, such as language.
This is where information theory, and consequently Claude Shannon, comes into play. Side note - that is the naming origin of Anthropic’s “Claude” model, very popular at the time of writing this text.
Shannon’s famous equation goes:
\[H(X) = -\sum_{x} p(x) \log_2 p(x)\]
We chose base 2 for the logarithm, giving the unit of bits (or “shannons”).
The sum runs over all possible outcomes x, each contributing according to its probability p(x). For one binary digit, there are two possible outcomes: 0 and 1.
For maximum entropy of 1 bit, the probabilities need to be equal, 0.5 each:
\[H(X) = -(0.5 \cdot \log_2 0.5 + 0.5 \cdot \log_2 0.5) = 1\]
Contrast that to a situation where it is known that “0” has a 99.9% probability:
\[H(X) = -(0.999 \cdot \log_2 0.999 + 0.001 \cdot \log_2 0.001) \approx 0.0114\]
This is very intuitive. If we know that one outcome is much more likely, revealing the actual outcome doesn’t bring much useful information. Asking if some randomly chosen person in Germany owns a castle, and learning that they don’t, doesn’t feel very useful.
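The two calculations above are easy to check numerically. Here is a minimal sketch; the function name is my own choice, and outcomes with zero probability are skipped, following the usual convention that they contribute nothing to the sum.

```python
from math import log2

def shannon_entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution.

    Zero-probability outcomes are skipped (convention: 0 * log2(0) = 0).
    """
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))                # fair bit -> 1.0
print(round(shannon_entropy([0.999, 0.001]), 4))  # heavily biased bit -> 0.0114
```

The function does not check that the probabilities sum to 1, so that remains the caller’s responsibility.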
I find it very interesting to apply this principle to languages. Let’s take English. With no other considerations, just looking at the 26-letter alphabet with a uniform probability distribution, we get 4.7 bits per character.
Then we look at letter frequencies. For example, the letter e is common, while the letter z is rare. This drops the entropy to ~4.1 bits per character.
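Both numbers can be explored with a few lines of code. The sketch below computes the uniform bound and the entropy of the observed letter frequencies in a piece of text. The sample sentence is only a placeholder: it is far too short to reproduce the ~4.1 figure, which needs frequencies measured on a large English corpus.

```python
from collections import Counter
from math import log2
import string

def letter_entropy(text):
    """Entropy in bits per letter of the empirical a-z frequencies in text."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * log2(n / total) for n in counts.values())

uniform_bits = log2(26)  # upper bound: all 26 letters equally likely
sample = "the quick brown fox jumps over the lazy dog and then rests in the sun"
print(f"uniform alphabet: {uniform_bits:.2f} bits per letter")
print(f"sample text:      {letter_entropy(sample):.2f} bits per letter")
```

Whatever text is plugged in, the result can never exceed the uniform bound of about 4.7 bits, because any unevenness in the frequencies only lowers the entropy.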
Then we include correlations between letters. For example, q is often followed by u. This drops the entropy to about 2-3 bits per character.
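One way to put a number on such correlations is the conditional entropy of a letter given the letter before it. The sketch below is a rough illustration under my own assumptions: it strips everything except a-z (so letter pairs bridge word boundaries), and the function name is hypothetical. On short texts the estimate comes out far too low, because most letter pairs are seen only once.

```python
from collections import Counter
from math import log2

def conditional_letter_entropy(text):
    """Estimate H(next letter | previous letter) in bits from adjacent letter pairs."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    pairs = Counter(zip(letters, letters[1:]))  # counts of (previous, next)
    firsts = Counter(letters[:-1])              # counts of each previous letter
    total = sum(pairs.values())
    # H(Y|X) = -sum over pairs of p(x, y) * log2 p(y | x)
    return -sum((n / total) * log2(n / firsts[x]) for (x, _), n in pairs.items())

# In this tiny sample every q is followed by u, so q adds no uncertainty at all.
print(conditional_letter_entropy("queen quiz quote quick quiet"))
```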
We can now include words, grammar and semantics. “Peanut butter and _” is likely to be followed by “jelly”. Entropy approaches around 1 bit per character or lower. This is why English text compresses so well. The basic intuition is to replace frequent tokens with a short encoding, and infrequent ones with a long encoding. One famous example is Huffman coding.
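To make the “short codes for frequent tokens” intuition concrete, here is a compact sketch of Huffman coding over single characters. It is a teaching version, not an optimized one: it builds the code table directly by merging dictionaries instead of constructing an explicit tree, and the function name and sample sentence are my own.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code mapping each character in text to a bit string."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:  # a single distinct symbol still needs one bit
        return {symbol: "0" for symbol in counts}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(n, i, {symbol: ""}) for i, (symbol, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest groups, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "peanut butter and jelly"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(len(encoded), "bits with Huffman vs", 8 * len(text), "bits at 8 bits per character")
```

This only exploits single-character frequencies, so it corresponds to the ~4.1 bits per character stage above. Getting closer to 1 bit per character requires a model that also uses context.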
In his 1951 paper “Prediction and Entropy of Printed English”, Shannon estimated the entropy of English to be roughly in the range of ~0.6 to ~1.3 bits per character, depending on how much context is available and how prediction is done (for example, which n-gram is used).
In all these different contexts, entropy remains rooted in the same basic principle: it is a measure of uncertainty about a system’s state, based on the probabilities of its possible states.