Introduction #

This is the first in a series of posts which introduce classical information theory using the geometric framework — in contrast to the often-used viewpoint of binary bits. The goal is to understand how to use information theory to study probability distributions, and how they change as we gain or lose information. This article shows how the geometric formulation of entropy provides a nice way to visualise probability distributions as occupying 'volumes' in the space of possible outcomes.

The series is based on material from chapter four of my thesis, where I used these ideas to study measurement in quantum systems. We won't talk about quantum mechanics here, but I will briefly describe the problem so you can get an idea of what we can do with these tools. Suppose we wish to measure the strength of a magnetic field, whose value we denote by $\Phi$. We will have some prior knowledge about the field (for example, it may be Gaussian distributed about some field strength $\phi_1$), so $\Phi$ has a known probability distribution. We take a quantum state $\rho$, let it interact with the field, and then make a measurement on $\rho$, the outcome $M$ of which also follows a known probability distribution given by quantum mechanics. What information does $M$ contain about $\Phi$? Information theory and the geometric formulation are useful tools for attacking this problem.

In what follows I will avoid mathematical rigour and make intuitive arguments. For a more complete development see [Wil17] and [CT05], on which this introduction is based.

Probability preliminaries #

This post will use the language of modern probability theory, so here we briefly introduce the necessary vocabulary. The mathematical object used to model random events is called a random variable, which is a set of outcomes, called the sample space, together with an associated probability distribution. When we observe the outcome of a random variable we say that we sample it. For example a dice roll has a sample space $\{1,2,3,4,5,6\}$, each with an equal probability of $1/6$, and sampling this twice might result in outcomes '2' and '4'. This is an example of a discrete random variable, where the sample space is a discrete set. Other random variables are continuous, such as if you were to sample the height of a random person.

Returning to the magnetic field, the sample space is the set of possible field strengths. This will likely be continuous. The probability distribution then encodes our prior knowledge of which outcomes are more or less likely.

Discrete entropy #

Suppose $A$ is a discrete random variable, with $N$ possible outcomes in the sample space. If the $i$th outcome has probability $p_i$, the entropy of $A$ is defined as

$$H(A)=-\sum_{i=1}^N p_i\log p_i=\sum_{i=1}^N p_i\log\left(\frac{1}{p_i}\right).$$

Unless explicitly stated, all logarithms will be in base $e$, a choice discussed at the end of the next section. We will spend the rest of this section studying the meaning of this sum.

Suppose you were to sample $A$, with knowledge of the probability distribution of samples. The term $\log(1/p_i)$ quantifies how much information you would gain upon observing the $i$th outcome:

  1. If $p_i=1$ then you already knew the outcome beforehand so gain no new information: $\log(1/1)=0$. The term $\log(1/p_i)$ is small for values of $p_i\approx 1$. Highly probable events do not represent a large gain in information, as you were already 'expecting' these from the probability distribution. If NASA were to announce that they searched the skies and could find no large asteroids heading towards the Earth, they would not be telling us much more than we already assumed.
  2. If $p_i=\epsilon\ll 1$ then witnessing this represents a significant amount of 'surprise': the information gain $\log(1/\epsilon)$ is very large. You would feel like you had learned a lot of information if NASA were to announce that they had found a large asteroid on a collision course with Earth.

The entropy therefore characterises the average amount of information you gain per observation. A highly random probability distribution evenly spread among a large number of outcomes will always 'surprise' you, and has high entropy. On the other hand, a very predictable distribution concentrated amongst a few highly probable outcomes has low entropy. It can be shown that entropy takes its minimal value of $H(A)=0$ if $p_i=1$ for some $i$, and is bounded by $H(A)\le \log N$, with equality achieved only for the uniform distribution. (The latter is shown most easily using the method of Lagrange multipliers, see for example [Wil17, §10.1].)
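To make these bounds concrete, here is a minimal sketch in Python (the `entropy` helper and the example distributions are my own, not from the post):

```python
import numpy as np

def entropy(p, base=np.e):
    """Entropy of a discrete distribution p, in the given logarithm base."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # treat 0 * log(1/0) as 0
    return np.sum(p * np.log(1 / p)) / np.log(base)

uniform = [1 / 6] * 6                   # a fair dice roll
print(entropy(uniform), np.log(6))      # both ~1.792: the upper bound log(N)
print(entropy([1, 0, 0, 0, 0, 0]))      # 0.0: a deterministic outcome has no surprise
```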

As an illustration let's consider a biased coin where heads has probability $p$, and tails $1-p$. The entropy of this distribution is called the binary entropy:

$$H(p)=-p\log p-(1-p)\log(1-p).$$

We plot $H(p)$ below:

The binary entropy H(p). This is a concave parabola-shaped function, taking minimum value of 0 when p is zero or one, and maximum of log(2) when p is one-half.

The binary entropy is zero in the deterministic case when $p$ is either one or zero. For these values the coin is either always heads or always tails; you know the outcome before the coin toss so gain no new information by observing the outcome. The maximum value of $\log(2)$ is attained when $p=0.5$ and the distribution is as 'random' as possible.
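If you want to reproduce a plot like the one above, a short sketch (the `binary_entropy` helper is mine, and it assumes matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt

def binary_entropy(p):
    """H(p) = -p log p - (1 - p) log(1 - p), with the convention H(0) = H(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)    # avoid log(0); the limit at the endpoints is 0
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

ps = np.linspace(0, 1, 201)
plt.plot(ps, binary_entropy(ps))        # peaks at log(2) ~ 0.693 when p = 0.5
plt.xlabel("p")
plt.ylabel("H(p)")
plt.show()
```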

The volume of a random variable #

In this section we show that entropy gives us the 'volume' occupied by a probability distribution, the so-called geometric interpretation of information theory [Wil17, §2]. Entropy was introduced by Claude Shannon as a tool of the information age [Sha48], when messages were being transmitted as binary sequences of zeros and ones. We will consider an example along these lines, so it will be convenient to take our logarithms to be in base two, which we indicate with a subscript.

Suppose our random variable $A$ represents a letter in some alphabet, and sampling this multiple times gives us a message we wish to transmit. The probability distribution of $A$ is the relative frequency with which each letter is used. This could be seen as a crude model for the written language, or $A$ may be the outcome of some experiment or sensor readout that we wish to transmit to someone else. Shannon's key insight was that the probability distribution of $A$ introduces a fundamental limit on how efficiently we may encode a message as binary: the more 'spread out' the distribution, the less efficient an encoding we may use.

We consider an alphabet made up of four letters: $\{a,b,c,d\}$. First suppose that in our language each occurred with equal frequency, so the probability distribution was uniform: $p_i=1/4$. In this case the entropy of $A$ is

$$H_2(A)=4\times\left(-\frac{1}{4}\log_2\frac{1}{4}\right)=2,$$

with the subscript in $H_2$ denoting the base two logarithm. If we wish to encode a message in binary, Shannon's fundamental result was to show that we would need at least $H_2(A)$ bits per character on average [Sha48]: the higher the entropy, the less efficient an encoding we can achieve. We will take this on faith here; for a discussion of how to make this argument see [Wil17, §2.1]. One way of encoding the outcome of $A$ is
$$\{(a,00),\,(b,01),\,(c,10),\,(d,11)\},$$

in which case a message 'aaba' would be encoded as 00000100. If $A$ is uniformly distributed then two bits per character is the best we can do.

Letters do not however appear with equal frequency, and we may use this structure to create a more efficient encoding. As an extreme example consider a different language $B$ with the same four letters, but suppose that $a$ occurred $97\%$ of the time, while $b$, $c$, and $d$ each had probability $1\%$. We could then use the code

$$\{(a,0),\,(b,100),\,(c,101),\,(d,111)\},$$

which would encode 'aaba' as 001000. The number of bits used per letter now varies, but a string of ones and zeros can still be unambiguously decoded. A 0 represents an 'a', while a 1 means that the next two bits will tell you if it is a 'b', 'c', or 'd'. Since $a$ appears more frequently, $97\%$ of the time we would only use a single bit. The expected number of bits per character with this encoding is then $0.97\times 1+0.01\times 3\times 3=1.06$ (thanks u/CompassRed for pointing out a missing 3), so this code is nearly twice as efficient as what we had for $A$. Calculating the entropy we find $H_2(B)\approx 0.24$, so we are still using about four times as many bits as the optimal code. One way to do better would be to encode multiple characters at a time. Since 'a' occurs so often we might let 0 represent the sequence 'aaa', and this way the average bits per letter would be less than one.
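These numbers are easy to verify with a few lines of Python (a sketch; the variable names are mine):

```python
import numpy as np

# Letter frequencies of language B and the prefix code given in the text
probs = {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}
code  = {"a": "0", "b": "100", "c": "101", "d": "111"}

expected_bits = sum(p * len(code[letter]) for letter, p in probs.items())
entropy_bits  = sum(p * np.log2(1 / p) for p in probs.values())

print(expected_bits)   # 1.06 bits per letter with this code
print(entropy_bits)    # ~0.242 bits per letter: Shannon's lower limit H_2(B)
```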

The two random variables $A$ and $B$, despite occupying the same sample space, ended up having very different sizes. We would require two bits per character to transmit $A$ as a binary signal, as opposed to only $0.24$ for $B$. This 'size' is determined by the probability distribution, and quantified by the entropy. Thinking about random variables in this manner is called the geometric interpretation of entropy.

In this picture we see our random variable as occupying a sub-volume of the sample space due to the structure of its probability distribution. It may seem strange for us to use the word 'volume' for a discrete random variable, when we really mean 'number of points'. Points, length, area, and our usual three-dimensional 'volume' are all ways of measuring 'sizes', and which one is relevant depends on the dimension of the problem at hand. In this series we will use the word 'volume' to refer to all of them. As we will discuss later, this can all be made rigorous through the lens of measure theory (or see §5.2 of my thesis if you are impatient).

The volume occupied by a random variable is found by raising the base of the logarithm to the power of the entropy.

  1. In the example just covered $2^{H_2(A)}=2^2=4$, and so the random variable $A$, which is uniformly distributed, occupies the entire sample space.
  2. Suppose we had a variable $C$ which always used the letter $b$ with probability one. We would have $H_2(C)=0$, and $2^{H_2(C)}=2^0=1$. The variable occupies only a single letter as expected. For our variable $B$ we have $2^{H_2(B)}=2^{0.24}\approx 1.18$, so this occupies only slightly more than a single letter.
  3. If a variable $C'$ were evenly distributed between $a$ and $b$ with probability $1/2$, the entropy would be
    $$-\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2}=\log_2 2=1,$$
    and $2^{H_2(C')}=2^1=2$, occupying exactly two letters out of the sample space. (Thanks to @KevinSafford, u/Avicton and u/inetic for fixing typos.)
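All of these volumes can be checked numerically; here is a minimal sketch (the `volume` helper is my own naming):

```python
import numpy as np

def volume(p):
    """Effective number of outcomes, 2**H_2(p), which equals e**H(p)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return 2 ** np.sum(p * np.log2(1 / p))

print(volume([0.25, 0.25, 0.25, 0.25]))  # 4.0  : A fills the whole four-letter alphabet
print(volume([0.97, 0.01, 0.01, 0.01]))  # ~1.18: B occupies barely more than one letter
print(volume([1.0, 0.0, 0.0, 0.0]))      # 1.0  : the deterministic C
print(volume([0.5, 0.5, 0.0, 0.0]))      # 2.0  : C' occupies exactly two letters
```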

If we return to the binary entropy of a biased coin, the sample space has two elements: heads and tails. Exponentiating $H(p)$ gives

Exponential of the binary entropy e^H(p). This is a concave parabola-shaped function, taking minimum value of 1 when p is zero or one, and maximum value of 2 when p is one half.

We can see that in the deterministic cases, where $p=0$ or $p=1$, the random variable occupies a single element, either heads or tails. The full sample space is occupied when the coin is as random as possible, at $p=0.5$.

Let us briefly discuss the role of the base two logarithm. The final volumes are exactly the same as if we used the natural logarithm, since:

$$2^{H_2(A)}=e^{H(A)}.$$
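One way to see this is to rewrite the exponentiated entropy as a product over outcomes, a form that makes no reference to any particular base:

$$2^{H_2(A)}=2^{\sum_i p_i\log_2(1/p_i)}=\prod_i\left(\frac{1}{p_i}\right)^{p_i}=e^{\sum_i p_i\log(1/p_i)}=e^{H(A)}.$$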

The only role of base two was the intermediate step, when we interpreted $H_2(A)$ as the number of bits required to transmit each character. This was because we were signalling in binary; if we were instead sending our signals in ternary (using 'trits', which take three values rather than the two of 'bits') then we would have had to reach for the base three logarithm.

Often in the literature base two is used, reflecting the role of information theory in studying digital communications. We, however, don't want to study transmitting information in binary, but rather how the volume of a random variable changes as we gain information. In so far as any base is natural, we may as well choose the natural logarithm, base $e$.

Conclusion #

In the geometric picture we visualise a random variable $A$ as occupying a sub-volume of its sample space, which is given by $e^{H(A)}$. A uniform distribution is spread out evenly over the entire sample space, while a deterministic random variable occupies only a single point.

In future posts we will see that the geometric picture offers a nice way to unify the entropy of a continuous and discrete random variable. All that you need to do is switch your idea of size from the discrete 'number of points' to the continuous analogues of 'length', 'area', or three-dimensional 'volume'. This will be particularly useful when we want to describe the interaction between continuous and discrete quantities. In the quantum example I discussed at the start, the measurement $M$ has discrete outcomes, from which we want to infer the continuous value of the magnetic field.

The next post will discuss correlations between random variables, such as when you observe a measurement $M$ to try and learn about some parameter $\Phi$. We will see how the volume of sample space occupied by the random variable shrinks as you gain more information about it, until it collapses to a single point with perfect information.

Let me know if you have any questions or comments about this article. Follow @ruvi_l on Twitter for more posts like this, or join the discussion on Reddit.

Notes and further reading #

References #

[Wil17] Wilde, M. (2017). Quantum information theory (Second Edition). Cambridge University Press.

[CT05] Cover, T. M., & Thomas, J. A. (2005). Elements of Information Theory. John Wiley & Sons. https://doi.org/10.1002/047174882X

[Sha48] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x