Introduction #

In a previous post we saw how entropy quantifies the 'volume' occupied by a discrete random variable. However, many random quantities are continuous, for example if we are measuring the strength of a magnetic field. To analyse these we have to define a continuous entropy. This has some strange features: it can be negative, and depends on parameterisation. What does it mean to have a negative amount of randomness? And how can it depend on the way that you describe your random variable?

In this post we will consider the entropy of a continuous random variable, again from the geometric viewpoint. Thinking in terms of 'volumes' makes the strange features above both intuitive and expected. It turns out that the continuous entropy is a slightly different beast to the discrete case, and the geometric picture is the best way to understand this.

Defining a continuous entropy #

If $A$ is a discrete random variable, with $n$ different outcomes and associated probabilities $\{p_1,\ldots,p_n\}$, the discrete entropy is defined as:

[eqDiscreteEntropy]: H(A)=-\sum_{i=1}^n p_i\log p_i.

The meaning of this expression, and its interpretation as a volume of the sample space, was discussed in the previous post.

Now, suppose we have a continuous random variable $X$. This doesn't have a discrete set of outcomes, but rather takes values in some range $[a,b]$, according to a probability density $f(x)\mathrm{d}x$. You should always include the $\mathrm{d}x$ part! You need the differential to have the right units, and to transform correctly when you change variables (see the appendix). For example, let $X$ be a Gaussian with mean $3$ and standard deviation $1.5$. Then this can take a value anywhere on the real line, $[a,b]=[-\infty,\infty]$, and the probability density is

f(x)\mathrm{d}x=\frac{1}{\sqrt{2\pi(1.5)^2}}\exp\left(-\frac{(x-3)^2}{2(1.5)^2}\right)\mathrm{d}x.

The probability of attaining any particular point $x$ is zero, i.e. the chance of sampling a Gaussian and getting exactly $x=3.0000\ldots$ is nil. Thus $f(x_0)$ is not the probability that we will observe $X=x_0$. Rather, $f(x_0)\mathrm{d}x$ gives the probability that if we measure $X$, we will observe a value between $x_0$ and $x_0+\mathrm{d}x$ (provided $\mathrm{d}x$ is small). Returning to our Gaussian example and setting $\mathrm{d}x=0.01$,

f(3)\times 0.01=0.0026596\cdots,

which is the probability that if we measured $X$, we would observe a value between $3$ and $3.01$.
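We can check this arithmetic directly. A minimal sketch in Python, using the example density above:

```python
import math

def gaussian_pdf(x, mean=3.0, sigma=1.5):
    """Gaussian probability density with the mean and sigma of the example."""
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

dx = 0.01
prob = gaussian_pdf(3.0) * dx  # P(3 < X < 3.01), to first order in dx
print(prob)  # ≈ 0.0026596
```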

Given a random variable $X$, we define the continuous entropy (also called differential entropy) by naïvely taking [eqDiscreteEntropy] and swapping the sum for an integral and the probability for the probability density:

[eqContinuousEntropy]: h(X)=-\int f(x)\log f(x)\;\mathrm{d}x.

At first glance this may seem perfectly innocuous. However, the continuous case introduces several subtleties, which have confused information theorists ever since it was first introduced. The continuous entropy is in fact a different beast to the discrete entropy, which is why it is written with a lowercase '$h$'.
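The integral in [eqContinuousEntropy] is easy to approximate numerically. A minimal sketch, using a plain midpoint Riemann sum and the Gaussian from the example above (σ = 1.5):

```python
import math

def gaussian_pdf(x, mean=3.0, sigma=1.5):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Midpoint-rule approximation of h(X) = -∫ f(x) log f(x) dx,
# truncating the integral at mean ± 10 sigma (the tails are negligible).
dx = 0.001
n = int(30.0 / dx)
h = 0.0
for i in range(n):
    x = -12.0 + (i + 0.5) * dx
    p = gaussian_pdf(x)
    h -= p * math.log(p) * dx

print(h)  # ≈ 1.824
```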

A parametrisation problem #

One of the first indicators that something is off about [eqContinuousEntropy] comes from thinking about units. As discussed in the appendix, if $x$ has units then so must $f(x)$, for $f(x)\mathrm{d}x$ to be dimensionless. For example, if $x$ is a length, then $f(x)$ will have units of 1/length. However, whenever we use a special function such as the sine, the exponential, or the logarithm, the argument should in general be dimensionless. To see this, consider the Taylor series of the exponential:

e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!}.

If $x$ has units then each term on the right hand side will have different units. You can't add length + length² + length³ + ..., so the right hand side is undefined. The logarithm is a bit different, in that you can take the log of a quantity with units (see the appendix), but this means that the entropy $h(X)$ will itself have units.

Mathematically, this shows up as an ambiguity in the definition [eqContinuousEntropy]. Suppose we choose a different parameterisation $y=x/2$, and let $g(y)\mathrm{d}y$ be the probability distribution with respect to $y$. We would hope that the entropy computed in these coordinates will be the same:

h(X) = -\int g(y)\log g(y)\;\mathrm{d}y.

Otherwise, given a random variable $X$, how do you know which is the 'one true parameterisation'? To verify this we transform the distribution:

g(y)\mathrm{d}y = f(x)\mathrm{d}x = f(2y)(2\mathrm{d}y),

from which we see $g(y)=2f(2y)$. From this the entropy evaluates to

\begin{aligned} -\int f(x)\log f(x)\;\mathrm{d}x &= -\int g(y)\log \left( \frac{1}{2}g(y)\right)\;\mathrm{d}y \\ &= -\int g(y)\log g(y)\;\mathrm{d}y+\log 2.\end{aligned}

Thus if we have two different parameterisations of the same random variable, we will calculate two different entropies! This seems like a very big problem. If you ask "how random is the distribution of people's heights?", you make no reference to whether those heights are measured in centimetres or feet.
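We can verify this shift numerically. A minimal sketch, reusing the Gaussian example and the substitution $g(y)=2f(2y)$ from above:

```python
import math

def f(x, mean=3.0, sigma=1.5):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def g(y):
    return 2 * f(2 * y)  # density of Y = X/2

def entropy(pdf, lo, hi, dx=0.001):
    """Midpoint-rule approximation of -∫ pdf log pdf."""
    n = int(round((hi - lo) / dx))
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        total -= p * math.log(p) * dx
    return total

h_x = entropy(f, -12.0, 18.0)  # entropy in x coordinates
h_y = entropy(g, -6.0, 9.0)    # entropy in y coordinates
print(h_x - h_y)  # ≈ log 2 ≈ 0.6931
```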

Continuous entropy as a length #

The resolution comes from looking at continuous entropy through the geometric formulation. We discussed in the previous post that exponentiating the discrete entropy measures the volume of sample space occupied by a discrete random variable, where 'volume' was interpreted as 'number of points'. If we have a continuous (and one-dimensional) random variable, the continuous entropy also measures the occupied volume of sample space, but now by 'volume' we mean the one-dimensional 'length'.

To understand this better, consider a uniform distribution on the closed interval $[a,b]$, where $-\infty < a < b < +\infty$. If we let $x$ represent a point on the real line, this distribution has probability density

f(x) = \begin{cases} \frac{1}{b-a} & a<x<b \\ 0 & \mathrm{otherwise}.\end{cases}

Performing the integral in [eqContinuousEntropy] we find the entropy to be $\log(b-a)$. To get the geometric meaning we exponentiate this: $e^{h(X)}=b-a$, which is indeed the 'length' that the uniform distribution occupies.
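A quick numerical check, assuming for concreteness $a=1$ and $b=5$:

```python
import math

a, b = 1.0, 5.0          # assumed example endpoints
dens = 1.0 / (b - a)     # uniform density on (a, b)

# For a constant density, h(X) = -∫ f log f dx reduces to a single term.
h = -(b - a) * dens * math.log(dens)
length = math.exp(h)
print(h, length)  # log(4) ≈ 1.3863, and exp(h) recovers the length 4.0
```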

Let's change coordinates to $y=x/2$, in which case the probability density becomes

g(y) = \begin{cases} \frac{1}{(b-a)/2} & \frac{a}{2}<y<\frac{b}{2} \\ 0 & \mathrm{otherwise}.\end{cases}

The random variable has half the 'length' it had before, since it is uniformly distributed over the closed interval $[\frac{a}{2},\frac{b}{2}]$. We would therefore expect the entropy in these coordinates to be different: the parameterisation dependence isn't a bug, it's a feature! We can confirm this by computing

-\int g(y)\log g(y)\;\mathrm{d}y = \log\left(\frac{b-a}{2}\right),

and taking the exponential gives a new length of $\frac{b-a}{2}$.

If the random variable weren't uniformly distributed, the probability density would make it more concentrated on some parts of the sample space, so the overall volume would decrease. Thus the continuous entropy measures the volume of a random variable, where now by 'volume' we mean 'length'. This length must necessarily change in different coordinates, which is the reason for the parameterisation-dependence of the continuous entropy.

Continuous and discrete entropies #

Let's take a look at a Gaussian distribution with mean $\bar{x}$ and variance $\sigma^2$:

f(x)\mathrm{d}x=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\bar{x})^2}{2\sigma^2}\right)\mathrm{d}x.

The entropy of this works out to be $\frac{1}{2}\log\left(2\pi e\sigma^2\right)$. This is independent of $\bar{x}$, which is expected, since translating the distribution doesn't change the volume of sample space it occupies. The entropy does however increase with $\sigma$, since the distribution becomes more spread out.

If $\sigma$ grows smaller the entropy shrinks, eventually becoming negative. The discrete entropy is always positive, which in the geometric interpretation is equivalent to saying that a random variable can't occupy less than a single point in sample space. The continuous entropy however can be negative, but this isn't really an issue. Exponentiating, $e^{h(X)}$ is still a positive volume; a negative continuous entropy simply means that the 'length' occupied by the random variable is less than one.

Something interesting happens when we let $\sigma\rightarrow 0$, in which case the continuous entropy approaches negative infinity. Then $e^{h(X)}=e^{-\infty}=0$, i.e. the random variable has a volume of zero. The meaning of this is that a Gaussian with zero variance occupies only a single point.
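We can watch the occupied length shrink using the closed form $\frac{1}{2}\log(2\pi e\sigma^2)$ quoted above; the particular $\sigma$ values here are just illustrative:

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian: (1/2) log(2 pi e sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

for sigma in (1.5, 0.3, 0.1, 0.01):
    h = gaussian_entropy(sigma)
    # exp(h) = sigma * sqrt(2 pi e): always a positive length,
    # even when h itself has gone negative.
    print(sigma, h, math.exp(h))
```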

This highlights the contrast between the discrete entropy $H$ and the continuous entropy $h$. Both of these measure the 'volume' of a random variable, but with different definitions of 'volume'. For the discrete entropy we mean 'number of points', while for the continuous entropy we mean 'length'. If you have a mathematical bent, they correspond to choosing different measures on sample space, an idea I explore in §5.2 of my thesis.

With these ideas in mind, suppose $X$ is a continuous random variable. What would be the discrete entropy $H(X)$? Recall that $e^{H(X)}$ counts the number of points in the sample space that $X$ occupies. A continuous random variable occupies an infinite number of points, so we must have $H(X)=\infty$. Similarly, consider $h(A)$ where $A$ is a discrete random variable. Any discrete subset of the real line has volume zero, so we must have $e^{h(A)}=0$, which immediately gives $h(A)=-\infty$.
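One way to see $H(X)=\infty$ concretely is to discretise $X$ into bins of width $\Delta$ and watch the discrete entropy diverge as $\Delta\rightarrow 0$. A minimal sketch, using the Gaussian example; the standard result (see e.g. [CT05]) is that $H\approx h(X)-\log\Delta$ for small $\Delta$:

```python
import math

def gaussian_pdf(x, mean=3.0, sigma=1.5):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def discretised_entropy(delta):
    """Discrete entropy of X quantised into bins of width delta around the mean."""
    H = 0.0
    n = int(30.0 / delta)  # cover mean ± 30, far into the tails
    for i in range(-n, n):
        p = gaussian_pdf(3.0 + (i + 0.5) * delta) * delta  # bin probability, midpoint rule
        if p > 0:
            H -= p * math.log(p)
    return H

for delta in (1.0, 0.1, 0.01):
    print(delta, discretised_entropy(delta))  # grows like h(X) - log(delta)
```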

Conclusion #

The continuous entropy is defined analogously to the discrete entropy, replacing the probability with a probability density and the sum with an integral. Despite this, the continuous entropy is not just the extension of the discrete entropy to continuous random variables. Both quantify the volume of sample space occupied by a random variable, but the discrete entropy defines volume as 'number of points', while the continuous entropy defines it as 'length'. Because of this, the discrete entropy of a continuous random variable is infinity, while the continuous entropy of a discrete random variable is negative infinity.

We can see then that the geometric interpretation unifies the concepts of discrete and continuous entropy, and shows how they relate to one another. Historically the continuous entropy has been regarded as strange and ill-defined, because it depends on the parameterisation and can be negative. However these features are not only natural, but expected, when we view entropy as quantifying a 'volume'.

We are now able to quantify the 'spread' of a random variable, whether it is discrete or continuous. In the next post we will see how this changes when we learn information, for example by measuring a system. This will let us analyse how 'good' a measurement is: by what factor does it shrink the spread?


Appendix #

Why you should always write the infinitesimal #

Suppose you are given a circle of random radius $r$, where the probability density on the possible radii is $g(r)\mathrm{d}r$. Then the probability of getting a radius between $r_1$ and $r_2$ is

[eqrProbIntegral]: \int_{r_1}^{r_2} g(r)\mathrm{d}r.

Let's be concrete, and take $g(r)$ to be Gaussian distributed about some mean $\bar{r}$ with variance $\sigma^2$:

[eqrDensity]: g(r)\mathrm{d}r=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(r-\bar{r})^2}{2\sigma^2}\right)\mathrm{d}r.

We may also describe the circles not by radius $r$, but by their area $a$, which is related by $a=\pi r^2$ (or $r=\sqrt{a/\pi}$). Then the mean area would be $\bar{a}=\pi\bar{r}^2$, and the bounds $r_i$ in [eqrProbIntegral] would become $a_i=\pi r_i^2$. Finally, we need to find the probability density in terms of the areas, $h(a)\mathrm{d}a$.

A common mistake is to take $g(r)$ and substitute $r=\sqrt{a/\pi}$. This would give us

[eqhSubstituted]: h(a)\overset{?}{=}g(\sqrt{a/\pi})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(\sqrt{a}-\sqrt{\bar{a}})^2}{2\sigma^2\pi}\right),

where the question mark is to emphasise that this is not in fact correct. We can see this by considering units. The right hand side of [eqhSubstituted] has units of $1/\sigma$, where $\sigma$ has units of length, since it is the standard deviation of our distribution of radii. However the integral

\int_{a_1}^{a_2}h(a)\mathrm{d}a

gives a probability, which is unitless. The infinitesimal $\mathrm{d}a$ has units of area, so $h(a)$ should have units of 1/area.

The correct answer comes from transforming not the density but the integral [eqrProbIntegral]. Differentiating $a=\pi r^2$ gives $\mathrm{d}a = 2\pi r\,\mathrm{d}r = 2\sqrt{\pi a}\,\mathrm{d}r$, so the integral becomes

\int_{a_1}^{a_2} g(\sqrt{a/\pi})\,\frac{\mathrm{d}a}{2\sqrt{\pi a}}.

We can therefore see that

h(a)\mathrm{d}a=\frac{g(\sqrt{a/\pi})}{2\sqrt{\pi a}}\,\mathrm{d}a.
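We can sanity-check the transformed density numerically: integrating it over $[a_1,a_2]$ should reproduce the probability computed from $g(r)$ over $[r_1,r_2]$. A minimal sketch, assuming the illustrative values $\bar{r}=2$ and $\sigma=0.5$:

```python
import math

r_bar, sigma = 2.0, 0.5  # assumed example values

def g(r):
    """Gaussian density over radii."""
    return math.exp(-(r - r_bar) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def h(a):
    """Transformed density over areas: g(sqrt(a/pi)) / (2 sqrt(pi a))."""
    return g(math.sqrt(a / math.pi)) / (2 * math.sqrt(math.pi * a))

def integrate(fn, lo, hi, n=100_000):
    """Midpoint-rule integration."""
    d = (hi - lo) / n
    return sum(fn(lo + (i + 0.5) * d) * d for i in range(n))

r1, r2 = 1.5, 2.5
p_r = integrate(g, r1, r2)                                  # probability in r
p_a = integrate(h, math.pi * r1 ** 2, math.pi * r2 ** 2)    # same probability in a
print(p_r, p_a)  # both ≈ 0.6827
```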

Thus when changing variables, we must always remember to transform both the density function and the infinitesimal. We don't substitute $r=\sqrt{a/\pi}$ into $g(r)$; we substitute it into $g(r)\mathrm{d}r$. The infinitesimal is part of the probability density, and that's why I recommend that you always include it.

More generally, even if you aren't transforming the distribution, including the infinitesimal is still more natural. A probability density only becomes a probability when you include the infinitesimal, and you need to keep it there for the whole expression to be dimensionless.

The logarithm and units #

Can you take the logarithm of a quantity with units? It turns out that you can, so long as you take appropriate care; in this section we will analyse the consequences of such a reckless decision.

Firstly, we should think about what 'units' mean precisely. A quantity with units scales by a given factor under the transformations we call 'changes of units'. For example, if a quantity has units of centimetres, this means there is an object $\mathrm{cm}$ such that if we perform the transformation centimetres $\rightarrow$ metres, we have $\mathrm{cm}\rightarrow 0.01\,\mathrm{m}$.

With this in mind, let's take the logarithm of $10\,\mathrm{cm}$. Using $\log(ab)=\log(a)+\log(b)$, this becomes

\log(10\,\mathrm{cm})=\log(10)+\log(\mathrm{cm}).

If we perform centimetres $\rightarrow$ metres, then $\mathrm{cm}$ will transform as described above, while a number such as $10$ will stay the same:

\begin{aligned}\log(0.1\,\mathrm{m}) &= \log(10)+\log(0.01\,\mathrm{m}) \\ &=\log\left(10\times 0.01\right)+\log\left(\mathrm{m}\right).\end{aligned}

Putting this together, we see that if you take the logarithm of a quantity with units, it ends up with 'logarithmic units'. If you are making a log-plot of a quantity with units of centimetres, the axis should be labelled $\log(\mathrm{cm})$, as a reminder that you need an extra $+\log(\mathrm{cm})$ to account for a change of units.

Now let's see how this applies to the parameterisation problem discussed earlier. The parameterisations $x$ and $y$ can be seen as different units for the real line. To account for this, we will suppose that $x$ has some unit of 'length', which we will call $\mathrm{L}$. Then $\mathrm{d}x$ also has units of $\mathrm{L}$, so the equation

1 = \int f(x)\;\mathrm{d}x

requires $f(x)$ to have units $\mathrm{L}^{-1}$.

The continuous entropy is

h = -\int \left(f(x)\mathrm{L}^{-1}\right)\log\left(f(x)\mathrm{L}^{-1}\right)\;\mathrm{d}(x\mathrm{L}).

The factors of $\mathrm{L}$ and $\mathrm{L}^{-1}$ outside the logarithm cancel, giving
\begin{aligned} &=-\int f(x)\log f(x)\,\mathrm{d}x-\int f(x)\log\left(\mathrm{L}^{-1}\right)\mathrm{d}x \\ &= -\int f(x)\log f(x)\,\mathrm{d}x+\log(\mathrm{L}) \\ &= h(f)\rvert_x+\log(\mathrm{L}),\end{aligned}

where $h(f)\rvert_x$ denotes the entropy of the probability density $f$ when calculated in the coordinate $x$, i.e. in units of $\mathrm{L}$.

The transformation $y=x/2$ can be seen as choosing new units $\mathrm{L}' = 2\mathrm{L}$. Then:

\begin{aligned}h(f)\rvert_y+\log(\mathrm{L}')&=h(f)\rvert_x+\log(\mathrm{L}'/2)\\ &=\left(h(f)\rvert_x-\log 2\right)+\log(\mathrm{L}'),\end{aligned}

This gives us the result that we found before.

References #

The discussion on continuous entropy was mostly inspired by
[CT05] Cover, T. M., & Thomas, J. A. (2005). Elements of Information Theory. John Wiley & Sons.