Introduction #

In a previous post we saw how entropy quantifies the 'volume' occupied by a discrete random variable. However, many random quantities are continuous, for example if we are measuring the strength of a magnetic field. To analyse these we have to define a continuous entropy. This has some strange features: it can be negative, and depends on parameterisation. What does it mean to have a negative amount of randomness? And how can it depend on the way that you describe your random variable?

In this post we will consider the entropy of a continuous random variable, again from the geometric viewpoint. Thinking in terms of 'volumes' makes the strange features above both intuitive and expected. It turns out that the continuous entropy is a slightly different beast to the discrete case, and the geometric picture is the best way to understand this.

Defining a continuous entropy #

If $A$ is a discrete random variable, with $n$ different outcomes and associated probabilities $\{p_1,\ldots,p_n\}$, the discrete entropy is defined as:

[eqDiscreteEntropy]: H(A)=-\sum_{i=1}^n p_i\log p_i.

The meaning of this expression, and its interpretation as a volume of the sample space, was discussed in the previous post.

Now, suppose we have a continuous random variable $X$. This doesn't have a discrete set of outcomes, but rather takes values in some range $[a,b]$, according to a probability density $f(x)\mathrm{d}x$. You should always include the $\mathrm{d}x$ part! You need the differential to have the right units, and to transform correctly when you change variables (see the appendix). For example, let $X$ be a Gaussian with mean $3$ and standard deviation $1.5$. Then this can take a value anywhere on the real line, $[a,b]=[-\infty,\infty]$, and the probability density is

f(x)\mathrm{d}x=\frac{1}{\sqrt{2\pi(1.5)^2}}\exp\left(-\frac{(x-3)^2}{2(1.5)^2}\right)\mathrm{d}x.

The probability of attaining any particular point $x$ is zero, i.e. the chance of sampling a Gaussian and getting exactly $x=3.0000\ldots$ is nil. Thus $f(x_0)$ is not the probability that we will observe $X=x_0$. Rather, $f(x_0)\mathrm{d}x$ gives the probability that if we measure $X$, we will observe a value between $x_0$ and $x_0+\mathrm{d}x$ (provided $\mathrm{d}x$ is small). Returning to our Gaussian example and setting $\mathrm{d}x=0.01$,

f(3)\times 0.01=0.0026596\cdots,

which is the probability that if we measured $X$, we would observe a value between $3$ and $3.01$.
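We can check this arithmetic directly. A minimal sketch in Python, using the example density above:

```python
import math

def gaussian_pdf(x, mean=3.0, sigma=1.5):
    """Gaussian probability density with the mean and sigma of the example."""
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

dx = 0.01
prob = gaussian_pdf(3.0) * dx  # P(3 < X < 3.01), to first order in dx
print(prob)  # ≈ 0.0026596
```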

Given a random variable $X$, we define the continuous entropy (also called differential entropy) by naïvely taking [eqDiscreteEntropy] and swapping the sum for an integral and the probability for the probability density:

[eqContinuousEntropy]: h(X)=-\int f(x)\log f(x)\;\mathrm{d}x.

At first glance this may seem perfectly innocuous. However, the continuous case introduces several subtleties, which have confused information theorists ever since it was first introduced. The continuous entropy is in fact a different beast to the discrete entropy, which is why it is written with a lowercase '$h$'.
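The integral in [eqContinuousEntropy] is easy to approximate numerically. A minimal sketch, using a plain midpoint Riemann sum and the Gaussian from the example above (σ = 1.5):

```python
import math

def gaussian_pdf(x, mean=3.0, sigma=1.5):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Midpoint-rule approximation of h(X) = -∫ f(x) log f(x) dx,
# truncating the integral at mean ± 10 sigma (the tails are negligible).
dx = 0.001
n = int(30.0 / dx)
h = 0.0
for i in range(n):
    x = -12.0 + (i + 0.5) * dx
    p = gaussian_pdf(x)
    h -= p * math.log(p) * dx

print(h)  # ≈ 1.824
```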

A parametrisation problem #

One of the first indicators that something is off about [eqContinuousEntropy] comes from thinking about units. As discussed in the appendix, if $x$ has units then so must $f(x)$, for $f(x)\mathrm{d}x$ to be dimensionless. For example, if $x$ is a length, then $f(x)$ will have units of 1/length. However, whenever we use a special function such as the sine, the exponential, or the logarithm, the argument should in general be dimensionless. To see this, consider the Taylor series of the exponential:

e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!}.

If $x$ has units then each term on the right hand side will have different units. You can't add length + length² + length³ + ..., so the right hand side is undefined. The logarithm is a bit different, in that you can take the log of a quantity with units (see the appendix), but this means that the entropy $h(X)$ will itself have units.

Mathematically, this shows up as an ambiguity in the definition [eqContinuousEntropy]. Suppose we choose a different parameterisation $y=x/2$, and let $g(y)\mathrm{d}y$ be the probability distribution with respect to $y$. We would hope that the entropy computed in these coordinates will be the same:

h(X) = -\int g(y)\log g(y)\;\mathrm{d}y.

Otherwise, given a random variable $X$, how do you know which is the 'one true parameterisation'? To verify this we transform the distribution:

g(y)\mathrm{d}y = f(x)\mathrm{d}x = f(2y)(2\mathrm{d}y),

from which we see $g(y)=2f(2y)$. From this the entropy evaluates to

\begin{aligned} -\int f(x)\log f(x)\;\mathrm{d}x &= -\int g(y)\log \left( \frac{1}{2}g(y)\right)\;\mathrm{d}y \\ &= -\int g(y)\log g(y)\;\mathrm{d}y+\log 2.\end{aligned}

Thus if we have two different parameterisations of the same random variable, we will calculate two different entropies! This seems like a very big problem. If you ask "how random is the distribution of people's heights?", you make no reference to whether those heights are measured in centimetres or feet.
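We can verify this shift numerically. A minimal sketch, reusing the Gaussian example and the substitution $g(y)=2f(2y)$ from above:

```python
import math

def f(x, mean=3.0, sigma=1.5):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def g(y):
    return 2 * f(2 * y)  # density of Y = X/2

def entropy(pdf, lo, hi, dx=0.001):
    """Midpoint-rule approximation of -∫ pdf log pdf."""
    n = int(round((hi - lo) / dx))
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        total -= p * math.log(p) * dx
    return total

h_x = entropy(f, -12.0, 18.0)  # entropy in x coordinates
h_y = entropy(g, -6.0, 9.0)    # entropy in y coordinates
print(h_x - h_y)  # ≈ log 2 ≈ 0.6931
```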

Continuous entropy as a length #

The resolution comes from looking at continuous entropy through the geometric formulation. We discussed in the previous post that exponentiating the discrete entropy measures the volume of sample space occupied by a discrete random variable, where 'volume' was interpreted as 'number of points'. If we have a continuous (and one-dimensional) random variable, the continuous entropy also measures the occupied volume of sample space, but now by 'volume' we mean the one-dimensional 'length'.

To understand this better, consider a uniform distribution on the closed interval $[a,b]$, where $-\infty < a < b < +\infty$. If we let $x$ represent a point on the real line, this distribution has probability density

f(x) = \begin{cases} \frac{1}{b-a} & a<x<b \\ 0 & \mathrm{otherwise}.\end{cases}

Performing the integral in [eqContinuousEntropy] we find the entropy to be $\log(b-a)$. To get the geometric meaning we exponentiate this: $e^{h(X)}=b-a$, which is indeed the 'length' that the uniform distribution occupies.
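A quick numerical check, assuming for concreteness $a=1$ and $b=5$:

```python
import math

a, b = 1.0, 5.0          # assumed example endpoints
dens = 1.0 / (b - a)     # uniform density on (a, b)

# For a constant density, h(X) = -∫ f log f dx reduces to a single term.
h = -(b - a) * dens * math.log(dens)
length = math.exp(h)
print(h, length)  # log(4) ≈ 1.3863, and exp(h) recovers the length 4.0
```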

Let's change coordinates to $y=x/2$, in which case the probability density becomes

g(y) = \begin{cases} \frac{1}{(b-a)/2} & \frac{a}{2}<y<\frac{b}{2} \\ 0 & \mathrm{otherwise}.\end{cases}

The random variable has half the 'length' it had before, since it is uniformly distributed over the closed interval $[\frac{a}{2},\frac{b}{2}]$. We would therefore expect the entropy in these coordinates to be different: the parameterisation dependence isn't a bug, it's a feature! We can confirm this by computing

-\int g(y)\log g(y)\;\mathrm{d}y = \log\left(\frac{b-a}{2}\right),

and taking the exponential gives a new length of $\frac{b-a}{2}$.

If the random variable weren't uniformly distributed, the probability density would make it more concentrated on some parts of the sample space, so the overall volume would decrease. Thus the continuous entropy measures the volume of a random variable, where now by 'volume' we mean 'length'. This length must necessarily change in different coordinates, which is the reason for the parameterisation-dependence of the continuous entropy.

Continuous and discrete entropies #

Let's take a look at a Gaussian distribution with mean $\bar{x}$ and variance $\sigma^2$:

f(x)\mathrm{d}x=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\bar{x})^2}{2\sigma^2}\right)\mathrm{d}x.

The entropy of this works out to be $\frac{1}{2}\log\left(2\pi e\sigma^2\right)$. This is independent of $\bar{x}$, which is expected, since translating the distribution doesn't change the volume of sample space it occupies. The entropy does however increase with $\sigma$, since the distribution becomes more spread out.

If $\sigma$ grows smaller the entropy shrinks, eventually becoming negative. The discrete entropy is always positive, which in the geometric interpretation is equivalent to saying that a random variable can't occupy less than a single point in sample space. The continuous entropy however can be negative, but this isn't really an issue. Exponentiating, $e^{h(X)}$ is still a positive volume; a negative continuous entropy simply means that the 'length' occupied by the random variable is less than one.

Something interesting happens when we let $\sigma\rightarrow 0$, in which case the continuous entropy approaches negative infinity. Then $e^{h(X)}=e^{-\infty}=0$, i.e. the random variable has a volume of zero. The meaning of this is that a Gaussian with zero variance occupies only a single point.
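We can watch the occupied length shrink using the closed form $\frac{1}{2}\log(2\pi e\sigma^2)$ quoted above; the particular $\sigma$ values here are just illustrative:

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian: (1/2) log(2 pi e sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

for sigma in (1.5, 0.3, 0.1, 0.01):
    h = gaussian_entropy(sigma)
    # exp(h) = sigma * sqrt(2 pi e): always a positive length,
    # even when h itself has gone negative.
    print(sigma, h, math.exp(h))
```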

This highlights the contrast between the discrete entropy $H$ and the continuous entropy $h$. Both of these measure the 'volume' of a random variable, but with different definitions of 'volume'. For the discrete entropy we mean 'number of points', while for the continuous entropy we mean 'length'. If you have a mathematical bent, they correspond to choosing different measures on sample space, an idea I explore in §5.2 of my thesis.

With these ideas in mind, suppose $X$ is a continuous random variable. What would be the discrete entropy $H(X)$? Recall that $e^{H(X)}$ counts the number of points in the sample space that $X$ occupies. A continuous random variable occupies an infinite number of points, so we must have $H(X)=\infty$. Similarly, consider $h(A)$ where $A$ is a discrete random variable. Any discrete subset of the real line has volume zero, so we must have $e^{h(A)}=0$, which immediately gives $h(A)=-\infty$.
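One way to see $H(X)=\infty$ concretely is to discretise $X$ into bins of width $\Delta$ and watch the discrete entropy diverge as $\Delta\rightarrow 0$. A minimal sketch, using the Gaussian example; the standard result (see e.g. [CT05]) is that $H\approx h(X)-\log\Delta$ for small $\Delta$:

```python
import math

def gaussian_pdf(x, mean=3.0, sigma=1.5):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def discretised_entropy(delta):
    """Discrete entropy of X quantised into bins of width delta around the mean."""
    H = 0.0
    n = int(30.0 / delta)  # cover mean ± 30, far into the tails
    for i in range(-n, n):
        p = gaussian_pdf(3.0 + (i + 0.5) * delta) * delta  # bin probability, midpoint rule
        if p > 0:
            H -= p * math.log(p)
    return H

for delta in (1.0, 0.1, 0.01):
    print(delta, discretised_entropy(delta))  # grows like h(X) - log(delta)
```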

Conclusion #

The continuous entropy is defined analogously to the discrete entropy, replacing the probability with a probability density and the sum with an integral. Despite this, the continuous entropy is not just the extension of the discrete entropy to continuous random variables. Both quantify the volume of sample space occupied by a random variable, but the discrete entropy defines volume as 'number of points', while the continuous entropy defines it as 'length'. Because of this, the discrete entropy of a continuous random variable is infinity, while the continuous entropy of a discrete random variable is negative infinity.

We can see then that the geometric interpretation unifies the concepts of discrete and continuous entropy, and shows how they relate to one another. Historically the continuous entropy has been regarded as strange and ill-defined, because it depends on the parameterisation and can be negative. However these features are not only natural, but expected, when we view entropy as quantifying a 'volume'.

We are now able to quantify the 'spread' of a random variable, whether it is discrete or continuous. In the next post we will see how this changes when we learn information, for example by measuring a system. This will let us analyse how 'good' a measurement is: by what factor does it shrink the spread?


Appendix #

Why you should always write the infinitesimal #

Suppose you are given a circle of random radius $r$, where the probability density on the possible radii is $g(r)\mathrm{d}r$. Then the probability of getting a radius between $r_1$ and $r_2$ is

[eqrProbIntegral]: \int_{r_1}^{r_2} g(r)\mathrm{d}r.

Let's be concrete, and take $g(r)$ to be Gaussian distributed about some mean $\bar{r}$ with variance $\sigma^2$:

[eqrDensity]: g(r)\mathrm{d}r=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(r-\bar{r})^2}{2\sigma^2}\right)\mathrm{d}r.

We may also describe the circles not by radius $r$, but by their area $a$, which is related by $a=\pi r^2$ (or $r=\sqrt{a/\pi}$). Then the mean area would be $\bar{a}=\pi\bar{r}^2$, and the bounds $r_i$ in [eqrProbIntegral] would become $a_i=\pi r_i^2$. Finally, we need to find the probability density in terms of the areas, $h(a)\mathrm{d}a$.

A common mistake is to take $g(r)$ and substitute $r=\sqrt{a/\pi}$. This would give us

[eqhSubstituted]: h(a)\overset{?}{=}g(\sqrt{a/\pi})=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(\sqrt{a}-\sqrt{\bar{a}})^2}{2\sigma^2\pi}\right),

where the question mark is to emphasise that this is not in fact correct. We can see this by considering units. The right hand side of [eqhSubstituted] has units of $1/\sigma$, where $\sigma$ has units of length, since it is the standard deviation of our distribution of radii. However the integral

\int_{a_1}^{a_2}h(a)\mathrm{d}a

gives a probability, which is unitless. The infinitesimal $\mathrm{d}a$ has units of area, so $h(a)$ should have units of 1/area.

The correct answer comes from transforming not the density but the integral [eqrProbIntegral]. Differentiating $a=\pi r^2$ gives $\mathrm{d}a = 2\pi r\,\mathrm{d}r = 2\sqrt{\pi a}\,\mathrm{d}r$, so the integral becomes

\int_{a_1}^{a_2} g(\sqrt{a/\pi})\,\frac{\mathrm{d}a}{2\sqrt{\pi a}}.

We can therefore see that

h(a)\mathrm{d}a=\frac{g(\sqrt{a/\pi})}{2\sqrt{\pi a}}\,\mathrm{d}a.
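We can sanity-check the transformed density numerically: integrating it over $[a_1,a_2]$ should reproduce the probability computed from $g(r)$ over $[r_1,r_2]$. A minimal sketch, assuming the illustrative values $\bar{r}=2$ and $\sigma=0.5$:

```python
import math

r_bar, sigma = 2.0, 0.5  # assumed example values

def g(r):
    """Gaussian density over radii."""
    return math.exp(-(r - r_bar) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def h(a):
    """Transformed density over areas: g(sqrt(a/pi)) / (2 sqrt(pi a))."""
    return g(math.sqrt(a / math.pi)) / (2 * math.sqrt(math.pi * a))

def integrate(fn, lo, hi, n=100_000):
    """Midpoint-rule integration."""
    d = (hi - lo) / n
    return sum(fn(lo + (i + 0.5) * d) * d for i in range(n))

r1, r2 = 1.5, 2.5
p_r = integrate(g, r1, r2)                                  # probability in r
p_a = integrate(h, math.pi * r1 ** 2, math.pi * r2 ** 2)    # same probability in a
print(p_r, p_a)  # both ≈ 0.6827
```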

Thus when changing variables, we must always remember to transform both the density function and the infinitesimal. We don't substitute $r=\sqrt{a/\pi}$ into $g(r)$; we substitute it into $g(r)\mathrm{d}r$. The infinitesimal is part of the probability density, and that's why I recommend that you always include it.

More generally, even if you aren't transforming the distribution, including the infinitesimal is still more natural. A probability density only becomes a probability when you include the infinitesimal, and you need to keep it there for the whole expression to be dimensionless.

The logarithm and units #

Can you take the logarithm of a quantity with units? It turns out that you can, so long as you take appropriate care; in this section we will analyse the consequences of such a reckless decision.

Firstly, we should think about what 'units' mean precisely. A quantity with units scales by a given factor under the transformations we call 'changes of units'. For example, if a quantity has units of centimetres, this means there is an object $\mathrm{cm}$ such that if we perform the transformation centimetres $\rightarrow$ metres, we have $\mathrm{cm}\rightarrow 0.01\,\mathrm{m}$.

With this in mind, let's take the logarithm of $10\,\mathrm{cm}$. Using $\log(ab)=\log(a)+\log(b)$, this becomes

\log(10\,\mathrm{cm})=\log(10)+\log(\mathrm{cm}).

If we perform centimetres $\rightarrow$ metres, then $\mathrm{cm}$ will transform as described above, while a number such as $10$ will stay the same:

\begin{aligned}\log(0.1\,\mathrm{m}) &= \log(10)+\log(0.01\,\mathrm{m}) \\ &=\log\left(10\times 0.01\right)+\log\left(\mathrm{m}\right).\end{aligned}

Putting this together, we see that if you take the logarithm of a quantity with units, it ends up with 'logarithmic units'. If you are making a log-plot of a quantity with units of centimetres, the axis should be labelled $\log(\mathrm{cm})$, as a reminder that you need an extra $+\log(\mathrm{cm})$ to account for a change of units.

Now let's see how this applies to the parameterisation problem discussed earlier. The parameterisations $x$ and $y$ can be seen as different units for the real line. To account for this, we will suppose that $x$ has some unit of 'length', which we will call $\mathrm{L}$. Then $\mathrm{d}x$ also has units of $\mathrm{L}$, so the equation

1 = \int f(x)\;\mathrm{d}x

requires $f(x)$ to have units $\mathrm{L}^{-1}$.

The continuous entropy is

h = -\int \left(f(x)\mathrm{L}^{-1}\right)\log\left(f(x)\mathrm{L}^{-1}\right)\;\mathrm{d}(x\mathrm{L}).

The factors of $\mathrm{L}$ and $\mathrm{L}^{-1}$ outside the logarithm cancel, giving
\begin{aligned} &=-\int f(x)\log f(x)\,\mathrm{d}x-\int f(x)\log\left(\mathrm{L}^{-1}\right)\mathrm{d}x \\ &= -\int f(x)\log f(x)\,\mathrm{d}x+\log(\mathrm{L}) \\ &= h(f)\rvert_x+\log(\mathrm{L}),\end{aligned}

where $h(f)\rvert_x$ denotes the entropy of the probability density $f$ when calculated in the coordinate $x$, i.e. in units of $\mathrm{L}$.

The transformation $y=x/2$ can be seen as choosing new units $\mathrm{L}' = 2\mathrm{L}$. Then:

\begin{aligned}h(f)\rvert_y+\log(\mathrm{L}')&=h(f)\rvert_x+\log(\mathrm{L}'/2)\\ &=\left(h(f)\rvert_x-\log 2\right)+\log(\mathrm{L}'),\end{aligned}

This gives us the result that we found before.

References #

The discussion on continuous entropy was mostly inspired by
[CT05] Cover, T. M., & Thomas, J. A. (2005). Elements of Information Theory. John Wiley & Sons.