Entropy as a ‘volume’ – the geometric interpretation

1. Introduction

This is the first in a series of posts which will introduce classical information theory using the geometric framework — in contrast to the often-used viewpoint of binary bits. The goal is to understand how to use information theory to study probability distributions, and how they change as we gain or lose information. This article shows how the geometric formulation of entropy provides a nice way to visualise probability distributions as occupying `volumes’ in the space of possible outcomes.

The series is based on material from chapter four of my thesis, where I used these ideas to study measurement in quantum systems. We won’t talk about quantum mechanics here, but I will briefly describe the problem so you can get an idea of what we can do with these tools. Suppose we wish to measure the strength of a magnetic field, whose value we denote by {\Phi}. We will have some prior knowledge about the field — for example it may be Gaussian distributed about some field strength {\phi_0} — and so {\Phi} has a known probability distribution. We take a quantum state {\rho}, interact this with the field, and then make a measurement on {\rho}, the outcome {M} of which also follows a known probability distribution given by quantum mechanics. What information does {M} contain about {\Phi}? Information theory and the geometric formulation are useful tools to attack this problem.

In what follows I will avoid mathematical rigour and make intuitive arguments. For a more complete development see [Wil17] and [CT05], on which this introduction is based.

2. Probability preliminaries

This post will use the language of modern probability theory, so here we briefly introduce the necessary vocabulary. The mathematical object used to model random events is called a random variable, which is a set of outcomes, called the sample space, together with an associated probability distribution. When we observe the outcome of a random variable we say that we sample it. For example a die roll has a sample space {\{1,2,3,4,5,6\}}, each outcome with an equal probability of {1/6}, and sampling this twice might result in outcomes `2′ and `4′. This is an example of a discrete random variable, where the sample space is a discrete set. Other random variables are continuous, such as if you were to sample the height of a random person.

Returning to the magnetic field, the sample space is the set of possible field strengths. This will likely be continuous. The probability distribution then encodes our prior knowledge of which outcomes are more or less likely.

3. Discrete entropy

Suppose {X} is a discrete random variable, with {N} possible outcomes in the sample space. If the {i}th outcome has probability {p_i}, the entropy of {X} is defined as

\displaystyle H(X)=-\sum_{i=1}^Np_i\log p_i=\sum_{i=1}^Np_i\log\left(\frac{1}{p_i}\right). \ \ \ \ \ (1)

Unless explicitly stated, all logarithms will be in base {e}; this choice is discussed at the end of section 4. We will spend the rest of this section studying the meaning of this sum.
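Eq. (1) is easy to compute directly. Below is a minimal Python sketch (the function name `entropy` is my own, not a library routine):

```python
import math

def entropy(probs, base=math.e):
    """Eq. (1): H(X) = -sum_i p_i log p_i; outcomes with p_i = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair die: uniform over six outcomes, so the entropy saturates
# the bound H(X) <= log N mentioned later in this section.
print(entropy([1/6] * 6))  # equals log 6
print(math.log(6))
```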

Suppose you were to sample {X}, with knowledge of the probability distribution of samples. The term {\log(1/p_i)} quantifies how much information you would gain upon observing the {i}th outcome:

  • If {p_i=1} then you already knew the outcome beforehand so gain no new information: {\log(1/1)=0}. The term {\log(1/p_i)} is small for values of {p_i\approx 1}. Highly probable events do not represent a large gain in information, as you were already `expecting’ these from the probability distribution. If NASA were to announce that they searched the skies and could find no large asteroids heading towards the Earth, they would not be telling us much more than we already assumed.
  • If {p_i=\epsilon\ll 1} then witnessing this represents a significant amount of `surprise’: the information gain {\log(1/\epsilon)} is very large. You would feel like you had learned a lot of information if NASA were to announce that they had found a large asteroid on a collision course with Earth.

The entropy therefore characterises the average amount of information you gain per observation. A highly random probability distribution evenly spread among a large number of outcomes will always `surprise’ you, and has high entropy. On the other hand, a very predictable distribution concentrated amongst a few highly probable outcomes has low entropy. It can be shown that entropy takes its minimal value of {H(X)=0} if {p_i=1} for some {i}, and is bounded by {H(X)\le \log N} — with equality achieved only for the uniform distribution. (The latter is shown most easily using the method of Lagrange multipliers, see for example [Wil17,§ 10.1].)

As an illustration let’s consider a biased coin where heads has probability {p}, and tails {1-p}. The entropy of this distribution is called the binary entropy:

\displaystyle H(p)=-p\log p-(1-p)\log(1-p). \ \ \ \ \ (2)

We plot {H(p)} below:

The binary entropy is zero in the deterministic case when {p} is either one or zero. For these values the coin is either always heads or always tails; you know the outcome before the coin toss so gain no new information by observing the outcome. The maximum value of {\log(2)} is attained when {p=0.5} and the distribution is as `random’ as possible.
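A quick numerical check of these extremes (a sketch; `binary_entropy` is my own helper):

```python
import math

def binary_entropy(p):
    """Eq. (2): H(p) = -p log p - (1 - p) log(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

print(binary_entropy(0.5))   # log 2, the maximum: the fairest possible coin
print(binary_entropy(0.01))  # close to zero: a very predictable coin
```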

4. The volume of a random variable

In this section we show that entropy gives us the `volume’ occupied by a probability distribution, the so-called geometric interpretation of information theory [Wil17,§ 2]. Entropy was introduced by Claude Shannon as a tool of the information age [Sha48], when messages were being transmitted as binary sequences of zeros and ones. We will consider an example along these lines, so it will be convenient to take our logarithms to be in base two, which we indicate with a subscript.

Suppose our random variable {X} represents a letter in some alphabet, and sampling this multiple times gives us a message we wish to transmit. The probability distribution of {X} is the relative frequency with which each letter is used. This could be seen as a crude model for the written language, or {X} may be the outcome of some experiment or sensor readout that we wish to transmit to someone else. Shannon’s key insight was that the probability distribution of {X} introduces a fundamental limit on how efficiently we may encode a message as binary — the more `spread out’ the distribution, the less efficient an encoding we may use.

We consider an alphabet made up of four letters: {\{a,b,c,d\}}. First suppose that in our language each occurred with equal frequency, so the probability distribution was uniform: {p_i=1/4}. In this case the entropy of {X} is

\displaystyle H_2(X)=4\times\left(-\frac{1}{4}\log_2\frac{1}{4}\right)=2, \ \ \ \ \ (3)

with the subscript in `{H_2}‘ to denote the base 2 logarithm. If we wish to encode a message in binary, Shannon’s fundamental result was to show that we would need {H_2(X)} bits per character [Sha48] — the higher the entropy, the less efficient an encoding we can achieve. We will take this on faith here; for a discussion of how to make this argument see [Wil17,§ 2.1]. One way of encoding the outcome of {X} is

\displaystyle \{(a,00),(b,01),(c,10),(d,11)\}, \ \ \ \ \ (4)

in which case a message {aaba} would be {00000110}. If {X} is uniformly distributed then two bits per character is the best we can do.

Letters do not however appear with equal frequency, and we may use this structure to create a more efficient encoding. As an extreme example consider a different language {Y} with the same four letters, but suppose that {a} occurred {97\%} of the time — while {b}, {c}, and {d} each had probability {1\%}. We could then use the code

\displaystyle \{(a,0),(b,100),(c,101),(d,111)\}, \ \ \ \ \ (5)

which would write {aaba} as {001000}. The number of bits used per letter now varies, but a string of ones and zeros can still be unambiguously decoded. A 0 represents an `a’, while a 1 means that the next two bits will tell you if it is a `b’, `c’, or `d’. Since {a} appears more frequently, {97\%} of the time we would only use a single bit. The expected number of bits per character with this encoding is then {0.97\times 1+0.01\times 3\times 3=1.06} (thanks u/CompassRed for pointing out a missing 3), so this code is nearly twice as efficient as what we had for {X}. Calculating the entropy we find {H_2(Y)\approx 0.24}, so we are still about four times less efficient than the optimal code. One way to do better would be to encode multiple characters at a time. Since `{a}‘ occurs so often we might let {0} represent the sequence `aaa’, and this way the average number of bits per letter would be less than one.
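If you want to verify the numbers in this example yourself, here is a short sketch (variable names are my own):

```python
import math

# Language Y: letter frequencies and the variable-length code from the text
probs = {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}
code = {"a": "0", "b": "100", "c": "101", "d": "111"}

# Expected number of bits per letter under this code
avg_bits = sum(p * len(code[letter]) for letter, p in probs.items())

# Base-two entropy: Shannon's lower limit on bits per letter
H2 = -sum(p * math.log2(p) for p in probs.values())

print(avg_bits)  # 1.06 bits per letter with this code
print(H2)        # roughly 0.242 bits per letter
```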

The two random variables {X} and {Y}, despite occupying the same sample space, ended up having very different `sizes’. We would require two bits per character to transmit {X} as a binary signal, as opposed to only {0.24} for {Y}. This `size’ is determined by the probability distribution, and quantified by the entropy. Thinking about random variables in this manner is called the geometric interpretation of entropy.

In this picture we see our random variable as occupying a sub-volume of the sample space due to the structure of its probability distribution. It may seem strange to use the word `volume’ for a discrete random variable, when we really mean `number of points’. Points, length, area, and our usual three-dimensional `volume’ are all ways of measuring `sizes’, and which one is relevant depends on the dimension of the problem at hand. In this series we will use the word `volume’ to refer to all of them. As we will discuss later this can all be made rigorous through the lens of measure theory (or see § 5.2 of my thesis if you are impatient).

The volume occupied by a random variable is found by raising the base of our logarithm to the power of the entropy.

  • In the example just covered {2^{H_2(X)}=2^2=4}, and so the random variable {X} — which is uniformly distributed — occupies the entire sample space.
  • Suppose we had a variable {Z} which always used the letter {b} with probability one. We would have {H_2(Z)=0}, and {2^{H_2(Z)}=2^0=1}. The variable occupies only a single letter as expected. For our variable {Y} we have {2^{H_2(Y)}=2^{0.24}\approx 1.18}, so this occupies only slightly more than a single letter.
  • If a variable {Z'} were evenly distributed between {a} and {b} with probability {1/2}, the entropy would be

    \displaystyle -\frac{1}{2}\log_2\frac{1}{2} -\frac{1}{2}\log_2\frac{1}{2}=1, \ \ \ \ \ (6)

    and {2^{H_2(Z')}=2^{1}=2}, occupying exactly two letters out of the sample space. (Thanks to u/Avicton and u/inetic on Reddit for pointing out a typo in the original Eq. (6))
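The three bullet points can be reproduced in a few lines (again, `H2` is my own helper, not a library function):

```python
import math

def H2(probs):
    """Base-two entropy of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Volume = 2**H2: how much of the sample space each variable occupies
for name, dist in [("X (uniform)", [0.25] * 4),
                   ("Y (mostly a)", [0.97, 0.01, 0.01, 0.01]),
                   ("Z (always b)", [1.0]),
                   ("Z' (a or b)", [0.5, 0.5])]:
    print(name, 2 ** H2(dist))  # 4, roughly 1.18, 1, 2 respectively
```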

If we return to the binary entropy of a biased coin, the sample space had two elements: heads and tails. Exponentiating {H(p)} gives:

We can see that in the deterministic cases {p=0} and {p=1} the random variable occupies a single element, either always tails or always heads. The full sample space is occupied when the coin is as random as possible, at {p=0.5}.

Let us briefly discuss the role of the base two logarithm. The final `volumes’ are exactly the same as if we used the natural logarithm, since:

\displaystyle 2^{H_2(X)}=e^{H(X)}. \ \ \ \ \ (7)

The only role of base two was the intermediate step, when we interpreted {H_2(X)} as the number of bits required to transmit each character. This was because we were signalling in binary — if we were sending our signals in trinary for example (with three `trits’ rather than two `bits’) then we would have had to reach for the base three logarithm instead.
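The identity in Eq. (7) is easy to check numerically, for example on the distribution of {Y} from earlier (a quick sketch):

```python
import math

p = [0.97, 0.01, 0.01, 0.01]
H_nat = -sum(q * math.log(q) for q in p)   # entropy with the natural logarithm
H_two = -sum(q * math.log2(q) for q in p)  # entropy with the base-two logarithm

# The occupied volume is the same whichever base we exponentiate
print(2 ** H_two)
print(math.exp(H_nat))
```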

Often in the literature base two is used, reflecting the role of information theory in studying digital communications. We however don’t want to study transmitting information in binary, but rather how the volume of a random variable changes as we gain information. Insofar as any base is natural, we may as well choose the natural logarithm in base {e}.

5. Conclusion

In the geometric picture we visualise a random variable {X} as occupying a sub-volume of its sample space, which is given by {e^{H(X)}}. A uniform distribution is spread out evenly over the entire sample space, while a deterministic random variable occupies only a single point.

In future posts we will see that the geometric picture offers a nice way to unify the entropy of a continuous and discrete random variable. All that you need to do is switch your idea of size from the discrete ‘number of points’ to the continuous analogues of `length’, `area’, or three-dimensional `volume’. This will be particularly useful when we want to describe the interaction between continuous and discrete quantities. In the quantum example I discussed at the start, the measurement {M} has discrete outcomes, from which we want to infer the continuous value of the magnetic field.

The next post will discuss correlations between random variables, such as when you observe a measurement {M} to try and learn about some parameter {\Phi}. We will see how the volume of sample space occupied by the random variable shrinks as you gain more information about it, until it collapses to a single point with perfect information.

Let me know if you have any questions or comments about this article. Follow @ruvi_l on Twitter for more posts like this, or join the discussion on Reddit.

6. Notes and further reading

7. References

[Wil17] Wilde, M. (2017). Quantum information theory (Second Edition). Cambridge University Press.

[CT05] Cover, T. M., & Thomas, J. A. (2005). Elements of Information Theory. In Elements of Information Theory. John Wiley & Sons. https://doi.org/10.1002/047174882X

[Sha48] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

This post was written using the excellent tool LaTeX to WordPress.


The Transfer Matrix for a Boundary

1. Introduction

Let’s apply the material from Making Friends with Electromagnetic Boundary Conditions to calculate a transfer matrix. We will consider a TE wave travelling between materials with parameters {\mu_1,\epsilon_1} and {\mu_2,\epsilon_2}:


There are a few things to note:

  • This is a TE wave, so the electric field {\mathbf{E}} is transverse to the plane of incidence. In this case we chose to let {\mathbf{E}} point into the page, which then determines the direction of the magnetic field since {(\mathbf{k},\mathbf{E},\mathbf{B})} form a right-handed coordinate system.
  • The total wavevector {\mathbf{k}} is made up of a component {k_z} in the {z}-direction and {k_x} in the {x} direction.
  • The field on each side is made up of a forward-propagating wave ({a_1,c_1}) and a backward-propagating wave ({b_1,d_1}). On the left we have

    \displaystyle E=\left(a_1e^{i(k_{1z}z-\omega t)}+b_1e^{i(-k_{1z}z-\omega t)}\right)e^{ik_xx}, \ \ \ \ \ (1)

    and similarly for the right.

We want {(a_1,b_1)} to represent the electric field in the first medium, but do we want them to represent {\mathbf{E}} or {\mathbf{D}}? To decide, let’s take a look at the boundary conditions:

\displaystyle \begin{gathered} \hat{\mathbf{n}}\cdot\left(\mathbf{B}_2-\mathbf{B}_1\right)=0;\; \hat{\mathbf{n}}\times\left(\mathbf{E}_1-\mathbf{E}_2\right)=0, \\ \hat{\mathbf{n}}\cdot\left(\mathbf{D}_2-\mathbf{D}_1\right)=0;\;\hat{\mathbf{n}}\times\left(\mathbf{H}_2-\mathbf{H}_1\right)=0. \end{gathered} \ \ \ \ \ (2)

The electric field is parallel to the boundary, so the condition {\hat{\mathbf{n}}\cdot(\mathbf{D}_2-\mathbf{D}_1)=0} is satisfied automatically. Meanwhile, since {\mathbf{E}} is perpendicular to {\hat{\mathbf{n}}}, you can convince yourself that the other boundary condition for the electric field gives {\mathbf{E}_1=\mathbf{E}_2}. Thus the electric field is continuous across the boundary, so we want {(a_1,b_1)} to represent the electric field. In this case if {a_1,b_1} are the fields just before the boundary and {c_1,d_1} the fields just after, we have

\displaystyle \begin{aligned} \mathbf{E}_1 &= \mathbf{E}_2, \\ a_1+b_1&=c_1+d_1. \end{aligned} \ \ \ \ \ (3)

Note that we have chosen coordinates such that the exponentials in Eq. (1) are one right at the boundary.

Aside: If we had a TM wave we would choose {(a_1,b_1)} to represent {\mathbf{H}} rather than {\mathbf{B}}, since {\mathbf{H}} is what would be continuous across the boundary and we would again have {a_1+b_1=c_1+d_1}. Note that this is just a convenience; you could use {\mathbf{B}} if you really wanted to, and using {\mathbf{B}=\mu \mathbf{H}} the boundary condition would then become {(a_1+b_1)/\mu_1=(c_1+d_1)/\mu_2}.

We need one more relation between the {a_1,b_1,c_1,d_1}, and we have two boundary conditions we haven’t used so far:

\displaystyle \hat{\mathbf{n}}\cdot(\mathbf{B}_2-\mathbf{B}_1)=0;\;\hat{\mathbf{n}}\times(\mathbf{H}_2-\mathbf{H}_1)=0. \ \ \ \ \ (4)

These are in terms of the magnetic field, which we want to relate to the electric field. This comes from Maxwell’s equation:

\displaystyle \nabla\times\mathbf{E}=-\partial_t\mathbf{B}=i\omega\mathbf{B}. \ \ \ \ \ (5)

What would happen if we took the {\mathbf{B}} equation? Since {\mathbf{B}} is being dotted with {\hat{\mathbf{n}}}, we see that the {z}-component of {\mathbf{B}} is continuous across the boundary. This tells us that {(\nabla\times\mathbf{E})_z} is continuous, which is:

\displaystyle (\nabla\times\mathbf{E})_z=\frac{\partial E_y}{\partial x}-\frac{\partial E_x}{\partial y}=\frac{\partial E_y}{\partial x}, \ \ \ \ \ (6)

since only {E_y} is nonzero. Meanwhile the {\mathbf{H}} equation gives us a relation for the {x}-component of {\mathbf{H}}, and

\displaystyle (\nabla\times\mathbf{E})_x=\frac{\partial E_z}{\partial y}-\frac{\partial E_y}{\partial z}=-\frac{\partial E_y}{\partial z}. \ \ \ \ \ (7)

The wave is propagating in the {z} direction, so this is what we want.

Using {\mathbf{B}=\mu\mathbf{H}}, Eq. (7) tells us that

\displaystyle \left.\frac{1}{\mu_1}\frac{\partial E_y}{\partial z}\right\rvert_{Left}=\left.\frac{1}{\mu_2}\frac{\partial E_y}{\partial z}\right\rvert_{Right}. \ \ \ \ \ (8)

Looking at Eq. (1), the left hand side is

\displaystyle \frac{ik_{1z}}{\mu_1}\left(a_1-b_1\right), \ \ \ \ \ (9)

while the right hand side will be

\displaystyle \frac{ik_{2z}}{\mu_2}\left(c_1-d_1\right). \ \ \ \ \ (10)

Thus this boundary condition is:

\displaystyle a_1-b_1=\frac{\mu_1}{\mu_2}\frac{k_{2z}}{k_{1z}}(c_1-d_1). \ \ \ \ \ (11)

The last thing is to write {k_{2z}/k_{1z}} in terms of the parameters of the material. We know that {|k|=\omega\sqrt{\mu\epsilon}}, and {k_x^2+k_z^2=k^2}. Thus

\displaystyle \begin{aligned} k_{1z} &= \sqrt{\omega^2\mu_1\epsilon_1-k_x^2}, \\ k_{2z} &= \sqrt{\omega^2\mu_2\epsilon_2-k_x^2}. \\ \end{aligned} \ \ \ \ \ (12)

Eq. (11) thus gives us our second boundary equation:

\displaystyle a_1-b_1=\frac{\mu_1}{\mu_2}\sqrt{\frac{\omega^2\mu_2\epsilon_2-k_x^2}{\omega^2\mu_1\epsilon_1-k_x^2}}(c_1-d_1). \ \ \ \ \ (13)

This is in terms of the frequency of the wave {\omega}, the parameters {\mu_i,\epsilon_i} of the material, as well as {k_x}, which is a parameter telling us the angle of the input wave. For normal incidence we have {k_x=0}.
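Putting the pieces together, Eq. (3) and Eq. (13) can be solved for {(a_1,b_1)} in terms of {(c_1,d_1)}, giving the transfer matrix for the boundary. A numerical sketch (the function name and packaging are my own; units with {\mu=\epsilon=1} for vacuum are assumed):

```python
import numpy as np

def boundary_transfer_matrix(omega, mu1, eps1, mu2, eps2, kx=0.0):
    """Matrix M with (a1, b1) = M @ (c1, d1) for a TE wave at a single boundary.

    Built from the two boundary equations derived above:
        a1 + b1 = c1 + d1
        a1 - b1 = eta * (c1 - d1),   eta = (mu1 / mu2) * (k2z / k1z).
    """
    k1z = np.sqrt(omega**2 * mu1 * eps1 - kx**2 + 0j)  # +0j keeps evanescent waves valid
    k2z = np.sqrt(omega**2 * mu2 * eps2 - kx**2 + 0j)
    eta = (mu1 / mu2) * (k2z / k1z)
    return 0.5 * np.array([[1 + eta, 1 - eta],
                           [1 - eta, 1 + eta]])

# Identical media on both sides: no reflection, so M is the identity matrix
print(boundary_transfer_matrix(omega=1.0, mu1=1.0, eps1=1.0, mu2=1.0, eps2=1.0))
```

Multiplying matrices like this one (together with propagation matrices for the bulk of each layer) is how transfer-matrix calculations for layered media are usually chained together.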




Making Friends with Electromagnetic Boundary Conditions

1. Introduction

In The Meaning of Maxwell’s Equations we looked at the geometric meaning of Maxwell’s equations:

\displaystyle \begin{gathered} \nabla\cdot \mathbf{B}=0;\;\nabla\times \mathbf{E}+\frac{\partial \mathbf{B}}{\partial t}=0, \\ \nabla\cdot \mathbf{E}=\frac{\rho}{\epsilon_0};\; \nabla\times \mathbf{B}-\epsilon_0\mu_0\frac{\partial \mathbf{E}}{\partial t}=\mu_0\mathbf{j}. \end{gathered} \ \ \ \ \ (1)

The key thing to remember was: if you see a divergence draw a three-dimensional volume and use the divergence theorem, while if you see a curl draw a two-dimensional surface and use Stokes’ theorem. We are going to use this to derive the electromagnetic boundary conditions. Hopefully by the end of this they should seem pretty intuitive, and you will be able to quickly guess the boundary conditions just by looking at Maxwell’s equations.

Now things are complicated by the fact that if we are not in a vacuum, the electric and magnetic fields can induce extra charges {\rho} and currents {\mathbf{j}} in the material, which induce new fields, which induce even more charges and currents, and so on. For this reason it is convenient to write some of the equations in terms of {\mathbf{D}} and {\mathbf{H}}, because then only the `free’ charges {\rho_f} and currents {\mathbf{j}_f}, the ones that are put in at the start rather than induced by the fields, show up:

\displaystyle \begin{gathered} \nabla\cdot \mathbf{B}=0;\;\nabla\times \mathbf{E}+\frac{\partial \mathbf{B}}{\partial t}=0, \\ \nabla\cdot \mathbf{D}=\rho_f;\; \nabla\times \mathbf{H}-\frac{\partial \mathbf{D}}{\partial t}=\mathbf{j}_f. \end{gathered} \ \ \ \ \ (2)

2. Deriving the boundary conditions

Let’s suppose we have a boundary between two media, and some sort of electromagnetic wave propagates between the two. We want to know how the fields on one side relate to the fields on the other.


We will look at the conditions imposed on the fields by each of Maxwell’s laws.

2.1. Gauss’ Law for the Magnetic Field

For our first candidate we will look at

\displaystyle \nabla\cdot\mathbf{B}=0. \ \ \ \ \ (3)

There is a divergence, so that means we want to draw a three-dimensional box on both sides of the boundary, and use the divergence theorem to convert the left hand side to an integral of {\mathbf{B}\cdot\hat{\mathbf{n}}} over the surface.


We have a solid rectangular box. The blue face lies inside the left region, the green face inside the right region, while the black faces cross both regions. Integrating both sides of Gauss’ Law gives that the integral of {\mathbf{B}\cdot\hat{\mathbf{n}}} over all six faces is equal to zero:

\displaystyle 0=\int_{\mathrm{Blue\;face}}\mathbf{B}_1\cdot\hat{\mathbf{n}} +\int_{\mathrm{Green\;face}}\mathbf{B}_2\cdot\hat{\mathbf{n}}+\int_{\mathrm{Black\;faces}}\mathbf{B}\cdot\hat{\mathbf{n}}. \ \ \ \ \ (4)

Let’s suppose the blue and green faces have some area {A}. Note that the unit normal {\hat{\mathbf{n}}} has opposite sign on each side of the box, so it is pointing against {\mathbf{B}_1} but with {\mathbf{B}_2}. On the blue and green faces, taking the dot product with the unit normal selects the components of the magnetic field which are perpendicular to the boundary. We then have:

\displaystyle 0=-|\mathbf{B}_1^{\perp}|A+|\mathbf{B}_2^{\perp}|A+\int_{\mathrm{Black\;faces}}\mathbf{B}\cdot\hat{\mathbf{n}}. \ \ \ \ \ (5)

The integral over the black faces is annoying because each face has contributions from both {B_1} and {B_2}, so it doesn’t have a nice answer. We can get rid of these parts though by making the box extremely thin. We bring the blue and green face closer together, shrinking the black faces to zero area while keeping the areas of the blue and green faces at {A}. With this we will have

\displaystyle 0=-|\mathbf{B}_1^{\perp}|A+|\mathbf{B}_2^{\perp}|A. \ \ \ \ \ (6)

Dividing through by area then gives

\displaystyle |\mathbf{B}_1^{\perp}|=|\mathbf{B}_2^{\perp}|. \ \ \ \ \ (7)

In other words, the perpendicular component of the magnetic field is unchanged across the boundary. Since the perpendicular component of the magnetic field is {\mathbf{B}\cdot\hat{\mathbf{n}}}, another way of writing this boundary condition is

\displaystyle \hat{\mathbf{n}}\cdot\left(\mathbf{B}_2-\mathbf{B}_1\right)=0 \ \ \ \ \ (8)

2.2. Gauss’ Law for the Electric Field

Now we will analyse

\displaystyle \nabla\cdot\mathbf{D}=\rho_f. \ \ \ \ \ (9)

The analysis for the left hand side is identical to what we had for the magnetic field, and we end up with

\displaystyle -|\mathbf{D}_1^{\perp}|A+|\mathbf{D}_2^{\perp}|A. \ \ \ \ \ (10)

On the right hand side however the volume integral of {\rho_f} over the box will be equal to the enclosed free charge. Let’s assume there is a charge density {\sigma_f} per unit area on the boundary, then the right hand side will be {\sigma_f A}, and we will have

\displaystyle -|\mathbf{D}_1^{\perp}|A+|\mathbf{D}_2^{\perp}|A=\sigma_f A, \ \ \ \ \ (11)


\displaystyle -|\mathbf{D}_1^{\perp}|+|\mathbf{D}_2^{\perp}|=\sigma_f. \ \ \ \ \ (12)

As before we may re-write this as

\displaystyle \hat{\mathbf{n}}\cdot\left(\mathbf{D}_2-\mathbf{D}_1\right)=\sigma_f. \ \ \ \ \ (13)

Let’s consider a special case. Suppose there is no free charge on the boundary ({\sigma_f=0}), and we are in linear media, so {\mathbf{D}_1=\epsilon_1\mathbf{E}_1} and {\mathbf{D}_2=\epsilon_2\mathbf{E}_2}. In this case the boundary condition becomes:

\displaystyle \hat{\mathbf{n}}\cdot\left(\epsilon_2\mathbf{E}_2-\epsilon_1\mathbf{E}_1\right)=0, \ \ \ \ \ (14)

and we see that the normal component of the electric field is discontinuous across the boundary. The intuition for this is that the electric fields induce a bound charge density on the boundary, which causes the normal component of the electric fields to be discontinuous.

2.3. Faraday’s Law

We’re done with the divergences, so let’s move onto the simpler one of the curl equations:

\displaystyle \nabla\times \mathbf{E}=-\frac{\partial \mathbf{B}}{\partial t}. \ \ \ \ \ (15)

Since we have a curl, this time we will draw a two-dimensional surface and then use Stokes’ theorem.


Now the blue line lies inside the left material, the green line inside the right material, and the black lines cross the boundary and lie in both materials. We integrate both sides of Faraday’s law over this surface. For the left hand side Stokes’ theorem converts the integral of the curl over this surface to the line integral around the boundary:

\displaystyle \int_{\mathrm{Blue\;line}}\mathbf{E}_1\cdot d\mathbf{l} +\int_{\mathrm{Green\;line}}\mathbf{E}_2\cdot d\mathbf{l}+ \int_{\mathrm{Black\;lines}}\mathbf{E}\cdot d\mathbf{l}. \ \ \ \ \ (16)

Suppose the blue and green lines have length {l}; then the line integrals over these become the parallel component of {\mathbf{E}} multiplied by {l}, again with a sign difference because the line integral is pointing down on the blue side and up on the green side (as it goes anticlockwise around the rectangle):

\displaystyle =|\mathbf{E}_1^{\parallel}|l-|\mathbf{E}_2^{\parallel}|l+ \int_{\mathrm{Black\;lines}}\mathbf{E}\cdot d\mathbf{l}. \ \ \ \ \ (17)

Again the integrals over the black line are annoying, as each line crosses between the two regions. But again the cure is the same, to move the blue and green lines closer together (keeping them at length {l}), squeezing the rectangle thinner and thinner until the lengths of the black lines go to zero. In this case the integral over the black lines vanishes and we have

\displaystyle =|\mathbf{E}_1^{\parallel}|l-|\mathbf{E}_2^{\parallel}|l \ \ \ \ \ (18)

For the right hand side we want to integrate {-\partial_t\mathbf{B}} over the face of the rectangle. However we just squeezed the rectangle infinitely thin, so we will be integrating this over zero area, which will give us zero. The net result is

\displaystyle |\mathbf{E}_1^{\parallel}|l-|\mathbf{E}_2^{\parallel}|l=0, \ \ \ \ \ (19)

and then dividing by {l} we find that the parallel component of the electric field is continuous across the boundary. Since the parallel component is the part perpendicular to the normal vector, we can also write this as

\displaystyle \hat{\mathbf{n}}\times\left(\mathbf{E}_1-\mathbf{E}_2\right)=0. \ \ \ \ \ (20)

2.4. Ampère’s Law

This one is left as an exercise to the reader! Begin with

\displaystyle \nabla\times \mathbf{H}=\mathbf{j}_f+\frac{\partial\mathbf{D}}{\partial t}, \ \ \ \ \ (21)

and show that you end up with

\displaystyle \hat{\mathbf{n}}\times\left(\mathbf{H}_2-\mathbf{H}_1\right)=\mathbf{j}_s \ \ \ \ \ (22)

where {\mathbf{j}_s} is the surface current density. If there is no free current and we are in a linear material ({\mathbf{B}=\mu\mathbf{H}}), this becomes

\displaystyle \hat{\mathbf{n}}\times\left(\frac{1}{\mu_2}\mathbf{B}_2-\frac{1}{\mu_1}\mathbf{B}_1\right)=0. \ \ \ \ \ (23)

3. Conclusion

There you have it! Once you understand the general principle, you can read off the boundary conditions very quickly by just looking at Maxwell’s laws:

\displaystyle \begin{gathered} \nabla\cdot \mathbf{B}=0;\;\nabla\times \mathbf{E}+\frac{\partial \mathbf{B}}{\partial t}=0, \\ \nabla\cdot \mathbf{E}=\frac{\rho}{\epsilon_0};\; \nabla\times \mathbf{B}-\epsilon_0\mu_0\frac{\partial \mathbf{E}}{\partial t}=\mu_0\mathbf{j}, \end{gathered} \ \ \ \ \ (24)

The divergence laws will tell you about the perpendicular components, while the curl laws tell you about the parallel. The homogeneous (source-free) laws give you continuity (the same fields on either side), while the inhomogeneous laws lead to discontinuity (factors of {\mu} and {\epsilon} on either side). After a bit of thinking you should be able to jump straight to

\displaystyle \begin{gathered} \hat{\mathbf{n}}\cdot\left(\mathbf{B}_2-\mathbf{B}_1\right)=0;\; \hat{\mathbf{n}}\times\left(\mathbf{E}_1-\mathbf{E}_2\right)=0, \\ \hat{\mathbf{n}}\cdot\left(\mathbf{D}_2-\mathbf{D}_1\right)=\sigma_f;\;\hat{\mathbf{n}}\times\left(\mathbf{H}_2-\mathbf{H}_1\right)=\mathbf{j}_s. \end{gathered} \ \ \ \ \ (25)




The Meaning of Maxwell’s Equations

1. Introduction

Let’s take a look at Maxwell’s equations (in differential form):

\displaystyle \begin{gathered} \nabla\cdot \mathbf{B}=0;\;\nabla\times \mathbf{E}+\frac{\partial \mathbf{B}}{\partial t}=0, \\ \nabla\cdot \mathbf{E}=\frac{\rho}{\epsilon_0};\; \nabla\times \mathbf{B}-\epsilon_0\mu_0\frac{\partial \mathbf{E}}{\partial t}=\mu_0\mathbf{j}. \end{gathered} \ \ \ \ \ (1)

Let’s try and understand what these mean geometrically, and how you can go about using them.

2. Some Vector Calculus

Firstly we need some vector calculus. Let’s start off with some vector field {\mathbf{A}=(A_x,A_y,A_z)}. The divergence of {\mathbf{A}} is given by

\displaystyle \nabla\cdot\mathbf{A}=\partial_xA_x+\partial_yA_y+\partial_zA_z. \ \ \ \ \ (2)

What does the divergence mean intuitively? Imagine placing a tiny sphere at some point {\mathbf{p}=(x_0,y_0,z_0)}, and letting the surface of the sphere be pushed and pulled by the vector field {\mathbf{A}}. Depending on the vector field the surface of the sphere will be distorted, and its volume will change. The rate of change of volume is given by the divergence of {\mathbf{A}} at {\mathbf{p}}. If the divergence is positive, that means the volume of the sphere will increase. If the divergence is negative, then the volume of the sphere will decrease. If the divergence is zero then the shape of the sphere may be distorted, but in such a way that the volume remains constant.

The divergence is related to the divergence theorem. Let {V} be some solid volume, {\partial V} its surface, and {\hat{\mathbf{n}}} the normal vector. For example, if {V} were the solid ball of radius {1}, then {\partial V} would be the surface of that ball, namely the sphere of radius {1}, and {\hat{\mathbf{n}}} the unit normal vector on the sphere. The divergence theorem relates the integral of the divergence of {\mathbf{A}} over {V} with the integral of {\mathbf{A}\cdot\hat{\mathbf{n}}} over the surface of {V}:

\displaystyle \int_V\nabla\cdot\mathbf{A}\,dV=\int_{\partial V}\mathbf{A}\cdot \hat{\mathbf{n}}\,dS. \ \ \ \ \ (3)


Imagine an incompressible fluid in three dimensions, being pushed around by {\mathbf{A}}. If the divergence is positive at a point then fluid is being created and pushed outwards. If the divergence is negative then the fluid is being sucked away, while if the divergence is zero then the vector field is pushing the fluid around, without creating or destroying it. The left hand side of Eq. (3) is the sum over the entire volume {V} of how much fluid is being created or sucked up. Now let’s look at the right hand side. The dot product {\mathbf{A}\cdot\hat{\mathbf{n}}} asks how much fluid is being pushed through the boundary; if the dot product is positive then fluid is being pushed out of the surface, if the dot product is negative then fluid is being pushed into the surface, while if the dot product is zero then fluid is circulating around the surface, without going inwards or outwards.

In other words the divergence theorem says that the sum of all the fluid being created or sucked up at each point in the entire volume {V} is equal to the net amount of fluid that gets pushed into or out of the surface.
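If you'd like to see the divergence theorem in action, here is a quick numerical sketch. The test field {\mathbf{A}=(xy,yz,zx)} and the unit cube are my own arbitrary choices, just for illustration; the code integrates both sides of (3) by the midpoint rule and checks that they agree:

```python
# Numerical check of the divergence theorem on the unit cube [0,1]^3,
# for the arbitrary test field A = (x*y, y*z, z*x), whose divergence
# is y + z + x.

N = 50                      # grid resolution for midpoint-rule integration
h = 1.0 / N
mids = [(i + 0.5) * h for i in range(N)]

def div_A(x, y, z):
    return y + z + x        # d(xy)/dx + d(yz)/dy + d(zx)/dz

# Left hand side: volume integral of div A over the cube
lhs = sum(div_A(x, y, z) for x in mids for y in mids for z in mids) * h ** 3

# Right hand side: outward flux A.n through the six faces.
# E.g. on the face x = 1 the outward normal is +x and A.n = A_x = 1*y,
# while on the face x = 0 the flux A_x = 0*y vanishes.
flux = 0.0
flux += sum(1.0 * y for y in mids for z in mids) * h ** 2   # x = 1 and x = 0 faces
flux += sum(1.0 * z for x in mids for z in mids) * h ** 2   # y = 1 and y = 0 faces
flux += sum(1.0 * x for x in mids for y in mids) * h ** 2   # z = 1 and z = 0 faces

print(lhs, flux)  # both ≈ 1.5, as the theorem promises
```

The midpoint rule is exact for linear integrands, so the two sides agree to floating-point precision here.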

Next we have the curl:

\displaystyle \begin{aligned} \nabla\times\mathbf{A} &= (\partial_x,\partial_y,\partial_z)\times(A_x,A_y,A_z), \\ &=\left(\partial_yA_z-\partial_zA_y,\partial_zA_x-\partial_xA_z,\partial_xA_y-\partial_yA_x\right). \end{aligned} \ \ \ \ \ (4)

To interpret the curl, imagine placing a tiny sphere at some point {\mathbf{p}}, but fix it in place so that it cannot move. Let’s suppose this sphere is rigid, so that its surface cannot be stretched. You can imagine {\mathbf{A}} at each point on the surface of the sphere giving it a little push or pull. If we let all these pushes and pulls add up, the sphere will start to rotate. The magnitude of the curl tells you how fast the sphere will rotate due to its surface being pushed by {\mathbf{A}}, and the direction of the curl tells you the axis the sphere will rotate around.

The curl is related to Stokes’ theorem. Let {\Sigma} be a two-dimensional surface with normal vector {\hat{\mathbf{n}}}, and {\partial\Sigma} the one-dimensional boundary of {\Sigma}. Stokes’ theorem relates the integral of the curl over {\Sigma} to the line integral of {\mathbf{A}} around the boundary:

\displaystyle \int_{\Sigma}\nabla\times\mathbf{A}\cdot\hat{\mathbf{n}}\,dS=\int_{\partial\Sigma}\mathbf{A}\cdot d\mathbf{l}. \ \ \ \ \ (5)

The left hand side gives the integral over {\Sigma} of the circulation of the vector field in the plane of {\Sigma}. The right hand side gives the net circulation of {\mathbf{A}} around the boundary.

Stokes’ theorem says that the sum of circulation of fluid at every point of a two-dimensional surface is equal to the net circulation around the boundary of the surface.
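Here is the analogous numerical sketch for Stokes’ theorem, using the unit square in the {xy}-plane and an arbitrary test field of my own choosing, {\mathbf{A}=(-y,x,0)}, whose curl is the constant vector {(0,0,2)}:

```python
# Numerical check of Stokes' theorem on the unit square in the xy-plane,
# for the test field A = (-y, x, 0) with curl A = (0, 0, 2).

N = 100
h = 1.0 / N
mids = [(i + 0.5) * h for i in range(N)]

# Left hand side: flux of curl A through the square, with normal n = +z
lhs = sum(2.0 for x in mids for y in mids) * h ** 2

# Right hand side: line integral of A.dl counter-clockwise around the boundary
rhs = 0.0
rhs += sum(-0.0 for x in mids) * h   # bottom: y=0, dl = +x dx, A_x = -y = 0
rhs += sum(1.0 for y in mids) * h    # right:  x=1, dl = +y dy, A_y = x = 1
rhs += sum(1.0 for x in mids) * h    # top:    y=1, dl = -x dx, A_x = -1
rhs += sum(0.0 for y in mids) * h    # left:   x=0, dl = -y dy, A_y = 0

print(lhs, rhs)  # both ≈ 2.0
```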


3. The Meaning of Maxwell’s Equations

Armed with our knowledge of vector calculus, let’s take another look at Maxwell’s equations. We’ll begin with the divergence of the magnetic field:

\displaystyle \nabla\cdot\mathbf{B}=0. \ \ \ \ \ (6)

This equation says that there are no `sources’ or `sinks’ of the magnetic field lines. The magnetic field is neither created nor destroyed, it just flows from one place to another. If you draw a solid region, there is just as much magnetic field coming into the region as coming out. Things are slightly different for the electric field however:

\displaystyle \nabla\cdot\mathbf{E}=\frac{\rho}{\epsilon_0}. \ \ \ \ \ (7)

If there is no charge in a region of space, then electric field lines are also neither created nor destroyed. If you have positive charge however this acts as a source of electric field lines, and a region enclosing positive charge will on the whole have electric field being `produced’ inside and flowing outwards from the surface. Negative charge on the other hand acts as a sink, `sucking in’ the electric field. If you consider a region enclosing negative charge, the electric field will flow inwards through the boundary.

The moral of the story is every time you see a divergence {\nabla\cdot\mathbf{A}} in Maxwell’s equations, imagine drawing a three-dimensional volume and use the divergence theorem to convert this to an integral of {\mathbf{A}\cdot\hat{\mathbf{n}}} over the surface.

Similarly every time you see a curl {\nabla\times\mathbf{A}} in Maxwell’s equations, draw a two-dimensional surface and use Stokes’ theorem to convert this to an integral of {\mathbf{A}\cdot d\mathbf{l}} around the boundary.

Let’s see this with Faraday’s law:

\displaystyle \nabla\times \mathbf{E}=-\frac{\partial\mathbf{B}}{\partial t}. \ \ \ \ \ (8)

Integrate both sides of this over a two-dimensional surface. The right hand side will be the rate of change of the flux of {\mathbf{B}} through the surface. If the flux is changing, this will induce an electric field circulating around the boundary of this surface. The case is similar for

\displaystyle \nabla\times\mathbf{B}=\mu_0\mathbf{j}+\epsilon_0\mu_0\frac{\partial\mathbf{E}}{\partial t}, \ \ \ \ \ (9)

only now we find that a current also induces a circulating magnetic field around the boundary.


Feynman’s Vector Calculus Trick

1. Introduction

Many people are familiar with the so-called `Feynman’s trick’ of differentiating under the integral. Buried in chapter 27-3 of the Feynman Lectures on Electromagnetism [1] though there lies another trick, one which can simplify problems in vector calculus by letting you treat the derivative operator {\nabla} as any other vector, without having to worry about commutativity. I don’t know if Feynman invented this himself, but I have never stumbled across it anywhere else.

Note: u/bolbteppa on Reddit has pointed out that this idea can be found in the very first book on vector calculus, written based on lectures given by Josiah Willard Gibbs.

What this trick will allow you to do is to treat the {\nabla} operator as if it were any other vector. This means that if you know a vector identity, you can immediately derive the corresponding vector calculus identity. Furthermore even if you do not have (or don’t want to look up) the identity, you can apply the usual rules of vectors assuming that everything is commutative, which is a nice simplification.

The trick appears during the derivation of the Poynting vector. We wish to simplify

\displaystyle \nabla\cdot(B\times E), \ \ \ \ \ (1)

where {B} and {E} are the magnetic and electric field respectively, though for our purposes they can just be any vector fields.

2. The trick

The problem we want to solve is that we cannot apply the usual rules of vectors to the derivative operator. For example, we have

\displaystyle A\times B=-B\times A,\;\;A\cdot B=B\cdot A \ \ \ \ \ (2)

but it is certainly not true that

\displaystyle \nabla\times A=-A\times\nabla,\;\;\nabla\cdot A=A\cdot\nabla. \ \ \ \ \ (3)

This means that when you want to break up an expression like {\nabla\cdot(B\times E)}, you can’t immediately reach for a vector identity {A\cdot(B\times C)=B\cdot(C\times A)} and expect the result to hold. Even if you aren’t using a table of identities, it would certainly make your life easier if you could find a way to treat {\nabla} like any other vector and bash out algebra like (3).

Let’s first restrict ourselves to two scalar functions {f} and {g}. We introduce the notation

\displaystyle \frac{\partial}{\partial x_f} \ \ \ \ \ (4)

to mean a derivative operator which only acts on {f}, not {g}. Moreover, it doesn’t matter where in the expression the derivative is, it is always interpreted as acting on {f}. In our notation the following are all equivalent:

\displaystyle \frac{\partial f}{\partial x}g=\frac{\partial}{\partial x_f}fg=f\frac{\partial}{\partial x_f}g=fg\frac{\partial}{\partial x_f}. \ \ \ \ \ (5)

Why did we do this? Well now the derivative {\frac{\partial}{\partial x_f}} behaves just like any other number! We can write our terms in any order we want, and still know what we mean.

Now let’s suppose we want to differentiate a product of terms:

\displaystyle \frac{\partial}{\partial x}(fg)=\frac{\partial f}{\partial x}g+f\frac{\partial g}{\partial x}. \ \ \ \ \ (6)

We can see that whenever we have such a product, we can write:

\displaystyle \begin{aligned} \frac{\partial}{\partial x}(fg) &= \left(\frac{\partial}{\partial x_f}+\frac{\partial}{\partial x_g}\right)fg, \\ &= \frac{\partial}{\partial x_f}fg+\frac{\partial}{\partial x_g}fg. \end{aligned} \ \ \ \ \ (7)
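As a quick sanity check of the product rule we are encoding here, we can compare both sides of (6) numerically, for the arbitrarily chosen functions {f=\sin x} and {g=x^3}:

```python
import math

# Both sides of the product rule d(fg)/dx = f'g + fg', checked by
# central differences for the arbitrary choices f = sin(x), g = x^3.
f = math.sin
g = lambda x: x ** 3

h = 1e-6
x0 = 0.8
d_fg = (f(x0 + h) * g(x0 + h) - f(x0 - h) * g(x0 - h)) / (2 * h)  # d(fg)/dx
df = (f(x0 + h) - f(x0 - h)) / (2 * h)                            # df/dx
dg = (g(x0 + h) - g(x0 - h)) / (2 * h)                            # dg/dx

assert abs(d_fg - (df * g(x0) + f(x0) * dg)) < 1e-6
```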

We want to generalise this to things like {\nabla\cdot(A\times B)}. Remembering that the derivative operator is interpreted as {\nabla=\left(\frac{\partial}{\partial x},\frac{\partial}{\partial y},\frac{\partial}{\partial z}\right)}, we define

\displaystyle \nabla_A=\left(\frac{\partial}{\partial x_A},\frac{\partial}{\partial y_A},\frac{\partial}{\partial z_A}\right). \ \ \ \ \ (8)

Here {\frac{\partial}{\partial x_A}} is interpreted as acting on any of the components {A_x}, {A_y}, {A_z} of {A}.

With this notation, keeping in mind the commutativity (5) of the derivative operator, we can see that

\displaystyle \nabla_A\cdot A=A\cdot\nabla_A, \ \ \ \ \ (9)

\displaystyle \nabla_A\times A=-A\times\nabla_A. \ \ \ \ \ (10)

Work out the components and see for yourself!

In the next section we will apply this trick to derive some common vector calculus identities. The idea is to take an expression such as {\nabla\cdot(E\times B)}, write it as {(\nabla_E+\nabla_B)\cdot(E\times B)}, and then expand this using our normal vector rules until we end up with {\nabla_E} acting only on {E} and {\nabla_B} on {B}, in which case we can replace them with the original {\nabla}.

3. Some examples

Here we will see how various vector identities can be generalised to include {\nabla} using the ideas from the previous section. All the identities I am using come from the Wikipedia page [2].

You may want to try and do each of these yourself before reading the solution. Have a look at the title of the section, check the Wikipedia page [2] for the corresponding vector identity, and have a play. If you get stuck read just enough of the solution until you find out what concept you were missing, and then go back to it. As they say, mathematics is not a spectator sport!

3.1. {\nabla\cdot(A\times B)}

The corresponding vector identity is

\displaystyle A\cdot (B\times C)=B\cdot(C\times A)=C\cdot(A\times B). \ \ \ \ \ (11)

We can look at this as saying that the product {A\cdot(B\times C)} is invariant under cyclic permutations, i.e. if you shift {A\rightarrow B\rightarrow C\rightarrow A}. If we look at {A\cdot(B\times C)} as something with three slots: {\_\cdot(\_\times\_)}, this is saying that you can move everything one slot to the right (and the rightmost one `cycles’ to the left), or you can move everything one slot to the left (and the leftmost one `cycles’ to the right). This pattern comes up all the time in mathematics and physics, so it’s good to keep it in mind.

Let’s experiment and see where we go. Since every term will be a product of terms from {A} and terms from {B}, we may expand

\displaystyle \nabla\cdot(A\times B) = \nabla_A\cdot(A\times B)+\nabla_B\cdot(A\times B). \ \ \ \ \ (12)

We want to change this so that {\nabla_A} is acting on {A} and {\nabla_B} on {B}, then we can replace them with the original {\nabla}. So let’s cyclically permute the first term to the right, and the second to the left:

\displaystyle =B\cdot(\nabla_A\times A)+A\cdot(B\times\nabla_B). \ \ \ \ \ (13)

Finally, we use {A\times B=-B\times A} to re-write the last term:

\displaystyle \begin{aligned} &= B\cdot(\nabla_A\times A)-A\cdot(\nabla_B\times B), \\ &= B\cdot(\nabla\times A)-A\cdot(\nabla\times B). \end{aligned} \ \ \ \ \ (14)

We have thus derived

\displaystyle \nabla\cdot(A\times B)=B\cdot(\nabla\times A)-A\cdot(\nabla\times B). \ \ \ \ \ (15)

Better yet, now we have an idea of where that strange minus sign came from. The first two terms have the same cyclic order in their slots {\nabla\rightarrow A\rightarrow B\rightarrow\nabla}, and breaking this in the third term comes at the expense of a minus sign.
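We can also check (15) numerically. The sketch below builds finite-difference div and curl operators and evaluates both sides at a point, for two arbitrary smooth test fields of my own choosing:

```python
# Finite-difference spot check of the identity
#   div(A x B) = B . curl(A) - A . curl(B),
# for two arbitrary smooth test fields A and B.

def A(x, y, z):
    return (y * z, x * x, x + z)

def B(x, y, z):
    return (x * y, z * z, y)

h = 1e-5

def partial(F, i, p):
    """Central difference of the vector field F along axis i at point p."""
    q1, q2 = list(p), list(p)
    q1[i] += h; q2[i] -= h
    return tuple((a - b) / (2 * h) for a, b in zip(F(*q1), F(*q2)))

def div(F, p):
    return sum(partial(F, i, p)[i] for i in range(3))

def curl(F, p):
    dx, dy, dz = [partial(F, i, p) for i in range(3)]
    return (dy[2] - dz[1], dz[0] - dx[2], dx[1] - dy[0])

def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

p = (0.3, -0.7, 1.2)
AxB = lambda x, y, z: cross(A(x, y, z), B(x, y, z))
lhs = div(AxB, p)
rhs = dot(B(*p), curl(A, p)) - dot(A(*p), curl(B, p))
assert abs(lhs - rhs) < 1e-6  # the identity holds numerically
```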

3.2. {\nabla\times(A\times B)}

The corresponding vector identity is

\displaystyle A\times(B\times C)=(A\cdot C)B-(A\cdot B)C. \ \ \ \ \ (16)

We thus have

\displaystyle (\nabla_A+\nabla_B)\times(A\times B)=\nabla_A\times (A\times B)+\nabla_B\times(A\times B). \ \ \ \ \ (17)

Let’s look at the first term, the second will be analogous.

\displaystyle \nabla_A\times(A\times B) = (\nabla_A\cdot B)A-(\nabla_A\cdot A)B. \ \ \ \ \ (18)

Note that the product {\nabla_A\cdot B} is not zero, as {\nabla_A} is a derivative operator which still acts on {A} anywhere in the equation (see (5)). We rearrange the above using the commutativity of the dot product to write

\displaystyle \begin{aligned} \nabla_A\times(A\times B) &= (B\cdot\nabla_A)A-(\nabla_A\cdot A)B, \\ &= (B\cdot\nabla)A-(\nabla\cdot A)B. \end{aligned} \ \ \ \ \ (19)

Swapping {A\leftrightarrow B} we obtain

\displaystyle \nabla_B\times(B\times A) = (A\cdot\nabla)B-(\nabla\cdot B)A, \ \ \ \ \ (20)

and so, using {A\times B=-B\times A},

\displaystyle \nabla_B\times(A\times B) = -(A\cdot\nabla)B+(\nabla\cdot B)A. \ \ \ \ \ (21)

Putting the two together finally gives

\displaystyle \nabla\times(A\times B)=(B\cdot\nabla)A-(A\cdot\nabla)B+(\nabla\cdot B)A-(\nabla\cdot A)B. \ \ \ \ \ (22)
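As before, we can spot-check (22) numerically with finite differences (again the test fields are arbitrary choices of mine):

```python
# Finite-difference spot check of the identity
#   curl(A x B) = (B.grad)A - (A.grad)B + (div B)A - (div A)B,
# for two arbitrary smooth test fields A and B.

def A(x, y, z):
    return (y * z, x * x, x + z)

def B(x, y, z):
    return (x * y, z * z, y)

h = 1e-5

def partial(F, i, p):
    """Central difference of the vector field F along axis i at point p."""
    q1, q2 = list(p), list(p)
    q1[i] += h; q2[i] -= h
    return tuple((a - b) / (2 * h) for a, b in zip(F(*q1), F(*q2)))

def div(F, p):
    return sum(partial(F, i, p)[i] for i in range(3))

def curl(F, p):
    dx, dy, dz = [partial(F, i, p) for i in range(3)]
    return (dy[2] - dz[1], dz[0] - dx[2], dx[1] - dy[0])

def directional(v, F, p):
    """(v . grad) F, computed component-wise."""
    parts = [partial(F, i, p) for i in range(3)]
    return tuple(sum(v[i] * parts[i][k] for i in range(3)) for k in range(3))

def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])

p = (0.3, -0.7, 1.2)
Ap, Bp = A(*p), B(*p)
AxB = lambda x, y, z: cross(A(x, y, z), B(x, y, z))
lhs = curl(AxB, p)
rhs = tuple(directional(Bp, A, p)[k] - directional(Ap, B, p)[k]
            + div(B, p) * Ap[k] - div(A, p) * Bp[k] for k in range(3))
err = max(abs(l - r) for l, r in zip(lhs, rhs))
assert err < 1e-6  # all three components agree
```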

3.3. {\nabla\cdot(\psi A)}

Here {\psi} is just an ordinary scalar function, and {A} a vector. The difference makes this one a little bit tricky, but on the plus side we won’t have to look up any identities. Let’s begin by expanding as usual (since everything will be a product of {\psi} and terms from {A}):

\displaystyle \begin{aligned} \nabla\cdot(\psi A) &= \nabla_{\psi}\cdot(\psi A)+\nabla_A\cdot(\psi A). \end{aligned} \ \ \ \ \ (23)

For the second term we can pull the scalar {\psi} through {\nabla_A} to get {\psi(\nabla_A\cdot A)}. Let’s have a think about what we mean by the first term. The derivative operator is a vector

\displaystyle \nabla_{\psi}=\left(\frac{\partial}{\partial x_{\psi}},\frac{\partial}{\partial y_{\psi}},\frac{\partial}{\partial z_{\psi}}\right), \ \ \ \ \ (24)

and the quantity inside the brackets is a vector

\displaystyle (\psi A)=\left(\psi A_x,\psi A_y,\psi A_z\right), \ \ \ \ \ (25)

where {A_x} is the {x}-component of {A}, and so on. Taking the dot product of (24) and (25), we can see that this will give us

\displaystyle \begin{aligned} \nabla_{\psi}\cdot(\psi A) &= \frac{\partial}{\partial x_{\psi}}(\psi A_x)+\frac{\partial}{\partial y_{\psi}}(\psi A_y)+\frac{\partial}{\partial z_{\psi}}(\psi A_z), \\ &= A_x\frac{\partial \psi}{\partial x_{\psi}}+A_y\frac{\partial \psi}{\partial y_{\psi}}+A_z\frac{\partial \psi}{\partial z_{\psi}}, \\ &=A\cdot\nabla_{\psi}\psi. \end{aligned} \ \ \ \ \ (26)

Putting all this together we arrive at

\displaystyle \nabla\cdot(\psi A)=A\cdot\nabla\psi+\psi\nabla\cdot A. \ \ \ \ \ (27)
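And once more, a finite-difference spot check of (27), with an arbitrary scalar {\psi} and vector field {A} of my own choosing:

```python
import math

# Finite-difference spot check of  div(psi*A) = A . grad(psi) + psi * div(A),
# for arbitrary smooth test functions psi and A.

def psi(x, y, z):
    return math.sin(x) * y + z * z

def A(x, y, z):
    return (x * y, math.cos(z), x + y * z)

h = 1e-5

def grad(f, p):
    """Central-difference gradient of the scalar function f at point p."""
    g = []
    for i in range(3):
        q1, q2 = list(p), list(p)
        q1[i] += h; q2[i] -= h
        g.append((f(*q1) - f(*q2)) / (2 * h))
    return tuple(g)

def div(F, p):
    """Central-difference divergence of the vector field F at point p."""
    total = 0.0
    for i in range(3):
        q1, q2 = list(p), list(p)
        q1[i] += h; q2[i] -= h
        total += (F(*q1)[i] - F(*q2)[i]) / (2 * h)
    return total

p = (0.4, 1.1, -0.6)
psiA = lambda x, y, z: tuple(psi(x, y, z) * c for c in A(x, y, z))
lhs = div(psiA, p)
rhs = sum(a * g for a, g in zip(A(*p), grad(psi, p))) + psi(*p) * div(A, p)
assert abs(lhs - rhs) < 1e-6
```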

4. Conclusion

We’ve learned a neat trick to treat the derivative operator just like any other vector. This is a cool and useful idea, which I hadn’t seen anywhere before I came across it in chapter 27-3 of [1]. Leave a comment or a tweet if you find other cool applications, or have ideas for further investigation. I notably did not touch on any of the second derivatives, such as {\nabla\cdot(\nabla\times A)} or {\nabla\times(\nabla\times A)}, and I’m sure that this trick would also simplify a lot of these. I also had a look at {\nabla(A\cdot B)}, and while you could use the trick there it turned out to be a bit complicated and involved some thinking to `guess’ terms which would fit what you wanted. Let me know if you find a nice simple way of doing this.

As a final application, u/Muphrid15 mentioned that this idea can be used to generalise the derivative operator to geometric algebra (also known as Clifford algebras). This is a sort of algebra for vector spaces, allowing you to do things like add one vector space to another or adjoin and subtract dimensions, and many calculations in vector algebra can be simplified immensely when put in this language.

Follow @RLecamwasam on twitter for more posts like this, or join the discussion on Reddit:

Feynman’s Vector Calculus Trick from Physics

Feynman’s vector calculus trick from math

5. References

[1] Feynman, R. P., Leighton, R. B., & Sands, M. (1963). The Feynman Lectures on Physics, Volume II: Mainly Electromagnetism and Matter.

[2] Wikipedia contributors. (2019, February 20). Vector calculus identities. In Wikipedia, The Free Encyclopedia. Retrieved 23:01, February 22, 2019.

[3] The LaTeX was written using the excellent tool LaTeX to WordPress:
LaTeX to WordPress


Superdense coding

1. Introduction

In this article we will introduce superdense coding, a scheme which lets Alice send two bits of (classical) information to Bob by transmitting a single entangled qubit. This article will be mathematically rigorous, while hopefully also providing an intuitive explanation of what is really going on. We will assume an undergraduate understanding of quantum mechanics, including familiarity with Dirac notation and entanglement.

Suppose Alice has a qubit, whose state may be written as

\displaystyle a|0\rangle+b|1\rangle, \ \ \ \ \ (1)

where {a} and {b} are complex numbers such that {|a|^2+|b|^2=1}. It would seem from (1) that if Alice wished to encode some information in her state and then send it to Bob, she has a lot of freedom in her choice of {a} and {b}. In comparison to a classical bit, which can only take discrete values of {0} or {1}, it seems like a qubit is infinitely more powerful! However, there’s a big catch.

To access this information Bob needs to measure the qubit, and (assuming he measures in the {\{|0\rangle,|1\rangle\}} basis) his result will be either {0} or {1}, with probability {|a|^2} and {|b|^2} respectively. Once he does this the state is lost, and he can gain no more information. Thus the only way that Alice can deterministically transfer information is to send either the {|0\rangle} state or the {|1\rangle} state, in which case Bob can measure it to receive one bit of information. If Alice sends anything else, Bob won’t be able to draw a conclusion from a single measurement, after which the original state will be lost. Despite all the extra freedom we have in a qubit, the probabilistic nature of quantum measurement seems to imply we can’t do any better than with a classical bit.

It turns out however that if Alice and Bob start off by sharing an entangled state, Alice can deterministically transfer two bits of information with a single qubit, by using a scheme called ‘superdense coding’. We can think of this as them sharing one bit of entanglement, which together with the transfer of one qubit leads to two bits of information. This idea was introduced in 1992 by Charles Bennett and Stephen Wiesner (see References below for the paper link).

2. Some quantum gates

We will begin by defining four operators which Alice and Bob will use. Firstly there is the Pauli {\sigma_x}, which flips a qubit:

\displaystyle \sigma_x|0\rangle=|1\rangle, \ \ \ \ \ (2)

\displaystyle \sigma_x|1\rangle=|0\rangle. \ \ \ \ \ (3)

Next there is the Pauli {\sigma_z} operator, which flips the phase of the {|1\rangle} state:

\displaystyle \sigma_z|0\rangle = |0\rangle, \ \ \ \ \ (4)

\displaystyle \sigma_z|1\rangle = -|1\rangle. \ \ \ \ \ (5)

The Hadamard operator sends the basis states to two orthogonal superpositions:

\displaystyle H|0\rangle=\frac{1}{\sqrt{2}}\left(|0\rangle+|1\rangle\right), \ \ \ \ \ (6)

\displaystyle H|1\rangle=\frac{1}{\sqrt{2}}\left(|0\rangle-|1\rangle\right). \ \ \ \ \ (7)

We can see that the Hadamard also reverses itself:

\displaystyle \begin{aligned} H\frac{1}{\sqrt{2}}\left(|0\rangle+|1\rangle\right)&=\frac{1}{\sqrt{2}}\left(H|0\rangle+H|1\rangle\right), \\ &= \frac{1}{\sqrt{2}}\left(\frac{1}{\sqrt{2}}\left(|0\rangle+|1\rangle\right)+\frac{1}{\sqrt{2}}\left(|0\rangle-|1\rangle\right)\right), \\ &=\frac{1}{2}\left(2|0\rangle\right), \\ &= |0\rangle. \end{aligned} \ \ \ \ \ (8)


and similarly

\displaystyle H\frac{1}{\sqrt{2}}\left(|0\rangle-|1\rangle\right)=|1\rangle. \ \ \ \ \ (9)

Finally there is the only two-qubit gate we will need, the controlled not (CNOT) gate. This takes two qubits; if the first (the control) is {|0\rangle}, it leaves the whole state unchanged:

\displaystyle CNOT\left(|0\rangle |0\rangle\right)=|0\rangle |0\rangle, \ \ \ \ \ (10)

\displaystyle CNOT\left(|0\rangle |1\rangle\right)=|0\rangle |1\rangle. \ \ \ \ \ (11)

If the control qubit is {|1\rangle} however then CNOT flips the target:

\displaystyle CNOT\left(|1\rangle |0\rangle\right)=|1\rangle |1\rangle, \ \ \ \ \ (12)

\displaystyle CNOT\left(|1\rangle |1\rangle\right)=|1\rangle |0\rangle. \ \ \ \ \ (13)
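If you like to compute, the defining relations above are easy to verify using the standard matrix representations of these gates in the {\{|0\rangle,|1\rangle\}} basis. A quick numpy sketch, with the first qubit of a tensor product as the CNOT control:

```python
import numpy as np

# Standard matrix representations in the {|0>, |1>} basis
X = np.array([[0, 1], [1, 0]])                 # sigma_x
Z = np.array([[1, 0], [0, -1]])                # sigma_z
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
CNOT = np.array([[1, 0, 0, 0],                 # control = first qubit
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

assert np.allclose(X @ ket0, ket1)                        # eq. (2): flip
assert np.allclose(Z @ ket1, -ket1)                       # eq. (5): phase flip
assert np.allclose(H @ (ket0 + ket1) / np.sqrt(2), ket0)  # eq. (8)
assert np.allclose(H @ H, np.eye(2))                      # H reverses itself
assert np.allclose(CNOT @ np.kron(ket1, ket0),
                   np.kron(ket1, ket1))                   # eq. (12)
```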

3. The superdense coding protocol

Let’s see how we can encode two bits of information in a single qubit. This time, Alice and Bob start off with a pair of entangled qubits:

\displaystyle |\Psi\rangle_{AB}=\frac{1}{\sqrt{2}}\left(|0\rangle_A|0\rangle_B+|1\rangle_A|1\rangle_B\right). \ \ \ \ \ (14)

In the equation above, {|0\rangle_A} represents Alice’s qubit being {|0\rangle}. Because this system is entangled, Alice’s and Bob’s states are intrinsically linked. This is best thought of as a single bipartite system rather than two individual qubits, and so local operations on Alice’s state will affect the state {|\Psi\rangle_{AB}} of the system as a whole.

Suppose Alice has two classical bits to encode, {\alpha} and {\beta}, each of which takes value either {0} or {1}. She encodes the first bit in the parity of her and Bob’s qubits, i.e. whether they are the same or different. If {\alpha} is {0} she does nothing, and so from (14) Alice’s and Bob’s qubits will be the same. If {\alpha} is {1} she applies a {\sigma_x} gate to her state, flipping it and resulting in the state

\displaystyle \sigma_{x,A}|\Psi\rangle_{AB}=\frac{1}{\sqrt{2}}\left(|1\rangle_A|0\rangle_B+|0\rangle_A|1\rangle_B\right). \ \ \ \ \ (15)

Thus her and Bob’s qubits will always be measured to be opposite.

Alice encodes her second bit {\beta} in the phase between the two states in the superposition. If {\beta} is {0} she again does nothing, however if {\beta} is {1} she applies the {\sigma_z} gate to her state, which will result in a minus sign between the two states.

As we mentioned before, even though Alice is applying these operators locally to her state, the system is an entangled bipartite state, and so we can think of her as applying global operators {\left(\sigma_{i,A}\otimes I_B\right)}, Pauli operators tensored with the identity, to the whole system. After Alice’s operations, if {\alpha=0} the global state will be

\displaystyle |\Psi\rangle_{AB}=\frac{1}{\sqrt{2}}\left(|0\rangle_A|0\rangle_B\pm|1\rangle_A|1\rangle_B\right), \ \ \ \ \ (16)

and if {\alpha=1} the global state will be

\displaystyle |\Psi\rangle_{AB}=\frac{1}{\sqrt{2}}\left(|0\rangle_A|1\rangle_B\pm |1\rangle_A|0\rangle_B\right), \ \ \ \ \ (17)

where in both cases the sign is positive if {\beta=0}, and negative if {\beta=1}. Again we note that {\alpha} is encoded in the parity, i.e. whether Alice’s and Bob’s qubits are the same or different, and {\beta} in the relative phase between the two terms of the superposition. This phase is the new degree of freedom which we get from entanglement.

Alice then sends her single qubit to Bob, who now possesses both qubits of the bipartite system. Even though Alice has only transmitted a single qubit, because their states were entangled Bob may recover both of the operations that Alice performed. To do this Bob performs the following steps:

  1. To measure the parity Bob applies the CNOT gate on the system, using Alice’s qubit as the control. If {\alpha=0}, this will send (16) to

    \displaystyle \begin{aligned} CNOT_A|\Psi\rangle_{AB} &=\frac{1}{\sqrt{2}}\left(|0\rangle_A|0\rangle_B\pm|1\rangle_A|0\rangle_B\right), \\ &=\frac{1}{\sqrt{2}}\left(|0\rangle_A\pm|1\rangle_A\right)|0\rangle_B, \end{aligned} \ \ \ \ \ (18)

    and if {\alpha=1} this will send (17) to

    \displaystyle CNOT_A|\Psi\rangle_{AB}=\frac{1}{\sqrt{2}}\left(|0\rangle_A\pm|1\rangle_A\right)|1\rangle_B. \ \ \ \ \ (19)

    Bob could now deterministically read out the value of {\alpha} simply by performing a measurement on his qubit!

  2. To measure the phase, Bob applies the Hadamard gate to Alice’s qubit. Looking at the two equations above, we see that regardless of Bob’s qubit, Alice’s is in the superposition

    \displaystyle \frac{1}{\sqrt{2}}\left(|0\rangle_A\pm|1\rangle_A\right), \ \ \ \ \ (20)

    where the sign is positive if {\beta=0} and negative if {\beta=1}. In the former case the Hadamard gate will send this to {|0\rangle_A}, and in the latter to {|1\rangle_A}.

We can see then that after this protocol, Bob has the state:

\displaystyle |\alpha\beta\rangle. \ \ \ \ \ (21)

He may therefore perform a single measurement on the two qubits he possesses, and in doing so learn the value of both bits {\alpha} and {\beta}! Alice thus used one qubit, and one bit of entanglement, to transmit two bits of information to Bob.
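The whole protocol is easy to simulate with numpy. Below is my own sketch, with Alice’s qubit as the first tensor factor; note that with this ordering the final state comes out as {|\beta\rangle_A|\alpha\rangle_B}, i.e. Bob reads {\alpha} from his own qubit and {\beta} from the one Alice sent, exactly as in steps 1 and 2 above:

```python
import numpy as np

# Gates in the standard matrix forms, Alice's qubit as the first factor
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])                 # sigma_x
Z = np.array([[1, 0], [0, -1]])                # sigma_z
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
CNOT = np.array([[1, 0, 0, 0],                 # control = first (Alice's) qubit
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

ket = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
# The shared entangled state of eq. (14)
bell = (np.kron(ket[0], ket[0]) + np.kron(ket[1], ket[1])) / np.sqrt(2)

results = {}
for alpha in (0, 1):
    for beta in (0, 1):
        psi = bell
        if alpha:                        # encode alpha in the parity
            psi = np.kron(X, I2) @ psi
        if beta:                         # encode beta in the phase
            psi = np.kron(Z, I2) @ psi
        # Bob's decoding: CNOT with Alice's qubit as control, then H on it
        psi = np.kron(H, I2) @ (CNOT @ psi)
        results[(alpha, beta)] = psi

# With Alice's qubit first, the final state is |beta>_A |alpha>_B
for (alpha, beta), psi in results.items():
    assert np.allclose(psi, np.kron(ket[beta], ket[alpha]))
print("all four messages decoded correctly")
```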

4. Discussion

Follow @RLecamwasam on twitter for more posts like this. Questions/comments/criticisms? Feel free to leave a comment, either here or on the Reddit thread:

Superdense coding explained from Physics

u/RRumpleTeazzer pointed out that this protocol still involves the transmission of two qubits. We could imagine this as follows: Alice first prepares the entangled state {|\Psi\rangle_{AB}}, sends one of the qubits to Bob, and then performs the superdense coding protocol on her remaining qubit before sending this to him as well. So really, this is Alice sending two classical bits via two qubits.

What I think still makes this process surprising from a classical point of view is that all of Alice’s encoding happens after Bob already has the first qubit. They begin by sharing the resource of an entangled state, Alice encodes two classical bits on her qubit, and then sends this to Bob who can decode them both. Of course from the quantum point of view this is perfectly natural; since this is a bipartite entangled state, it is better to think of Alice performing operations on the global state {|\Psi\rangle_{AB}}, rather than on ‘her qubit’. As u/RRumpleTeazzer says, ‘delayed choice coding’ is perhaps an equally good name.

u/NidStyles and u/gabeff asked about experimental implementations of superdense coding. The first implementation was in 1996 (see References) and used photons as qubits, where {|0\rangle} and {|1\rangle} were the horizontal and vertical polarisation states {|H\rangle} and {|V\rangle}. The initial superposition was created using a process called ‘spontaneous parametric downconversion’, where a nonlinear crystal creates pairs of photons whose polarisations are entangled with each other:

\displaystyle |\Psi\rangle=\frac{1}{\sqrt{2}}\left(|H\rangle|H\rangle+|V\rangle|V\rangle\right). \ \ \ \ \ (22)

The problem with this experiment however was that Bob could only measure three of Alice’s four possible messages. These four messages were:

\displaystyle |\Psi^+\rangle=\frac{1}{\sqrt{2}}\left(|H\rangle|V\rangle+|V\rangle|H\rangle\right), \ \ \ \ \ (23)

\displaystyle |\Psi^-\rangle=\frac{1}{\sqrt{2}}\left(|H\rangle|V\rangle-|V\rangle|H\rangle\right), \ \ \ \ \ (24)

\displaystyle |\Phi^+\rangle=\frac{1}{\sqrt{2}}\left(|H\rangle|H\rangle+|V\rangle|V\rangle\right), \ \ \ \ \ (25)

\displaystyle |\Phi^-\rangle=\frac{1}{\sqrt{2}}\left(|H\rangle|H\rangle-|V\rangle|V\rangle\right). \ \ \ \ \ (26)

The experimenters interfered these in such a way that you could distinguish states which were symmetric in interchanging the photons from states which were anti-symmetric. We can see above that {|\Psi^-\rangle} is the only anti-symmetric state (if you swap the two photons this is the only one which picks up a minus sign), and so this one could be immediately read out. For the other three, they passed them through a scheme which could determine if the photons had the same or different polarisations. If they were different, this corresponded to {|\Psi^+\rangle}. If they were the same however it could be either of {|\Phi^+\rangle} or {|\Phi^-\rangle}, with no way of distinguishing them further.

These difficulties were resolved in a later experiment in 2008 (again see References). In this, each qubit was composed of two photons rather than one, with the first of each pair entangled in polarisation, and the second in angular momentum. This extra degree of freedom allowed the experimenters to distinguish the four possible messages.

Because of the intricacies of the setups, both of these should be seen as more ‘proof of principle’ than scalable methods for quantum communication.

5. References

John Watrous’s Lecture Notes ‘Introduction to Quantum Computing (Winter 2006)’.
See Lecture 3: ‘Superdense coding; quantum circuits; and partial measurements’ – https://cs.uwaterloo.ca/~watrous/LectureNotes.html.

The Wikipedia page on ‘Superdense coding’: https://en.wikipedia.org/wiki/Superdense_coding

Also check out the original paper:
Bennett, C. H., & Wiesner, S. J. (1992). Communication via one- and two-particle operators on Einstein-Podolsky- Rosen states. Physical Review Letters, 69(20), 2881–2884. http://doi.org/10.1103/PhysRevLett.69.2881

The first experimental implementation was in 1996 using photons as qubits, however in this one Bob could only recover three out of the four possible messages:
Mattle, K., Weinfurter, H., Kwiat, P. G., & Zeilinger, A. (1996). Dense coding in experimental quantum communication. Physical Review Letters, 76(25), 4656–4659. http://doi.org/10.1103/PhysRevLett.76.4656

A newer implementation in 2008 allowed Bob to decode all four messages. This was done by composing each qubit of two photons, rather than one:
Barreiro, J. T., Wei, T. C., & Kwiat, P. G. (2008). Beating the channel capacity limit for linear photonic superdense coding. Nature Physics, 4(4), 282–286. http://doi.org/10.1038/nphys919

LaTeX and document formatting was done via the amazing tool LaTeX to WordPress: