
Entropies

Abstract

Some details about classical and quantum entropy functions

Keywords: Entropy, Relative Entropy, Rényi Entropies.

Broadly, entropy functions describe the level of uncertainty in a probability distribution. Here, we detail the underlying concepts that lead to the construction of entropy functions.

Information gain or Surprisal

Consider a discrete random variable $X$, where one gets the outcome $x$ with probability $p_x$. If there are $n$ possible outcomes, this can be described by the probability distribution

$$\bm{p} = (p_1, p_2, p_3, \ldots, p_n) \quad : \quad p_x \geq 0~\forall~x, \quad \sum_{x=1}^n p_x = 1.$$

If one looks at the outcome of the random variable and sees that it is $y$, then the information they have gained from now knowing this is defined by

$$I(p_y) \coloneqq -\log(p_y).$$

It can be seen that if the outcome $y$ occurs with high probability, then $I(p_y)$ is very small; if the outcome $y$ occurs with low probability, then $I(p_y)$ is very large. That is, we gain lots of information if we are surprised by the outcome, i.e., that event had a low probability of occurring. For this reason, the function $I(\cdot)$ is sometimes referred to as the surprisal.
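As a minimal sketch of this behaviour in code (the helper name surprisal is ours, and the base-two logarithm is chosen here for concreteness), a likely outcome carries little information while a rare one carries a lot:

```python
import numpy as np

def surprisal(p, base=2):
    # Information gained from observing an outcome of probability p,
    # measured in bits when base=2.
    return -np.log(p) / np.log(base)

print(surprisal(0.9))   # ~0.15 bits: a likely outcome is barely informative
print(surprisal(0.01))  # ~6.64 bits: a rare outcome is very informative
```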

Mathematical Justifications for $I(\cdot)$

The mathematical justifications for $I(\cdot)$ being the correct function for quantifying the information gain are:

  1. The information gain should depend only on the probability of a given outcome occurring, not on its label.

    • Given a probability distribution $\bm{p} = (0.75, 0.25)$, it should not matter whether this describes the outcomes of a coin or the color of a ball; the theory of the information gain should be independent of the physical system in which that information is encoded.

  2. The function should be a smooth function of probability.

    • This means there are no jumps in the function and it varies continuously from the maximum information gain to the minimum information gain.

  3. $\bm{I(p_y p_z) = I(p_y) + I(p_z)}$

    • The probability of two independent events $y$ and $z$ occurring is $p_y p_z$. This condition then means that the information gained from the two independent events occurring is the sum of the information gained from the individual events. This can be justified physically: if the events are independent, then learning about one should not tell us about the other. (A numerical check of this additivity is sketched after the figure below.)

Figure: The probability distribution of getting each outcome of a random variable, compared to the information gain (surprisal) of getting each outcome.
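To illustrate the third property above numerically, the following sketch (reusing the illustrative surprisal helper, with arbitrarily chosen probabilities) checks that the surprisal of two independent events occurring together is the sum of their individual surprisals:

```python
import numpy as np

def surprisal(p, base=2):
    # Information gained from an outcome of probability p.
    return -np.log(p) / np.log(base)

p_y, p_z = 0.3, 0.25   # probabilities of two independent events
p_joint = p_y * p_z    # probability that both occur

# The surprisal of the joint event equals the sum of the individual surprisals.
print(np.isclose(surprisal(p_joint), surprisal(p_y) + surprisal(p_z)))  # True
```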

Entropies

The Shannon entropy of the random variable $X$ is then nothing more than the average information gain:

$$H(X) \coloneqq \mathbb{E}_{\bm{p}} \big[ I(p_x) \big] = -\sum_{x=1}^n p_x \log(p_x),$$

where $\mathbb{E}_{\bm{p}}[\cdot]$ is the expectation value of $(\cdot)$ over the probability distribution $\bm{p}$.
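As a sketch of this definition in code (the helper name shannon_entropy is ours, and the base-two logarithm is an assumption), the entropy is simply the probability-weighted sum of the surprisals, with the usual convention that $0 \log 0 = 0$:

```python
import numpy as np

def shannon_entropy(p, base=2):
    # Expected surprisal of the distribution p, with the convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))

print(shannon_entropy([0.5, 0.5]))  # 1.0 bit: a fair coin
```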

The entropy can be thought of as either the average uncertainty in the outcome of the random variable before it becomes known, or the average information gained once the outcome becomes known.

Example

Let the random variable $X$ have a probability distribution

$$\bm{p} = (0.7, 0.2, 0.1).$$

The information gain or surprisal is then

$$-\log_2\big(\bm{p}\big) = (0.51, 2.32, 3.32),$$

where we have chosen to use the logarithm base two.

The entropy is then

$$\begin{aligned} H(X_{\bm{p}}) &= (0.7 \times 0.51) + (0.2 \times 2.32) + (0.1 \times 3.32) \\ &= 1.153. \end{aligned}$$

We now compare this to the extreme distributions

$$\bm{q} = (1, 0, 0), \quad \bm{1} = (1/3, 1/3, 1/3),$$

which are the certain and maximally uncertain distributions respectively. These then have entropy

$$H(X_{\bm{q}}) = 0 \quad \textrm{and} \quad H(X_{\bm{1}}) = \log_2(3).$$

Given $\bm{q}$, there is no uncertainty in the outcome of the random variable; we therefore learn nothing when the outcome becomes known, and the entropy is zero.

Given $\bm{1}$, there is maximum uncertainty in the outcome of the random variable; we therefore gain the maximum amount of information when the outcome becomes known, and the entropy is maximal.

It can then be seen that, given $\bm{p}$, we have

$$H(X_{\bm{q}}) \leq H(X_{\bm{p}}) \leq H(X_{\bm{1}}),$$

meaning that the uncertainty before the outcome of the random variable becomes known, or the information gained after, sits between those of the two extreme distributions.
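The worked example above can be reproduced numerically; in the sketch below the distributions $\bm{p}$, $\bm{q}$ and $\bm{1}$ are as defined above, and the small gap between 1.153 and the value printed for $\bm{p}$ comes only from the rounded surprisals used in the hand calculation:

```python
import numpy as np

def shannon_entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))

p   = [0.7, 0.2, 0.1]    # the example distribution
q   = [1.0, 0.0, 0.0]    # the certain distribution
uni = [1/3, 1/3, 1/3]    # the maximally uncertain distribution

print(shannon_entropy(q))    # 0.0
print(shannon_entropy(p))    # ~1.157 (1.153 above used rounded surprisals)
print(shannon_entropy(uni))  # ~1.585 = log2(3)
```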

Through the noiseless coding theorem, the entropy has an operational interpretation, which gives it a solid grounding in practical reality: given a two-outcome random variable $X$, from which $k$ outputs have been collected, $kH(X)$ bits are needed to store these $k$ outputs (in the limit of large $k$).
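One way to make this counting plausible is sketched below: for a two-outcome source, almost every collected string of $k$ outcomes contains close to $kp$ occurrences of the rarer outcome, and indexing just those strings already needs about $kH(X)$ bits. The particular values of $p$ and $k$ are illustrative assumptions.

```python
import math

p, k = 0.1, 10_000   # illustrative two-outcome source and sample size
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy per outcome

# Strings with exactly round(k*p) rare outcomes number C(k, k*p);
# indexing them needs log2 C(k, k*p) bits, which approaches k*H(X) as k grows.
typical_bits = math.log2(math.comb(k, round(k * p)))

print(H)                 # ~0.469 bits per outcome
print(typical_bits / k)  # ~0.468, approaching H(X) for large k
```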

Generalised Entropies

Given a set of numbers that are distributed via some probability distribution, the mean is not the only function that can be used to aggregate the set of numbers.

Specifically, one can generalise the notion of means using Kolmogorov–Nagumo averages.

Kolmogorov–Nagumo averages

Let $X = \{x_1, x_2, x_3, \ldots, x_n\}$ be a set of $n$ different numbers, and let $f(\cdot)$ be a continuous and injective (one-to-one) function.

The $f$-mean of $X$ is

$$M_f(X) = f^{-1} \biggl( \frac{1}{n} \sum_{i=1}^n f(x_i) \biggr).$$

If $f(x) = ax + b$ for some constants $a \neq 0$ and $b$, this reduces to the arithmetic mean.
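A short sketch of the $f$-mean in code (the function name f_mean and the example values are ours): an affine $f$ recovers the ordinary arithmetic mean, while other choices of $f$ give other familiar means.

```python
import numpy as np

def f_mean(xs, f, f_inv):
    # Kolmogorov-Nagumo mean: apply f, take the ordinary average, then invert f.
    xs = np.asarray(xs, dtype=float)
    return f_inv(np.mean(f(xs)))

xs = [1.0, 2.0, 4.0]

# An affine f recovers the arithmetic mean (7/3 here) ...
print(f_mean(xs, lambda x: 3 * x + 1, lambda y: (y - 1) / 3))  # ~2.333
# ... while, e.g., f = log gives the geometric mean.
print(f_mean(xs, np.log, np.exp))                              # 2.0
```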

Using the Kolmogorov–Nagumo averages, a general entropy function can then be defined as

$$H_{g}(X) = f^{-1} \bigg( \sum_{x=1}^n p_x f\big(I(p_x)\big) \bigg),$$

where each element is weighted by its probability of occurring, rather than equally as in the definition of the Kolmogorov–Nagumo averages given above.

Using this, it can be seen that the Shannon entropy is a special case of the $f$-mean of the information gain of a probability distribution, where $f(x) = ax + b$.
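The following sketch (with illustrative names and an arbitrary affine $f$) checks this: the probability-weighted $f$-mean of the surprisals with $f(x) = ax + b$ returns exactly the Shannon entropy.

```python
import numpy as np

def weighted_f_mean(values, weights, f, f_inv):
    # Kolmogorov-Nagumo average with probability weights instead of 1/n.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return f_inv(np.sum(weights * f(values)))

p = np.array([0.7, 0.2, 0.1])
surprisals = -np.log2(p)

# With an affine f, the weighted f-mean of the surprisals is the Shannon entropy.
a, b = 3.0, 1.0
H_g = weighted_f_mean(surprisals, p, lambda x: a * x + b, lambda y: (y - b) / a)

print(H_g)                      # ~1.157
print(-(p * np.log2(p)).sum())  # ~1.157, i.e. H(X)
```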

$\alpha$-Rényi Entropies

Rényi looked for functions $f(\cdot)$ that made $H_g(X)$ additive for two independent random variables, as this was a desirable feature of the Shannon entropy. In addition to $f(x) = ax + b$, it was found that $f(x) = c e^{(\alpha - 1)x}$ also makes $H_g(X)$ additive, where $\alpha$ is a parameter and $c$ a constant.

Using this for $c = 1$ gives the so-called $\alpha$-Rényi Entropies:

$$H_{\alpha}(X) = \frac{1}{1-\alpha} \log\bigg( \sum_{x=1}^n p_x^{\alpha} \bigg).$$

For $\alpha = 0, 1, \infty$, it is defined via the limits towards these values.

Just as different means inform us about different features of a set of numbers, the different Rényi entropies inform us about different features of the information gain of a distribution.

It can be shown that

$$\lim_{\alpha \rightarrow 1} H_{\alpha}(X) = H(X),$$

meaning that in the limit of $\alpha$ tending toward one, the Rényi entropy becomes the Shannon entropy.
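A small sketch of the Rényi entropies in code (the helper name renyi_entropy and the example distribution are ours) shows the values approaching the Shannon entropy as $\alpha$ tends to one:

```python
import numpy as np

def renyi_entropy(p, alpha, base=2):
    # alpha-Renyi entropy of a distribution p, for alpha > 0 with alpha != 1.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.log((p ** alpha).sum()) / ((1 - alpha) * np.log(base)))

p = [0.7, 0.2, 0.1]

for alpha in (0.5, 0.99, 0.999, 2.0):
    print(alpha, renyi_entropy(p, alpha))
# The values for alpha near 1 approach the Shannon entropy, ~1.157 bits here.
```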