
Entropies

Abstract

Some details about classical and quantum entropy functions

Keywords: Entropy, Relative Entropy, Rényi Entropies.

Broadly, entropy functions describe the level of uncertainty in a probability distribution. Here, we detail the underlying concepts that lead to the construction of entropy functions.

Information gain or Surprisal

Consider a discrete random variable $X$, where one gets the outcome $x$ with probability $p_x$. If there are $n$ possible outcomes, this can be described by the probability distribution

$$\bm{p} = (p_1, p_2, p_3, \ldots, p_n) \quad : \quad p_x \geq 0~\forall~x, \quad \sum_{x=1}^n p_x = 1.$$

If one looks at the outcome of the random variable and sees that it is $y$, then the information they have gained from now knowing this is defined by

$$I(p_y) \coloneqq -\log(p_y).$$

It can be seen that if the outcome $y$ occurs with high probability, then $I(p_y)$ is very small; if the outcome $y$ occurs with low probability, then $I(p_y)$ is very large. That is, we gain lots of information if we are surprised by the outcome, i.e., that event had a low probability of occurring. For this reason, the function $I(\cdot)$ is sometimes referred to as the surprisal.
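As a minimal sketch of this behaviour in code (the helper name surprisal is ours, and the base-two logarithm is chosen here for concreteness), a likely outcome carries little information while a rare one carries a lot:

```python
import numpy as np

def surprisal(p, base=2):
    # Information gained from observing an outcome of probability p,
    # measured in bits when base=2.
    return -np.log(p) / np.log(base)

print(surprisal(0.9))   # ~0.15 bits: a likely outcome is barely informative
print(surprisal(0.01))  # ~6.64 bits: a rare outcome is very informative
```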

Mathematical Justifications for $I(\cdot)$

The mathematical justifications for $I(\cdot)$ being the correct function for quantifying the information gain are:

  1. The information gain should depend only on the probability of a given outcome occurring, not on its label.

    • Given a probability distribution $\bm{p} = (0.75, 0.25)$, it should not matter whether this describes the outcomes of a coin or the color of a ball; the theory of the information gain should be independent of the physical system in which that information is encoded.

  2. The function should be a smooth function of probability.

    • This means there are no jumps in the function and it varies continuously from the maximum information gain to the minimum information gain.

  3. $\bm{I(p_y p_z) = I(p_y) + I(p_z)}$

    • The probability of two independent events $y$ and $z$ occurring is $p_y p_z$. This condition then means that the information gained from the two independent events occurring is the sum of the information gained from the individual events. This can be justified physically: if the events are independent, then learning about one should not tell us about the other. (A numerical check of this additivity is sketched after the figure below.)

Figure: The probability distribution of getting each outcome of a random variable, compared to the information gain (surprisal) of getting each outcome.
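To illustrate the third property above numerically, the following sketch (reusing the illustrative surprisal helper, with arbitrarily chosen probabilities) checks that the surprisal of two independent events occurring together is the sum of their individual surprisals:

```python
import numpy as np

def surprisal(p, base=2):
    # Information gained from an outcome of probability p.
    return -np.log(p) / np.log(base)

p_y, p_z = 0.3, 0.25   # probabilities of two independent events
p_joint = p_y * p_z    # probability that both occur

# The surprisal of the joint event equals the sum of the individual surprisals.
print(np.isclose(surprisal(p_joint), surprisal(p_y) + surprisal(p_z)))  # True
```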

Entropies

The Shannon entropy of the random variable $X$ is then nothing more than the average information gain:

$$H(X) \coloneqq \mathbb{E}_{\bm{p}} \big[ I(p_x) \big] = -\sum_{x=1}^n p_x \log(p_x),$$

where $\mathbb{E}_{\bm{p}}[\cdot]$ is the expectation value of $(\cdot)$ over the probability distribution $\bm{p}$.
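As a sketch of this definition in code (the helper name shannon_entropy is ours, and the base-two logarithm is an assumption), the entropy is simply the probability-weighted sum of the surprisals, with the usual convention that $0 \log 0 = 0$:

```python
import numpy as np

def shannon_entropy(p, base=2):
    # Expected surprisal of the distribution p, with the convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))

print(shannon_entropy([0.5, 0.5]))  # 1.0 bit: a fair coin
```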

The entropy can be thought of as either the average uncertainty in the outcome of the random variable before it becomes known, or the average information gained once the outcome becomes known.

Example

Let the random variable $X$ have a probability distribution

$$\bm{p} = (0.7, 0.2, 0.1).$$

The information gain or surprisal is then

$$-\log_2\big(\bm{p}\big) = (0.51, 2.32, 3.32),$$

where we have chosen to use the logarithm base two.

The entropy is then

$$\begin{aligned} H(X_{\bm{p}}) &= (0.7 \times 0.51) + (0.2 \times 2.32) + (0.1 \times 3.32) \\ &= 1.153. \end{aligned}$$

We now compare this to the extreme distributions

$$\bm{q} = (1, 0, 0), \quad \bm{1} = (1/3, 1/3, 1/3),$$

which are the certain and maximally uncertain distributions respectively. These then have entropy

$$H(X_{\bm{q}}) = 0 \quad \textrm{and} \quad H(X_{\bm{1}}) = \log_2(3).$$

Given $\bm{q}$, there is no uncertainty in the outcome of the random variable; we therefore learn nothing when the outcome becomes known, and the entropy is zero.

Given $\bm{1}$, there is maximum uncertainty in the outcome of the random variable; we therefore gain the maximum amount of information when the outcome becomes known, and the entropy is maximal.

It can then be seen that, given $\bm{p}$, we have

$$H(X_{\bm{q}}) \leq H(X_{\bm{p}}) \leq H(X_{\bm{1}}),$$

meaning that the uncertainty before the outcome of the random variable becomes known, or the information gained after, sits between those of the two extreme distributions.
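The worked example above can be reproduced numerically; in the sketch below the distributions $\bm{p}$, $\bm{q}$ and $\bm{1}$ are as defined above, and the small gap between 1.153 and the value printed for $\bm{p}$ comes only from the rounded surprisals used in the hand calculation:

```python
import numpy as np

def shannon_entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))

p   = [0.7, 0.2, 0.1]    # the example distribution
q   = [1.0, 0.0, 0.0]    # the certain distribution
uni = [1/3, 1/3, 1/3]    # the maximally uncertain distribution

print(shannon_entropy(q))    # 0.0
print(shannon_entropy(p))    # ~1.157 (1.153 above used rounded surprisals)
print(shannon_entropy(uni))  # ~1.585 = log2(3)
```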

Through the noiseless coding theorem, the entropy has an operational interpretation, which gives it a solid grounding in practical reality: given a two-outcome random variable $X$, from which $k$ outputs have been collected, $kH(X)$ bits are needed to store these $k$ outputs (in the limit of large $k$).
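One way to make this counting plausible is sketched below: for a two-outcome source, almost every collected string of $k$ outcomes contains close to $kp$ occurrences of the rarer outcome, and indexing just those strings already needs about $kH(X)$ bits. The particular values of $p$ and $k$ are illustrative assumptions.

```python
import math

p, k = 0.1, 10_000   # illustrative two-outcome source and sample size
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy per outcome

# Strings with exactly round(k*p) rare outcomes number C(k, k*p);
# indexing them needs log2 C(k, k*p) bits, which approaches k*H(X) as k grows.
typical_bits = math.log2(math.comb(k, round(k * p)))

print(H)                 # ~0.469 bits per outcome
print(typical_bits / k)  # ~0.468, approaching H(X) for large k
```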

Generalised Entropies

Given a set of numbers that are distributed via some probability distribution, the mean is not the only function that can be used to aggregate the set of numbers.

Specifically, one can generalise the notion of means using Kolmogorov–Nagumo averages.

Kolmogorov–Nagumo averages

Let $X = \{x_1, x_2, x_3, \ldots, x_n\}$ be a set of $n$ different numbers, and let $f(\cdot)$ be a continuous and injective (one-to-one) function.

The $f$-mean of $X$ is

$$M_f(X) = f^{-1} \biggl( \frac{1}{n} \sum_{i=1}^n f(x_i) \biggr).$$

If $f(x) = ax + b$ for some constants $a \neq 0$ and $b$, this reduces to the arithmetic mean.
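A short sketch of the $f$-mean in code (the function name f_mean and the example values are ours): an affine $f$ recovers the ordinary arithmetic mean, while other choices of $f$ give other familiar means.

```python
import numpy as np

def f_mean(xs, f, f_inv):
    # Kolmogorov-Nagumo mean: apply f, take the ordinary average, then invert f.
    xs = np.asarray(xs, dtype=float)
    return f_inv(np.mean(f(xs)))

xs = [1.0, 2.0, 4.0]

# An affine f recovers the arithmetic mean (7/3 here) ...
print(f_mean(xs, lambda x: 3 * x + 1, lambda y: (y - 1) / 3))  # ~2.333
# ... while, e.g., f = log gives the geometric mean.
print(f_mean(xs, np.log, np.exp))                              # 2.0
```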

Using the Kolmogorov–Nagumo averages, a general entropy function can then be defined as

$$H_{g}(X) = f^{-1} \bigg( \sum_{x=1}^n p_x f\big(I(p_x)\big) \bigg),$$

where each element is weighted by its probability of occurring, rather than equally as in the definition of the Kolmogorov–Nagumo averages given above.

Using this, it can be seen that the Shannon entropy is a special case of the $f$-mean of the information gain of a probability distribution, where $f(x) = ax + b$.
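The following sketch (with illustrative names and an arbitrary affine $f$) checks this: the probability-weighted $f$-mean of the surprisals with $f(x) = ax + b$ returns exactly the Shannon entropy.

```python
import numpy as np

def weighted_f_mean(values, weights, f, f_inv):
    # Kolmogorov-Nagumo average with probability weights instead of 1/n.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return f_inv(np.sum(weights * f(values)))

p = np.array([0.7, 0.2, 0.1])
surprisals = -np.log2(p)

# With an affine f, the weighted f-mean of the surprisals is the Shannon entropy.
a, b = 3.0, 1.0
H_g = weighted_f_mean(surprisals, p, lambda x: a * x + b, lambda y: (y - b) / a)

print(H_g)                      # ~1.157
print(-(p * np.log2(p)).sum())  # ~1.157, i.e. H(X)
```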

$\alpha$-Rényi Entropies

Rényi looked for functions $f(\cdot)$ that made $H_g(X)$ additive for two independent random variables, as this was a desirable feature of the Shannon entropy. In addition to $f(x) = ax + b$, it was found that $f(x) = c e^{(\alpha - 1)x}$ also makes $H_g(X)$ additive, where $\alpha$ is a parameter and $c$ a constant.

Using this for $c = 1$ gives the so-called $\alpha$-Rényi Entropies:

$$H_{\alpha}(X) = \frac{1}{1-\alpha} \log\bigg( \sum_{x=1}^n p_x^{\alpha} \bigg).$$

For $\alpha = 0, 1, \infty$, it is defined via the limits towards these values.

Just as different means inform us about different features of a set of numbers, the different Rényi entropies inform us about different features of the information gain of a distribution.

It can be shown that

$$\lim_{\alpha \rightarrow 1} H_{\alpha}(X) = H(X),$$

meaning that in the limit of $\alpha$ tending toward one, the Rényi entropy becomes the Shannon entropy.
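A small sketch of the Rényi entropies in code (the helper name renyi_entropy and the example distribution are ours) shows the values approaching the Shannon entropy as $\alpha$ tends to one:

```python
import numpy as np

def renyi_entropy(p, alpha, base=2):
    # alpha-Renyi entropy of a distribution p, for alpha > 0 with alpha != 1.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.log((p ** alpha).sum()) / ((1 - alpha) * np.log(base)))

p = [0.7, 0.2, 0.1]

for alpha in (0.5, 0.99, 0.999, 2.0):
    print(alpha, renyi_entropy(p, alpha))
# The values for alpha near 1 approach the Shannon entropy, ~1.157 bits here.
```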