## Information theory background I

This is the first in a series of two posts whose goal is to provide background on information theory. The itinerary for today is as follows: we start by introducing the basic information quantities, and then prove *tensorization* statements for them, i.e. that they behave well under product measures. Tensorization, in turn, will allow us to prove a much stronger counterpart of McDiarmid's inequality in the next post.

### Basic information quantities and relations

We start with the usual *entropy* of a discrete random variable $X \in \mathcal{X} = \{x_i\}_{i \in \mathcal{I}}$ with distribution $p_i = P(X = x_i),$ defined as $\displaystyle H(X) = \mathbb{E}_{P} \log \left(\frac{1}{P(X)}\right) = -\sum_{i \in \mathcal{I}} p_i \log p_i.$
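As a quick numerical illustration, here is a minimal sketch of this formula (the helper name `entropy` is my own; natural logarithms are used, so entropy is measured in nats):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_i p_i log p_i, in nats.
    Zero-probability terms contribute 0, by the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A fair coin has entropy log 2 ≈ 0.693 nats; uniform over 4 outcomes, log 4.
print(entropy([0.5, 0.5]), entropy([0.25] * 4))
```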

Note that $H(X)$  depends only on $(p_i)$  and characterizes the distribution rather than the random variable $X$  itself. The entropy of the ‘composite’ random variable $(X, Y)$  with joint distribution $\left(\pi_{ij}\right)$  is called the  joint entropy  of $X, Y:$ $\displaystyle H(X,Y) = -\sum_{(i,j) \in \mathcal{I} \times \mathcal{J}} \pi_{ij} \log \pi_{ij},$

and  conditional entropy $\displaystyle H(Y|X) = H(X,Y)-H(X) = -\sum_{(i,j) \in \mathcal{I} \times \mathcal{J}} \pi_{ij} \log \pi_{j|i},$

where $\pi_{j|i} = \frac{\pi_{ij}}{p_i}$ is the conditional probability that $Y=y_j$ given $X = x_i.$ That is, we need $H(X,Y)$ bits to record $(X,Y),$ but it can also be done in an alternative way: by coding $X$ first, and then coding $Y$ separately for each value of $X.$ Another useful quantity is the *mutual information* between $X$ and $Y,$ $\displaystyle I(X,Y)=H(Y)-H(Y|X)=H(X)-H(X|Y) \ge 0,$
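These identities are easy to verify numerically from a joint distribution table; a small sketch (the array `pi` and the helper `H` are my own illustration, in nats):

```python
import numpy as np

def H(p):
    """Shannon entropy of a probability table (in nats)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A joint distribution pi[i, j] = P(X = x_i, Y = y_j).
pi = np.array([[0.3, 0.2],
               [0.1, 0.4]])
pX, pY = pi.sum(axis=1), pi.sum(axis=0)

H_XY = H(pi)                  # joint entropy H(X, Y)
H_Y_given_X = H_XY - H(pX)    # chain rule: H(Y|X) = H(X, Y) - H(X)
I_XY = H(pY) - H_Y_given_X    # mutual information I(X, Y)

# Mutual information is symmetric and non-negative.
assert abs(I_XY - (H(pX) - (H_XY - H(pY)))) < 1e-12
assert I_XY >= 0
```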

i.e. conditioning may only reduce the entropy. A Venn diagram (such as the one on the Wikipedia page for mutual information) depicts the relations between all these quantities.

### Tensorization

In this part of the post, we will formulate tensorization inequalities for different notions of entropy. The first of them is for the usual entropy of a discrete probability measure.

### Han’s inequality

For any random variables $X_1,\, ...,\, X_n,$ $\displaystyle H(X_1,\, ...,\, X_n) \le \frac{1}{n-1} \sum_{i=1}^n H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n).$

Remark. Equality is attained for independent $(X_i)_{i=1}^n.$

Proof. By the chain rule and the fact that conditioning may only reduce the entropy, $H(X_1,\, ...,\, X_n)= H(X_1,\, ..., X_{i-1}, X_{i+1},\, ...,\, X_n)\, +\, H(X_i\,|\,X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n)$ $\le H(X_1,\, ..., X_{i-1}, X_{i+1},\, ...,\, X_n)\,+\,H(X_i\,|\,X_1,\, ...,\, X_{i-1}).$

Summing over $i$ and using the chain rule once more, $\sum_{i=1}^n H(X_i\,|\,X_1,\, ...,\, X_{i-1}) = H(X_1,\, ...,\, X_n),$ so that $n\, H(X_1,\, ...,\, X_n) \le \sum_{i=1}^n H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n) + H(X_1,\, ...,\, X_n),$ which is the claim. ♦
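Han's inequality can be sanity-checked numerically, e.g. on a random joint distribution of three binary variables; a sketch (all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def H(q):
    """Shannon entropy of a (possibly multi-dimensional) probability table."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)))

n = 3
# A random joint distribution of three binary variables X1, X2, X3.
p = rng.random((2,) * n)
p /= p.sum()

# Right-hand side of Han's inequality: entropies of the (n-1)-variable
# marginals, obtained by summing out one coordinate at a time.
marginal_sum = sum(H(p.sum(axis=i)) for i in range(n))
assert H(p) <= marginal_sum / (n - 1) + 1e-12
```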

We now generalize the notion of entropy of a probability measure.

### Definition

The *relative entropy* of a probability measure $\mu$ with respect to a dominating measure $\nu \gg \mu$ (that is, $\nu(A) = 0 \Rightarrow \mu(A) = 0$), where $\nu$ is not necessarily normalized to 1, is $\displaystyle {D}(\mu\|\nu) = \int \text{d}\mu \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right) \qquad (0)$

(note that since $\nu$ may not be a probability measure, this quantity might be negative). In particular, taking $\nu$ to be the counting/Lebesgue measure on $\mathcal{X}$ recovers the usual discrete/differential entropy of $X \in \mathcal{X}:$ $\displaystyle H(X) = -D\left(\mu_X \| \nu_{\mathcal{X}}\right), \qquad (1)$

and the conditional entropy as $\displaystyle H(Y|X) = -D\left(\mu_{X,Y}\|\mu_X\times\nu_{\mathcal{Y}}\right). \qquad (2)$

With this notation, we see that the identity $H(X, Y) = H(X) + H(Y|X)$

holds solely due to the fact that both the Lebesgue and the counting measure are product measures, i.e. $\displaystyle \nu_{\mathcal{X}\times\mathcal{Y}} = \nu_{\mathcal{X}} \times \nu_{\mathcal{Y}}.$ Given any probability measure $\mu$ dominated by $\nu,$ we may now define the entropy of $\mu$ relative to $\nu$ by (1), and the conditional entropy by (2), thus retaining the chain rule. Besides, $\displaystyle \int_{\mathcal{X}\times\mathcal{Y}} \text{d}\mu_{X,Y} = 1$ yields $\displaystyle H(X) - H(X|Y) = \int_{\mathcal{X}\times\mathcal{Y}}\text{d}\mu_{X,Y} \log\left(\frac{\text{d}\mu_{X,Y}}{\text{d}\mu_X \,\text{d}\mu_Y}\right) = D(\mu_{X,Y}\|\mu_X \times \mu_Y) \ge 0,$

since $\displaystyle \mu_X \times \mu_Y$  is a probability measure as is $\displaystyle\mu_{X,Y}.$  As a result, we may repeat the same chaining argument as in the proof of Han’s inequality to obtain
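The identity $I(X,Y) = D(\mu_{X,Y}\|\mu_X \times \mu_Y) \ge 0$ can be checked numerically for a discrete joint distribution; a sketch (the helper `kl` is my own):

```python
import numpy as np

def kl(mu, nu):
    """Relative entropy D(mu || nu) for distributions on a finite set."""
    mu = np.asarray(mu, dtype=float).ravel()
    nu = np.asarray(nu, dtype=float).ravel()
    m = mu > 0
    return float(np.sum(mu[m] * np.log(mu[m] / nu[m])))

pi = np.array([[0.3, 0.2],
               [0.1, 0.4]])
pX, pY = pi.sum(axis=1), pi.sum(axis=0)

# D(mu_{X,Y} || mu_X x mu_Y) is the mutual information, hence >= 0.
D = kl(pi, np.outer(pX, pY))
assert D >= 0
```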

### Theorem (Han’s inequality for relative entropies)

Given any probability measure $Q$ on the product space $\mathcal{X} = \mathcal{X}_1 \times ... \times \mathcal{X}_n$ and a dominating product measure $P \gg Q,$ it holds that $\displaystyle D(Q\|P) \ge \frac{1}{n-1} \sum_{i=1}^n D(Q^{(i)}\|P^{(i)}),$

where $P^{(i)}, Q^{(i)}$ denote the marginals of $P, Q$ with the $i$-th coordinate integrated out, i.e. marginalized over $\mathcal{X}_i.$
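A numerical sanity check of this inequality, with $Q$ a random distribution on $\{0,1\}^3$ and $P$ a random product measure (all names are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(mu, nu):
    """Relative entropy D(mu || nu) on a finite set (nu dominates mu)."""
    mu = np.asarray(mu, dtype=float).ravel()
    nu = np.asarray(nu, dtype=float).ravel()
    m = mu > 0
    return float(np.sum(mu[m] * np.log(mu[m] / nu[m])))

n = 3
Q = rng.random((2,) * n)
Q /= Q.sum()

# A dominating product measure P = P1 x P2 x P3 with full support.
P1, P2, P3 = (rng.random(2) for _ in range(n))
P1, P2, P3 = P1 / P1.sum(), P2 / P2.sum(), P3 / P3.sum()
P = P1[:, None, None] * P2[None, :, None] * P3[None, None, :]

# Q^{(i)}, P^{(i)}: marginals with the i-th coordinate integrated out.
rhs = sum(kl(Q.sum(axis=i), P.sum(axis=i)) for i in range(n)) / (n - 1)
assert kl(Q, P) + 1e-12 >= rhs
```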

Let us return to definition (0) and now suppose that $\nu$ is normalized (i.e. a probability measure), which means that $\displaystyle {D}(\mu\|\nu) = \int \text{d}\mu \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right) =\int \text{d}\nu\, \frac{\text{d}\mu}{\text{d}\nu} \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right) = \mathbb{E}_\nu \, [Z \log Z]$

for the random variable $\displaystyle Z = \frac{\text{d}\mu}{\text{d}\nu}.$ Note that if $\mu$ is also a probability measure, then $Z$ is in fact a likelihood ratio with unit expectation, hence $\displaystyle {D}(\mu\|\nu) =\mathbb{E}_\nu \, [Z \log Z] \ge \mathbb{E}_\nu [Z] \log \mathbb{E}_\nu [Z] = 0$

by Jensen's inequality, due to the convexity of $\psi(z) = z \log z.$ If $\mu$ is not normalized to one, we still have $\displaystyle \mathbb{E}_\nu \, [Z \log Z] - \mathbb{E}_\nu [Z] \log \mathbb{E}_\nu [Z] \ge 0,$ which motivates the following

### Definition

The $\psi$-entropy $\text{Ent}[Z]$  of a non-negative random variable (‘improper likelihood ratio’) $Z$  is $\text{Ent}[Z]=\mathbb{E} \, [Z \log Z] - \mathbb{E} [Z] \log \mathbb{E}[Z].$

Remark 1. Given a dominating measure $\nu,$ any non-negative random variable may be regarded as an ‘improper likelihood ratio’ $\displaystyle {\text{d}\mu / \text{d}\nu},$ where $\mu$ need not be normalized to 1; hence the definition.

Remark 2. Although not literally a norm, $\text{Ent}[Z]$ captures the variation of the random variable $Z$ with respect to the common dominating measure. Indeed, as we have already demonstrated, $\text{Ent}[Z] \ge 0,$ with equality if and only if $Z$ is a fixed constant (almost surely). Second, $\displaystyle \text{Ent}[Z] = M \cdot D(\mu_1 \| \nu),$

where $\displaystyle M = \mathbb{E}[Z] = \int \text{d}\mu,$  and $\mu_1 = \mu/M.$  As a result, $\displaystyle \text{Ent}[\alpha Z] = \alpha\, \text{Ent}[Z].$

Finally, convexity of $\displaystyle \text{Ent}[Z]$  is also easy to check.
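The stated properties, non-negativity, homogeneity, and vanishing on constants, can be verified numerically for a finitely supported $Z$; a sketch (the helper `ent` is my own):

```python
import numpy as np

def ent(z, w):
    """psi-entropy Ent[Z] = E[Z log Z] - E[Z] log E[Z] of a random
    variable taking values z with probabilities w."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    m = z > 0
    Ez = float(np.sum(w * z))
    return float(np.sum(w[m] * z[m] * np.log(z[m])) - Ez * np.log(Ez))

z = np.array([0.5, 1.0, 2.5])
w = np.array([0.2, 0.5, 0.3])

assert ent(z, w) >= 0                               # non-negativity
assert abs(ent(3 * z, w) - 3 * ent(z, w)) < 1e-12   # Ent[aZ] = a Ent[Z]
assert ent(np.ones(3), w) == 0.0                    # constants: Ent = 0
```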

We are now ready to state the tensorization result. Consider $\displaystyle \text{Ent}_i\, f(x_1,\, ...,\, x_n) := \text{Ent}[f(x_1,\, ...,\, x_{i-1},\, X_i,\, x_{i+1},\, ...,\, x_n)],$ the $\psi$-entropy taken with respect to the $i$-th coordinate alone, with the remaining coordinates held fixed.

### Theorem (Tensorization of $\psi$-entropy)

For any non-negative function $f$ of independent $X_1,\, ...,\, X_n$ it holds that $\displaystyle \text{Ent}\left[f(X_1,\, ...,\, X_n)\right] \le \mathbb{E}\left[\sum_{i=1}^n\text{Ent}_i\, f(X_1,\, ...,\, X_n) \right],$ where the outer expectation is over the coordinates held fixed in each $\text{Ent}_i.$

Proof. First, without loss of generality, assume that $\mathbb{E} f = 1,$ i.e. $\displaystyle f = \frac{\text{d}Q}{\text{d}P}$

for some probability measure $Q \ll P,$ where $P = P_1 \times ... \times P_n$ is a product probability measure since $X_1,\, ...,\, X_n$ are independent. Rearranging Han's inequality for relative entropies, we have $\displaystyle \text{Ent}\left[f(X_1,\, ...,\, X_n)\right] = D(Q\|P) \le \sum_{i=1}^n \left[D(Q\|P)-D(Q^{(i)}\|P^{(i)})\right].$

On the other hand, a direct computation shows that $\mathbb{E} \left[\text{Ent}_i\, f(X_1,\, ...,\, X_n)\right] = D(Q\|P)-D(Q^{(i)}\|P^{(i)}).$

The general statement for $\mathbb{E}f \ne 1$  follows by scaling due to Remark 2.  ♦ 
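The tensorization inequality can also be verified numerically, e.g. for a positive function of three independent uniform bits; a sketch under these assumptions (all names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

def ent(z, w):
    """psi-entropy Ent[Z] = E[Z log Z] - E[Z] log E[Z] of a random
    variable taking values z with probabilities w."""
    z = np.asarray(z, dtype=float).ravel()
    w = np.asarray(w, dtype=float).ravel()
    m = z > 0
    Ez = float(np.sum(w * z))
    return float(np.sum(w[m] * z[m] * np.log(z[m])) - Ez * np.log(Ez))

n, k = 3, 2                      # three independent uniform bits
f = rng.random((k,) * n) + 0.1   # a positive function f(X1, X2, X3)

lhs = ent(f, np.full(f.size, 1.0 / f.size))   # Ent[f(X1, X2, X3)]

# E[sum_i Ent_i f]: for each i, the psi-entropy over coordinate i with
# the remaining coordinates fixed, averaged over those fixed values.
rhs = 0.0
for i in range(n):
    rows = np.moveaxis(f, i, -1).reshape(-1, k)   # one row per fixed x_{-i}
    for row in rows:
        rhs += ent(row, np.full(k, 1.0 / k)) / rows.shape[0]
assert lhs <= rhs + 1e-12
```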