## Information theory background I

This is the first in a series of two posts aimed at providing background on information theory. Today's itinerary is as follows: we start by introducing the basic information quantities, and then prove  tensorization  statements for them, i.e. that they behave well under product measures. Tensorization, in turn, will allow us to prove a much stronger counterpart of McDiarmid's inequality in the next post.

### Basic information quantities and relations

We start with the usual  entropy  of a discrete random variable  $X \in \mathcal{X} = \{x_i\}_{i \in \mathcal{I}}$  with distribution  $p_i = P(X = x_i),$  defined as

$\displaystyle H(X) = \mathbb{E}_{P} \log \left(\frac{1}{P(X)}\right) = -\sum_{i \in \mathcal{I}} p_i \log p_i.$
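As a quick sanity check, the definition can be computed directly; here is a minimal Python sketch (the helper name `entropy` is ours):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_i p_i log p_i in nats, with 0 log 0 := 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A fair coin carries log 2 nats of entropy.
print(entropy([0.5, 0.5]))  # ≈ 0.6931
```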

Note that  $H(X)$  depends only on  $(p_i)$  and characterizes the distribution rather than the random variable  $X$  itself. The entropy of the ‘composite’ random variable  $(X, Y)$  with joint distribution  $\left(\pi_{ij}\right)$  is called the  joint entropy  of $X, Y:$

$\displaystyle H(X,Y) = -\sum_{(i,j) \in \mathcal{I} \times \mathcal{J}} \pi_{ij} \log \pi_{ij},$

and  conditional entropy

$\displaystyle H(Y|X) = H(X,Y)-H(X) = -\sum_{(i,j) \in \mathcal{I} \times \mathcal{J}} \pi_{ij} \log \pi_{j|i},$

where  $\pi_{j|i} = \frac{\pi_{ij}}{p_i}$  is the conditional probability that  $Y=y_j$  given  $X = x_i.$  That is, recording  $(X,Y)$  takes  $H(X,Y)$  bits, but it may also be done in an alternative way: code  $X$  first, and then code  $Y$  separately for each value of  $X.$  Another useful quantity is the  mutual information  between  $X$  and  $Y,$
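The chain rule $H(X,Y) = H(X) + H(Y|X)$ can be verified numerically; a sketch with a hypothetical $2\times 2$ joint distribution (the numbers and helper names are ours):

```python
import math

def H(probs):
    """Entropy of a distribution given as an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# hypothetical joint distribution pi_ij of (X, Y) on {0,1} x {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}  # marginal of X: p_i = sum_j pi_ij

# H(Y|X) via the chain rule ...
chain = H(joint.values()) - H(p_x.values())
# ... and directly as -sum_ij pi_ij log pi_{j|i}
direct = -sum(p * math.log(p / p_x[i]) for (i, _), p in joint.items())

print(abs(chain - direct) < 1e-12)  # True
```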

$\displaystyle I(X,Y)=H(Y)-H(Y|X)=H(X)-H(X|Y) \ge 0,$

i.e.  conditioning may only reduce the entropy. A Venn diagram (such as the one in Wikipedia's article on mutual information) depicts the relations between all these quantities.
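Both expressions for the mutual information, and its non-negativity, can be checked on the same kind of toy example (a hypothetical joint distribution; all numbers are ours):

```python
import math

def H(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}  # hypothetical pi_ij
p_x = [0.5, 0.5]  # marginals of the joint above
p_y = [0.6, 0.4]

H_xy = H(joint.values())
I_via_y = H(p_y) - (H_xy - H(p_x))  # H(Y) - H(Y|X)
I_via_x = H(p_x) - (H_xy - H(p_y))  # H(X) - H(X|Y)

print(abs(I_via_y - I_via_x) < 1e-12, I_via_y >= 0)  # True True
```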

### Tensorization

In this part of the post, we will formulate tensorization inequalities for different notions of entropy. The first of them is for the usual entropy of a discrete probability measure.

**Theorem (Han’s inequality)**

For any random variables $X_1,\, ...,\, X_n,$

$\displaystyle H(X_1,\, ...,\, X_n) \le \frac{1}{n-1} \sum_{i=1}^n H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n).$

Remark.  Equality is attained for independent  $(X_i)_{i=1}^n.$

Proof.  By the chain rule, for each $i$

$\displaystyle H(X_1,\, ...,\, X_n)= H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n)\, +\, H(X_i\,|\,X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n)$

$\displaystyle \le H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n)\,+\,H(X_i\,|\,X_1,\, ...,\, X_{i-1}),$

since conditioning may only reduce the entropy. Summing over  $i$  and using the chain rule once more, $\sum_{i=1}^n H(X_i\,|\,X_1,\, ...,\, X_{i-1}) = H(X_1,\, ...,\, X_n),$  we obtain

$\displaystyle n\, H(X_1,\, ...,\, X_n) \le \sum_{i=1}^n H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n) + H(X_1,\, ...,\, X_n),$

which rearranges to the claim. ♦
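Han's inequality is easy to confirm numerically; a sketch with a hypothetical joint distribution of three binary variables (all numbers are ours):

```python
import math
from itertools import product

def H(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# hypothetical joint distribution on {0,1}^3 (probabilities sum to 1)
states = list(product((0, 1), repeat=3))
q = dict(zip(states, (0.20, 0.05, 0.10, 0.15, 0.05, 0.20, 0.10, 0.15)))

def H_without(i):
    """Entropy of the marginal with the i-th coordinate summed out."""
    m = {}
    for x, p in q.items():
        key = x[:i] + x[i + 1:]
        m[key] = m.get(key, 0.0) + p
    return H(m.values())

n = 3
lhs = H(q.values())
rhs = sum(H_without(i) for i in range(n)) / (n - 1)
print(lhs <= rhs + 1e-12)  # True, as Han's inequality guarantees
```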

We now generalize the notion of entropy of a probability measure.

**Definition**

The  relative entropy  of a probability measure  $\mu$  with respect to a dominating measure  $\nu \gg \mu$  (that is,  $\nu(A) = 0 \Rightarrow \mu(A) = 0$),  not necessarily normalized to 1, is

$\displaystyle {D}(\mu\|\nu) = \int \text{d}\mu \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, (0)$

(note that since  $\nu$  need not be a probability measure, this quantity may be negative).  In particular, taking  $\nu$  to be the counting/Lebesgue measure on  $\mathcal{X}$  recovers the usual discrete/differential entropy of  $X \in \mathcal{X}:$

$\displaystyle H(X) = -D\left(\mu_X \| \nu_{\mathcal{X}}\right),\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, (1)$

and conditional entropy as

$\displaystyle H(Y|X) = -D\left(\mu_{X,Y}\|\mu_X\times\nu_{\mathcal{Y}}\right).\,\,\,\,\,\,\,\,\,\, (2)$

With this notation, we see that the identity

$H(X, Y) = H(X) + H(Y|X)$

holds solely due to the fact that both the Lebesgue and counting measures are product measures, i.e.  $\displaystyle \nu_{\mathcal{X}\times\mathcal{Y}} = \nu_{\mathcal{X}} \times \nu_{\mathcal{Y}}.$  Given any probability measure  $\mu$  dominated by  $\nu,$  we may now define the entropy of  $\mu$  relative to  $\nu$  by (1), and the conditional entropy by (2), thus retaining the chain rule. Besides,  $\displaystyle \int_{\mathcal{X}\times\mathcal{Y}} \text{d}\mu_{X,Y} = 1$  yields

$\displaystyle H(X) - H(X|Y) = \int_{\mathcal{X}\times\mathcal{Y}}\text{d}\mu_{X,Y} \log\left(\frac{\text{d}\mu_{X,Y}}{\text{d}\mu_X \,\text{d}\mu_Y}\right) = D(\mu_{X,Y}\|\mu_X \times \mu_Y) \ge 0,$

since  $\displaystyle \mu_X \times \mu_Y$  is a probability measure as is  $\displaystyle\mu_{X,Y}.$  As a result, we may repeat the same chaining argument as in the proof of Han’s inequality to obtain

**Theorem (Han’s inequality for relative entropies)**

Given any probability measure  $Q$  on the product space $\mathcal{X} = \mathcal{X}_1 \times ... \times \mathcal{X}_n$ and a dominating product measure  $P \gg Q,$  it holds that

$\displaystyle D(Q\|P) \ge \frac{1}{n-1} \sum_{i=1}^n D(Q^{(i)}\|P^{(i)}),$

where  $P^{(i)}, Q^{(i)}$  denote the marginals of  $P, Q$  on  $\prod_{j \ne i} \mathcal{X}_j,$  i.e. with the $i$-th coordinate integrated out.
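This version of Han's inequality can also be checked on a toy example; a sketch with a hypothetical $Q$ on $\{0,1\}^3$ and the uniform product measure as $P$ (all names and numbers are ours):

```python
import math
from itertools import product

def KL(q, p):
    """Relative entropy D(q||p) for aligned lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

states = list(product((0, 1), repeat=3))
Q = dict(zip(states, (0.20, 0.05, 0.10, 0.15, 0.05, 0.20, 0.10, 0.15)))
P = {x: 1 / 8 for x in states}  # uniform product measure on {0,1}^3

def marginal(dist, i):
    """Marginalize out the i-th coordinate (keys kept in a common order)."""
    m = {}
    for x, p in dist.items():
        key = x[:i] + x[i + 1:]
        m[key] = m.get(key, 0.0) + p
    return list(m.values())

n = 3
lhs = KL(list(Q.values()), list(P.values()))
rhs = sum(KL(marginal(Q, i), marginal(P, i)) for i in range(n)) / (n - 1)
print(lhs >= rhs - 1e-12)  # True
```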

Let us return to definition (0) and now suppose that  $\nu$  is normalized (i.e. is a probability measure), which means that

$\displaystyle {D}(\mu\|\nu) = \int \text{d}\mu \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right) =\int \text{d}\nu\, \frac{\text{d}\mu}{\text{d}\nu} \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right) = \mathbb{E}_\nu \, [Z \log Z]$

for the random variable $\displaystyle Z = \frac{\text{d}\mu}{\text{d}\nu}.$  Note that if  $\mu$  is also a probability measure, then  $Z$  is in fact a likelihood ratio with unit expectation, hence

$\displaystyle {D}(\mu\|\nu) =\mathbb{E}_\nu \, [Z \log Z] \ge \mathbb{E}_\nu [Z] \log \mathbb{E}_\nu [Z] = 0$

due to the convexity of  $\psi(z) = z \log z.$  If  $\mu$  isn’t normalized to one, then still  $\displaystyle \mathbb{E}_\nu \, [Z \log Z] - \mathbb{E}_\nu [Z] \log \mathbb{E}_\nu [Z] \ge 0,$  which motivates the following

**Definition**

The $\psi$-entropy  $\text{Ent}[Z]$  of a non-negative random variable (‘improper likelihood ratio’) $Z$  is

$\text{Ent}[Z]=\mathbb{E} \, [Z \log Z] - \mathbb{E} [Z] \log \mathbb{E}[Z].$
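The definition translates directly into code; a minimal Python sketch for a $Z$ taking finitely many values (the helper name `ent` is ours):

```python
import math

def ent(z, w):
    """psi-entropy Ent[Z] = E[Z log Z] - E[Z] log E[Z];
    z: values of the non-negative r.v. Z, w: their probabilities."""
    EZ = sum(wi * zi for zi, wi in zip(z, w))
    EZlogZ = sum(wi * zi * math.log(zi) for zi, wi in zip(z, w) if zi > 0)
    return EZlogZ - EZ * math.log(EZ)

print(ent([2.0, 2.0], [0.5, 0.5]))      # 0.0 for a constant Z
print(ent([0.5, 1.5], [0.5, 0.5]) > 0)  # True for a non-constant Z
```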

Remark 1.  Given a dominating measure $\nu,$  any non-negative random variable may be viewed as an ‘improper likelihood ratio’  $\displaystyle {\text{d}\mu / \text{d}\nu},$  where  $\mu$  need not be normalized to 1; hence the definition.

Remark 2.  $\text{Ent}[Z]$  behaves much like a norm, capturing the variation of the random variable $Z$ with respect to the common dominating measure.  Indeed, as we have already demonstrated,  $\text{Ent}[Z] \ge 0,$  with equality iff  $Z$  is constant (almost surely). Second,

$\displaystyle \text{Ent}[Z] = M \cdot D(\mu_1 \| \nu),$

where  $\displaystyle M = \mathbb{E}[Z] = \int \text{d}\mu,$  and  $\mu_1 = \mu/M.$  As a result,

$\displaystyle \text{Ent}[\alpha Z] = \alpha\, \text{Ent}[Z], \quad \alpha \ge 0.$

Finally, convexity of  $\displaystyle \text{Ent}[Z]$  is also easy to check.
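Homogeneity is easy to verify numerically; a small sketch, re-deriving $\text{Ent}$ from its definition (names and numbers are ours):

```python
import math

def ent(z, w):
    """Ent[Z] = E[Z log Z] - E[Z] log E[Z] for a finitely supported Z."""
    EZ = sum(wi * zi for zi, wi in zip(z, w))
    return sum(wi * zi * math.log(zi) for zi, wi in zip(z, w) if zi > 0) - EZ * math.log(EZ)

z, w = [0.5, 2.0, 1.5], [0.2, 0.5, 0.3]
alpha = 3.0
# Ent[alpha Z] = alpha Ent[Z]: both log-terms pick up the same alpha log(alpha) E[Z]
print(abs(ent([alpha * zi for zi in z], w) - alpha * ent(z, w)) < 1e-12)  # True
```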

We are now ready to state the tensorization result. For a non-negative function  $f$  of  $n$  variables, define

$\displaystyle \text{Ent}_i\, f(x_1,\, ...,\, x_n) := \text{Ent}[f(x_1,\, ...,\, x_{i-1},\, X_i,\, x_{i+1},\, ...,\, x_n)],$

i.e. the $\psi$-entropy with respect to the $i$-th coordinate alone, with the remaining coordinates held fixed.

**Theorem (Tensorization of  $\psi$-entropy)**

For any non-negative function  $f$  of independent  $X_1,\, ...,\, X_n$  we have

$\displaystyle \text{Ent}\left[f(X_1,\, ...,\, X_n)\right] \le \mathbb{E}\left[\sum_{i=1}^n\text{Ent}_i\, f(X_1,\, ...,\, X_n) \right].$

Proof.  First, without loss of generality, assume that  $\mathbb{E} f = 1,$  i.e.

$\displaystyle f = \frac{\text{d}Q}{\text{d}P}$

for some probability measure  $Q \ll P,$  where  $P = P_1 \times ... \times P_n$  is a product probability measure since  $X_1,\, ...,\, X_n$  are independent. Rearranging Han’s inequality, we have

$\displaystyle \text{Ent}\left[f(X_1,\, ...,\, X_n)\right] = D(Q\|P) \le \sum_{i=1}^n \left[ D(Q\|P)-D(Q^{(i)}\|P^{(i)}) \right].$

On the other hand, a short direct computation verifies that

$\mathbb{E} \left[\text{Ent}_i\, f(X_1,\, ...,\, X_n)\right] = D(Q\|P)-D(Q^{(i)}\|P^{(i)}).$

The general statement for  $\mathbb{E}f \ne 1$  follows by scaling due to Remark 2.  ♦
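The theorem can be confirmed on a toy example; a sketch with two independent (hypothetical) Bernoulli coordinates and an arbitrary non-negative $f$ (all names and numbers are ours):

```python
import math
from itertools import product

def ent(z, w):
    """Ent[Z] = E[Z log Z] - E[Z] log E[Z] for a finitely supported Z."""
    EZ = sum(wi * zi for zi, wi in zip(z, w))
    return sum(wi * zi * math.log(zi) for zi, wi in zip(z, w) if zi > 0) - EZ * math.log(EZ)

p1, p2 = [0.3, 0.7], [0.6, 0.4]  # laws of independent X1, X2
f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}  # arbitrary f >= 0

states = list(product((0, 1), repeat=2))
w = [p1[x1] * p2[x2] for x1, x2 in states]
lhs = ent([f[s] for s in states], w)

# E[Ent_1 f + Ent_2 f]: Ent over one coordinate, averaged over the other
rhs = sum(p2[x2] * ent([f[(x1, x2)] for x1 in (0, 1)], p1) for x2 in (0, 1)) \
    + sum(p1[x1] * ent([f[(x1, x2)] for x2 in (0, 1)], p2) for x1 in (0, 1))

print(lhs <= rhs + 1e-12)  # True by tensorization
```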