Information theory background I

This is the first in a series of two posts providing background on information theory. The itinerary for today is as follows: we will start by introducing the basic information quantities, and then prove  tensorization  statements for them, i.e. that they behave well under product measures. Tensorization, in turn, will allow us to prove a much stronger counterpart of McDiarmid’s inequality in the next post.

Basic information quantities and relations

We start with the usual  entropy  of a discrete random variable  X \in \mathcal{X} = \{x_i\}_{i \in \mathcal{I}}  with distribution  p_i = P(X = x_i),  defined as

\displaystyle H(X) = \mathbb{E}_{P} \log \left(\frac{1}{P(X)}\right) = -\sum_{i \in \mathcal{I}} p_i \log p_i.
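As a quick illustration, here is a minimal Python sketch of this definition (using natural logarithms, so entropy is measured in nats rather than bits; the example distributions are made up):

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_i p_i log p_i of a distribution
    given as a list of probabilities (natural log, i.e. nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))  # fair coin: log 2 ≈ 0.693
print(entropy([1.0]))       # deterministic variable: 0.0
```

The convention  0 \log 0 = 0  is implemented by skipping zero probabilities in the sum.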

Note that  H(X)  depends only on  (p_i)  and characterizes the distribution rather than the random variable  X  itself. The entropy of the ‘composite’ random variable  (X, Y)  with joint distribution  \left(\pi_{ij}\right)  is called the  joint entropy  of X, Y:

\displaystyle H(X,Y) = -\sum_{(i,j) \in \mathcal{I} \times \mathcal{J}} \pi_{ij} \log \pi_{ij},

and  conditional entropy

 \displaystyle H(Y|X) = H(X,Y)-H(X) = -\sum_{(i,j) \in \mathcal{I} \times \mathcal{J}} \pi_{ij} \log \pi_{j|i},

where  \pi_{j|i} = \frac{\pi_{ij}}{p_i}  is the conditional probability that  Y=y_j  given  X = x_i.  That is, we need  H(X,Y)  bits to record  (X,Y),  but this may also be done in an alternative way: code  X  first, and then code  Y  separately for each value of  X.  Another useful quantity is the mutual information between  X  and  Y,

 \displaystyle I(X,Y)=H(Y)-H(Y|X)=H(X)-H(X|Y) \ge 0,

i.e.  conditioning may only reduce the entropy. The standard Venn diagram (such as the one on Wikipedia) depicts the relations between all these quantities.
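These identities are easy to verify numerically. Below is a Python sketch (with a made-up 2×2 joint distribution and natural logarithms) checking the chain rule  H(X,Y) = H(X) + H(Y|X)  and the non-negativity of mutual information:

```python
import math

def H(p):
    """Entropy of a distribution given as a list of probabilities (in nats)."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Made-up joint distribution pi[i][j] = P(X = x_i, Y = y_j)
pi = [[0.3, 0.1],
      [0.2, 0.4]]

p_x = [sum(row) for row in pi]               # marginal of X
p_y = [sum(col) for col in zip(*pi)]         # marginal of Y
H_joint = H([q for row in pi for q in row])  # H(X, Y)

# H(Y|X) = -sum_{ij} pi_ij log pi_{j|i}
H_y_given_x = -sum(pi[i][j] * math.log(pi[i][j] / p_x[i])
                   for i in range(2) for j in range(2))

# Chain rule: H(X, Y) = H(X) + H(Y|X)
assert abs(H_joint - (H(p_x) + H_y_given_x)) < 1e-12

# Mutual information I(X, Y) = H(Y) - H(Y|X) >= 0
assert H(p_y) - H_y_given_x >= -1e-12
```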


In this part of the post, we will formulate tensorization inequalities for different notions of entropy. The first of them is for the usual entropy of a discrete probability measure.

Han’s inequality

For any random variables X_1,\, ...,\, X_n,

\displaystyle H(X_1,\, ...,\, X_n) \le \frac{1}{n-1} \sum_{i=1}^n H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n).

Remark.  Equality is attained for independent  (X_i)_{i=1}^n.

Proof.  Start with

H(X_1,\, ...,\, X_n)= H(X_1,\, ..., X_{i-1}, X_{i+1},\, ...,\, X_n)\, +\, H(X_i\,|\,X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n)

 \le H(X_1,\, ..., X_{i-1}, X_{i+1},\, ...,\, X_n)\,+\,H(X_i\,|\,X_1,\, ...,\, X_{i-1})

and sum over  i.  By the chain rule, the conditional terms  H(X_i\,|\,X_1,\, ...,\, X_{i-1})  add up to exactly  H(X_1,\, ...,\, X_n),  so  n\,H(X_1,\, ...,\, X_n) \le \sum_{i=1}^n H(X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n) + H(X_1,\, ...,\, X_n),  and rearranging gives the claim. ♦
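A numeric sanity check of Han's inequality (a Python sketch; the joint distribution of three binary variables is generated at random, so this is an illustration rather than a proof):

```python
import itertools, math, random

random.seed(0)

# Random joint distribution of (X1, X2, X3), each binary
w = [random.random() for _ in range(8)]
total = sum(w)
p = {xs: wi / total
     for xs, wi in zip(itertools.product((0, 1), repeat=3), w)}

def H(dist):
    """Entropy of a distribution given as a dict {outcome: probability}."""
    return -sum(q * math.log(q) for q in dist.values() if q > 0)

def marginal_without(dist, i):
    """Marginalize out the i-th coordinate."""
    out = {}
    for xs, q in dist.items():
        key = xs[:i] + xs[i+1:]
        out[key] = out.get(key, 0.0) + q
    return out

lhs = H(p)
rhs = sum(H(marginal_without(p, i)) for i in range(3)) / (3 - 1)
assert lhs <= rhs + 1e-12  # Han's inequality
```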

We now generalize the notion of entropy of a probability measure.


Relative entropy of a probability measure  \mu  with respect to a dominating measure  \nu \gg \mu,  i.e. one with  \nu(A) = 0 \Rightarrow \mu(A) = 0,  which need not be normalized to 1, is

\displaystyle {D}(\mu\|\nu) = \int \text{d}\mu \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right)\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, (0)

(note that since  \nu  need not be a probability measure, this quantity may be negative).  In particular, taking  \nu  to be the counting/Lebesgue measure on  \mathcal{X}  recovers the usual discrete/differential entropy of  X \in \mathcal{X}:

\displaystyle H(X) = -D\left(\mu_X \| \nu_{\mathcal{X}}\right),\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, (1)

and conditional entropy as

\displaystyle H(Y|X) = -D\left(\mu_{X,Y}\|\mu_X\times\nu_{\mathcal{Y}}\right).\,\,\,\,\,\,\,\,\,\, (2)

With this notation, we see that the identity

H(X, Y) = H(X) + H(Y|X)

holds solely due to the fact that both the Lebesgue and the counting measure are product measures, i.e.  \displaystyle \nu_{\mathcal{X}\times\mathcal{Y}} = \nu_{\mathcal{X}} \times \nu_{\mathcal{Y}}.  Given any probability measure  \mu  dominated by  \nu,  we may now define the entropy of  \mu  relative to  \nu  by (1), and the conditional entropy by (2), thus retaining the chain rule. Moreover,  \displaystyle \int_{\mathcal{X}\times\mathcal{Y}} \text{d}\mu_{X,Y} = 1  yields

\displaystyle H(X) - H(X|Y) = \int_{\mathcal{X}\times\mathcal{Y}}\text{d}\mu_{X,Y} \log\left(\frac{\text{d}\mu_{X,Y}}{\text{d}\mu_X \,\text{d}\mu_Y}\right) = D(\mu_{X,Y}\|\mu_X \times \mu_Y) \ge 0,

since  \displaystyle \mu_X \times \mu_Y  is a probability measure as is  \displaystyle\mu_{X,Y}.  As a result, we may repeat the same chaining argument as in the proof of Han’s inequality to obtain

Theorem (Han’s inequality for relative entropies)

Given any probability measure  Q  on the product space  \mathcal{X} = \mathcal{X}_1 \times ... \times \mathcal{X}_n  and a dominating product measure  P \gg Q,  we have

\displaystyle D(Q\|P) \ge \frac{1}{n-1} \sum_{i=1}^n D(Q^{(i)}\|P^{(i)}),

where  P^{(i)}, Q^{(i)}  denote the marginals of  P, Q  with the  i-th coordinate integrated out, i.e. the distributions of  (X_1,\, ...,\, X_{i-1}, X_{i+1},\, ...,\, X_n)  under  P, Q.
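Again, the inequality is easy to test numerically. The Python sketch below draws a random distribution  Q  on  \{0,1\}^3  and a random product distribution  P,  and checks  D(Q\|P) \ge \frac{1}{2}\sum_i D(Q^{(i)}\|P^{(i)}):

```python
import itertools, math, random

random.seed(1)

n = 3
support = list(itertools.product((0, 1), repeat=n))

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

# Q: an arbitrary joint distribution; P: a product distribution on {0,1}^3
q = dict(zip(support, normalize([random.random() for _ in support])))
marg = [normalize([random.random(), random.random()]) for _ in range(n)]
p = {xs: math.prod(marg[i][xs[i]] for i in range(n)) for xs in support}

def D(mu, nu):
    """Relative entropy D(mu || nu) for dicts {outcome: mass}."""
    return sum(m * math.log(m / nu[k]) for k, m in mu.items() if m > 0)

def drop(dist, i):
    """Marginalize out the i-th coordinate."""
    out = {}
    for xs, v in dist.items():
        key = xs[:i] + xs[i+1:]
        out[key] = out.get(key, 0.0) + v
    return out

lhs = D(q, p)
rhs = sum(D(drop(q, i), drop(p, i)) for i in range(n)) / (n - 1)
assert lhs >= rhs - 1e-12  # Han's inequality for relative entropies
```

Note that  P  being a product measure is essential; the inequality can fail for a general dominating  P.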

Returning to definition (0), suppose now that  \nu  is normalized (i.e. is a probability measure), which means that

\displaystyle {D}(\mu\|\nu) = \int \text{d}\mu \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right) =\int \text{d}\nu\, \frac{\text{d}\mu}{\text{d}\nu} \, \log\left(\frac{\text{d}\mu}{\text{d}\nu}\right) = \mathbb{E}_\nu \, [Z \log Z]

for the random variable  \displaystyle Z = \frac{\text{d}\mu}{\text{d}\nu}.  Note that if  \mu  is also a probability measure, then  Z  is in fact a likelihood ratio with unit expectation, hence

\displaystyle {D}(\mu\|\nu) =\mathbb{E}_\nu \, [Z \log Z] \ge \mathbb{E}_\nu [Z] \log \mathbb{E}_\nu [Z] = 0

by Jensen’s inequality applied to the convex function  \psi(z) = z \log z.  If  \mu  is not normalized to one, we still have  \displaystyle \mathbb{E}_\nu \, [Z \log Z] - \mathbb{E}_\nu [Z] \log \mathbb{E}_\nu [Z] \ge 0,  which motivates the following definition.


The \psi-entropy  \text{Ent}[Z]  of a non-negative random variable (‘improper likelihood ratio’) Z  is

\text{Ent}[Z]=\mathbb{E} \, [Z \log Z] - \mathbb{E} [Z] \log \mathbb{E}[Z].

Remark 1.  Given a dominating measure  \nu,  any non-negative random variable may be viewed as an ‘improper likelihood ratio’  \displaystyle {\text{d}\mu / \text{d}\nu},  where  \mu  need not be normalized to 1; hence the definition.

Remark 2.  \text{Ent}[Z]  behaves much like a norm, capturing the variation of  Z  with respect to the common dominating measure. Indeed, as we have already demonstrated,  \text{Ent}[Z] \ge 0,  with equality if and only if  Z  is constant (almost surely). Second,

\displaystyle \text{Ent}[Z] = M \cdot D(\mu_1 \| \nu),

where  \displaystyle M = \mathbb{E}[Z] = \int \text{d}\mu,  and  \mu_1 = \mu/M.  As a result,

\displaystyle \text{Ent}[\alpha Z] = \alpha\, \text{Ent}[Z].

Finally, convexity of  \displaystyle \text{Ent}[Z]  is also easy to check.
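Both properties (non-negativity and homogeneity  \text{Ent}[\alpha Z] = \alpha\, \text{Ent}[Z])  can be checked numerically; here is a Python sketch for a discrete non-negative  Z  (the values and weights are made up):

```python
import math, random

random.seed(2)

def ent(z, w):
    """Ent[Z] = E[Z log Z] - E[Z] log E[Z] for Z taking values
    z[k] with probabilities w[k]."""
    ez = sum(wi * zi for wi, zi in zip(w, z))
    ezlz = sum(wi * zi * math.log(zi) for wi, zi in zip(w, z) if zi > 0)
    return ezlz - ez * math.log(ez)

z = [random.uniform(0.1, 2.0) for _ in range(5)]  # non-negative Z
w = [0.2] * 5                                     # uniform weights

a = 3.7
assert ent(z, w) >= 0                                         # non-negativity
assert abs(ent([a * zi for zi in z], w) - a * ent(z, w)) < 1e-9  # homogeneity
```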

We are now ready to state the tensorization result. For a function  f  of  n  variables, define

\displaystyle \text{Ent}_i\, f(x_1,\, ...,\, x_n) := \text{Ent}[f(x_1,\, ...,\, x_{i-1},\, X_i,\, x_{i+1},\, ...,\, x_n)],

where the entropy on the right is taken with respect to  X_i  alone, the remaining coordinates being held fixed.

Theorem  (Tensorization of  \psi-entropy)

For any non-negative function  f  of independent  X_1,\, ...,\, X_n  it holds

\displaystyle \text{Ent}\left[f(X_1,\, ...,\, X_n)\right] \le \mathbb{E}\left[\sum_{i=1}^n\text{Ent}_i\, f(X_1,\, ...,\, X_n) \right].

Proof.  First, without loss of generality, assume that  \mathbb{E} f = 1,  i.e.

\displaystyle f = \frac{\text{d}Q}{\text{d}P}

for some probability measure  Q \ll P,  where  P = P_1 \times ... \times P_n  is a product probability measure since  X_1,\, ...,\, X_n  are independent. Rearranging Han’s inequality, we have

\displaystyle \text{Ent}\left[f(X_1,\, ...,\, X_n)\right] = D(Q\|P) \le \sum_{i=1}^n \left[D(Q\|P)-D(Q^{(i)}\|P^{(i)})\right].

On the other hand, a direct computation shows that

\mathbb{E} \left[\text{Ent}_i\, f(X_1,\, ...,\, X_n)\right] = D(Q\|P)-D(Q^{(i)}\|P^{(i)}).

The general statement for  \mathbb{E}f \ne 1  follows by scaling due to Remark 2.  ♦
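To see the theorem in action, here is a Python sketch checking the tensorization inequality for two independent Bernoulli variables and an arbitrary made-up non-negative  f:

```python
import itertools, math

def ent(pairs):
    """Ent[Z] for Z taking value z with probability w, pairs = [(w, z), ...]."""
    ez = sum(w * z for w, z in pairs)
    ezlz = sum(w * z * math.log(z) for w, z in pairs if z > 0)
    return ezlz - ez * math.log(ez)

# Independent X1 ~ Bernoulli(0.3), X2 ~ Bernoulli(0.6); arbitrary nonneg f
p1, p2 = [0.7, 0.3], [0.4, 0.6]
f = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 0.5, (1, 1): 2.0}

# Left-hand side: Ent[f(X1, X2)] under the product measure
lhs = ent([(p1[x1] * p2[x2], f[x1, x2])
           for x1, x2 in itertools.product((0, 1), repeat=2)])

# Right-hand side: E[sum_i Ent_i f] -- entropy in one coordinate at a
# time with the other fixed, then averaged over the fixed coordinate
ent1 = sum(p2[x2] * ent([(p1[x1], f[x1, x2]) for x1 in (0, 1)])
           for x2 in (0, 1))
ent2 = sum(p1[x1] * ent([(p2[x2], f[x1, x2]) for x2 in (0, 1)])
           for x1 in (0, 1))

assert lhs <= ent1 + ent2 + 1e-12  # tensorization of psi-entropy
```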

About Dmitry Ostrovsky

PhD student with interests in statistics, optimization, and machine learning.
This entry was posted in Course on Concentration Inequalities.
