This is the first in a series of two posts whose goal is to provide background on information theory. For today, the itinerary is as follows: we will start by introducing basic information quantities, then proceed to prove *tensorization* statements for them, i.e. that they behave well under product measures. The tensorization, in turn, will allow us to prove a much stronger counterpart of McDiarmid's inequality in the next post.

**Basic information quantities and relations**

We start with the usual *entropy* of a discrete random variable $X$ with distribution $P_X$, defined as

$$H(X) = -\sum_x P_X(x) \log P_X(x).$$

Note that $H(X)$ depends only on $P_X$ and characterizes the distribution rather than the random variable itself. The entropy of the 'composite' random variable $(X, Y)$ with joint distribution $P_{XY}$ is called the *joint entropy* of $(X, Y)$:

$$H(X, Y) = -\sum_{x, y} P_{XY}(x, y) \log P_{XY}(x, y),$$

and the *conditional entropy* is

$$H(Y \mid X) = H(X, Y) - H(X) = -\sum_{x, y} P_{XY}(x, y) \log P_{Y|X}(y \mid x),$$

where $P_{Y|X}(y \mid x)$ is the conditional probability that $Y = y$ given $X = x$. That is, we need $H(X, Y)$ bits to record $(X, Y)$, but it may be done in an alternative way: by coding $X$ first, and then coding $Y$ for each value of $X$ separately. Another useful quantity is the *mutual information* between $X$ and $Y$:

$$I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X) \ge 0,$$

i.e. conditioning may only reduce the entropy. This Venn diagram taken from Wikipedia depicts the relations between all these quantities:
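To make these relations concrete, here is a small numerical sanity check (a sketch using numpy; the joint distribution below is an arbitrary example, not taken from the text):

```python
import numpy as np

# A 2x3 joint pmf of (X, Y); any nonnegative matrix summing to 1 works.
P = np.array([[0.10, 0.20, 0.05],
              [0.25, 0.10, 0.30]])

def H(p):
    """Shannon entropy (in bits) of a pmf given as an array."""
    p = p[p > 0]  # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p))

H_XY = H(P)                  # joint entropy H(X, Y)
H_X = H(P.sum(axis=1))       # marginal entropy H(X)
H_Y = H(P.sum(axis=0))       # marginal entropy H(Y)
H_Y_given_X = H_XY - H_X     # chain rule: H(Y|X) = H(X,Y) - H(X)
I_XY = H_X + H_Y - H_XY      # mutual information I(X; Y)

# Conditioning may only reduce entropy, equivalently I(X; Y) >= 0:
assert I_XY >= 0
assert H_Y_given_X <= H_Y + 1e-12
```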

**Tensorization**

In this part of the post, we will formulate tensorization inequalities for different notions of entropy. The first of them concerns the usual Shannon entropy of discrete random variables.

**Han's inequality.** For any random variables $X_1, \dots, X_n$,

$$H(X_1, \dots, X_n) \le \frac{1}{n-1} \sum_{i=1}^{n} H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$

**Remark.** Equality is attained for independent $X_1, \dots, X_n$.

**Proof.** Start with

$$H(X_1, \dots, X_n) = H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$$

$$\le H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}),$$

where the inequality holds since conditioning may only reduce entropy, and sum over $1 \le i \le n$: by the chain rule the conditional terms add up to $H(X_1, \dots, X_n)$, and rearranging gives the claim. ♦
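Han's inequality is easy to verify numerically. The following sketch (in numpy, with an arbitrary random joint distribution of three binary variables, entropies in nats) checks it:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy (in nats) of a pmf given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Random joint pmf of three binary variables X1, X2, X3.
P = rng.random((2, 2, 2))
P /= P.sum()

n = 3
# Entropies of the (n-1)-dimensional marginals, leaving coordinate i out.
H_leave_one_out = [H(P.sum(axis=i).ravel()) for i in range(n)]

# Han's inequality: H(X1,...,Xn) <= (1/(n-1)) * sum_i H(X^{(i)})
assert H(P.ravel()) <= sum(H_leave_one_out) / (n - 1) + 1e-12
```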

We now generalize the notion of entropy of a probability measure.

**Definition.** The *relative entropy* of a probability measure $P$ with respect to a dominating measure $\mu$, which is not necessarily normalized to 1, is

$$H(P \,\|\, \mu) = -\int \log \frac{dP}{d\mu} \, dP \qquad (0)$$

(note that since $\mu$ may be not a probability measure, this quantity might be negative). In particular, taking $\mu$ as the counting/Lebesgue measure recovers the usual discrete/differential entropy of $P$.
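For measures on a finite set, the relative entropy reduces to a finite sum, and both special cases can be checked numerically (a sketch; the distributions below are arbitrary examples):

```python
import numpy as np

def rel_entropy(p, mu):
    """H(P||mu) = -sum_x log((dP/dmu)(x)) P(x) on a finite support."""
    z = p / mu                 # the density dP/dmu
    mask = p > 0               # convention: 0 log 0 = 0
    return -np.sum(p[mask] * np.log(z[mask]))

p = np.array([0.2, 0.5, 0.3])

# Counting measure (mu = 1 at each point) recovers the discrete entropy:
counting = np.ones(3)
assert np.isclose(rel_entropy(p, counting), -np.sum(p * np.log(p)))

# Against a probability measure mu, H(P||mu) = -KL(P||mu) <= 0:
mu = np.array([1/3, 1/3, 1/3])
assert rel_entropy(p, mu) <= 1e-12
```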

With this notation, we see that the identity

$$H(X, Y) = H(X) + H(Y \mid X)$$

holds solely due to the fact that both the Lebesgue and the counting measure are *product measures*, i.e. $\mu = \mu_1 \otimes \mu_2$. Given any probability measure $P = P_{XY}$ dominated by a product measure $\mu = \mu_1 \otimes \mu_2$, we may now *define* the entropy of $X$ relative to $\mu_1$ by

$$H(X) := H(P_X \,\|\, \mu_1), \qquad (1)$$

and the conditional entropy by

$$H(Y \mid X) := -\int \log \frac{dP_{Y|X=x}}{d\mu_2}(y) \, dP_{XY}(x, y), \qquad (2)$$

thus retaining the chain formula. Besides, non-negativity of the Kullback–Leibler divergence yields

$$H(Y \mid X) \le H(Y),$$

since $P_{Y|X=x}$ is a probability measure, as is $P_Y$. As a result, we may repeat the same chaining argument as in the proof of Han's inequality to obtain

**Theorem (Han's inequality for relative entropies).** Given any probability measure $P$ on the product space $\mathcal{X}_1 \times \dots \times \mathcal{X}_n$ and a dominating product measure $\mu = \mu_1 \otimes \dots \otimes \mu_n$, it holds that

$$H(P \,\|\, \mu) \le \frac{1}{n-1} \sum_{i=1}^{n} H(P^{(i)} \,\|\, \mu^{(i)}),$$

where $P^{(i)}$ and $\mu^{(i)}$ denote $P$ and $\mu$ marginalized over $x_i$, i.e. with the $i$-th coordinate integrated out.

Return to the definition (0) and now suppose that $\mu$ is normalized (i.e. a probability measure), which means that

$$H(P \,\|\, \mu) = -\mathbb{E}_\mu [Z \log Z]$$

for the random variable $Z = \frac{dP}{d\mu}$. Note that if $P$ is also a probability measure, then $Z$ is in fact the likelihood ratio, having the unit expectation $\mathbb{E}_\mu Z = 1$, hence

$$H(P \,\|\, \mu) = -\mathbb{E}_\mu [Z \log Z] \le -\mathbb{E}_\mu Z \log \mathbb{E}_\mu Z = 0$$

due to the convexity of $z \mapsto z \log z$. If $P$ isn't normalized to one, then still $\mathbb{E}_\mu [Z \log Z] \ge \mathbb{E}_\mu Z \log \mathbb{E}_\mu Z$, which motivates the following

**Definition.** The *$\mathrm{Ent}$-entropy* of a non-negative random variable $Z$ (an 'improper likelihood ratio') is

$$\mathrm{Ent}(Z) := \mathbb{E}[Z \log Z] - \mathbb{E} Z \log \mathbb{E} Z.$$

**Remark 1.** Given a dominating measure $\mu$, any non-negative random variable $Z$ may be considered as the 'improper likelihood ratio' $Z = \frac{dP}{d\mu}$, where $P$ may be not normalized to 1, hence the definition.

**Remark 2.** $\mathrm{Ent}(Z)$ behaves much like a norm capturing the variation of random variables with respect to the common dominating measure. Indeed, as we have already demonstrated, $\mathrm{Ent}(Z) \ge 0$, with equality if and only if $Z$ is a fixed constant (almost surely). Second, it is positively homogeneous:

$$\mathrm{Ent}(cZ) = c \, \mathrm{Ent}(Z),$$

where $c > 0$ and $Z \ge 0$. As a result, $\mathrm{Ent}(Z) = \mathbb{E} Z \cdot \mathrm{Ent}(\bar{Z})$ for the normalized ratio $\bar{Z} = Z / \mathbb{E} Z$. Finally, convexity of $Z \mapsto \mathrm{Ent}(Z)$ is also easy to check.
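The claimed properties of $\mathrm{Ent}$ are easy to test numerically on a finite probability space (a sketch; the weights and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def ent(z, w):
    """Ent(Z) = E[Z log Z] - E[Z] log E[Z] under probability weights w."""
    ez = np.sum(w * z)
    return np.sum(w * z * np.log(z)) - ez * np.log(ez)

w = np.full(5, 0.2)         # uniform base probability measure
z = rng.random(5) + 0.1     # a positive 'improper likelihood ratio'

# Nonnegativity (Jensen for the convex z log z); zero for constant Z:
assert ent(z, w) >= -1e-12
assert abs(ent(np.full(5, 3.0), w)) < 1e-12

# Positive homogeneity: Ent(cZ) = c * Ent(Z) for c > 0:
c = 4.2
assert np.isclose(ent(c * z, w), c * ent(z, w))
```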

We are now ready to state the tensorization result. Consider the 'one-coordinate' entropies

$$\mathrm{Ent}_i(f) := \mathbb{E}_i [f \log f] - \mathbb{E}_i f \log \mathbb{E}_i f,$$

where $\mathbb{E}_i$ denotes the expectation over $X_i$ alone, with all the remaining coordinates fixed.

**Theorem (Tensorization of $\mathrm{Ent}$-entropy).** For any non-negative function $f = f(X_1, \dots, X_n)$ of independent random variables $X_1, \dots, X_n$, it holds that

$$\mathrm{Ent}(f) \le \mathbb{E} \sum_{i=1}^{n} \mathrm{Ent}_i(f).$$

**Proof.** First, without limiting generality, assume that $\mathbb{E} f = 1$, i.e.

$$f = \frac{dP}{d\mu}$$

for some probability measure $P$, where $\mu = \mu_1 \otimes \dots \otimes \mu_n$ is a product probability measure since $X_1, \dots, X_n$ are independent. Rearranging Han's inequality for relative entropies, we have

$$\mathrm{Ent}(f) = -H(P \,\|\, \mu) \le \sum_{i=1}^{n} \left[ H(P^{(i)} \,\|\, \mu^{(i)}) - H(P \,\|\, \mu) \right].$$

On the other hand, through some simple algebra we make sure that

$$H(P^{(i)} \,\|\, \mu^{(i)}) - H(P \,\|\, \mu) = \mathbb{E} \, \mathrm{Ent}_i(f).$$

The general statement for $\mathbb{E} f \ne 1$ follows by scaling due to Remark 2. ♦
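The tensorization inequality can likewise be checked numerically. The sketch below takes two independent coordinates, each uniform on a three-point set, and an arbitrary positive function $f$ (all concrete choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def ent(z, w):
    """Ent(Z) = E[Z log Z] - E[Z] log E[Z] under probability weights w."""
    ez = np.sum(w * z)
    return np.sum(w * z * np.log(z)) - ez * np.log(ez)

# Two independent coordinates X1, X2, each uniform on {0, 1, 2}.
w1 = np.full(3, 1/3)
w2 = np.full(3, 1/3)
W = np.outer(w1, w2)            # product measure on the 3x3 grid

F = rng.random((3, 3)) + 0.1    # a positive function f(X1, X2)

lhs = ent(F.ravel(), W.ravel())

# E[sum_i Ent_i(f)]: Ent over one coordinate with the other fixed, averaged.
rhs = 0.0
for j in range(3):              # Ent in X1 with X2 = j fixed
    rhs += w2[j] * ent(F[:, j], w1)
for i in range(3):              # Ent in X2 with X1 = i fixed
    rhs += w1[i] * ent(F[i, :], w2)

# Tensorization: Ent(f) <= E[ Ent_1(f) + Ent_2(f) ]
assert lhs <= rhs + 1e-12
```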