This is the first in a series of two posts whose goal is to provide some background on information theory. For today, the itinerary is as follows: we will start by introducing the basic information quantities, then prove tensorization statements for them — i.e. that they behave well under product measures. Tensorization, in turn, will allow us to prove a much stronger counterpart of McDiarmid’s inequality in the next post.
Basic information quantities and relations
We start with the usual entropy of a discrete random variable $X$ with distribution $P_X$, defined as
$$H(X) := -\sum_{x} P_X(x) \log P_X(x).$$
Note that $H(X)$ depends only on $P_X$ and characterizes the distribution rather than the random variable itself. The entropy of the ‘composite’ random variable $(X,Y)$ with joint distribution $P_{XY}$ is called the joint entropy of $X$ and $Y$:
$$H(X,Y) = -\sum_{x,y} P_{XY}(x,y) \log P_{XY}(x,y), \qquad (1)$$
and the conditional entropy of $Y$ given $X$ is
$$H(Y \mid X) := -\sum_{x,y} P_{XY}(x,y) \log P_{Y \mid X}(y \mid x) = H(X,Y) - H(X), \qquad (2)$$
where $P_{Y \mid X}(y \mid x)$ is the conditional probability that $Y = y$ given $X = x$. That is, we need $H(X,Y)$ bits to record $(X,Y)$, but it may be done in an alternative way: by coding $X$ first, and then coding $Y$ for each value of $X$ separately. Another useful quantity is the mutual information between $X$ and $Y$:
$$I(X;Y) := H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y) \ge 0,$$
i.e. conditioning may only reduce the entropy. This Venn diagram taken from Wikipedia depicts the relations between all these quantities:

[Figure: Venn diagram of $H(X)$, $H(Y)$, $H(X,Y)$, $H(X \mid Y)$, $H(Y \mid X)$, and $I(X;Y)$.]
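As a quick sanity check on these definitions, the following sketch (with a hypothetical joint distribution on $\{0,1\}^2$) computes all four quantities and verifies the chain rule and the non-negativity of mutual information:

```python
import math

# A small, hypothetical joint pmf P_{XY} on {0,1} x {0,1}.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(probs):
    """Shannon entropy (in bits) of a pmf given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distributions P_X and P_Y.
PX = [P[0, 0] + P[0, 1], P[1, 0] + P[1, 1]]
PY = [P[0, 0] + P[1, 0], P[0, 1] + P[1, 1]]

H_XY = H(P.values())        # joint entropy H(X, Y)
H_X, H_Y = H(PX), H(PY)
H_Y_given_X = H_XY - H_X    # chain rule: H(Y | X) = H(X, Y) - H(X)
I_XY = H_Y - H_Y_given_X    # mutual information I(X; Y)

assert I_XY >= 0            # I(X; Y) >= 0 ...
assert H_Y_given_X <= H_Y   # ... i.e. conditioning may only reduce entropy
```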
In this part of the post, we will formulate tensorization inequalities for different notions of entropy. The first of them is for the usual entropy of a discrete probability measure.
Theorem (Han’s inequality)
For any random variables $X_1, \ldots, X_n$, it holds
$$H(X_1, \ldots, X_n) \le \frac{1}{n-1} \sum_{i=1}^{n} H(X^{(i)}),$$
where $X^{(i)} := (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$.
Remark. Equality is attained for independent $X_1, \ldots, X_n$.
Proof. Start with
$$H(X_1, \ldots, X_n) = H(X^{(i)}) + H(X_i \mid X^{(i)}) \le H(X^{(i)}) + H(X_i \mid X_1, \ldots, X_{i-1}),$$
and sum over $i = 1, \ldots, n$: by the chain rule, the conditional entropies on the right sum to $H(X_1, \ldots, X_n)$, whence
$$n\,H(X_1, \ldots, X_n) \le \sum_{i=1}^{n} H(X^{(i)}) + H(X_1, \ldots, X_n). \; ♦$$
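Han’s inequality is easy to verify numerically. Here is a sketch that checks it on a randomly generated joint pmf of three binary variables (the random weights are arbitrary test data):

```python
import itertools
import math
import random

random.seed(0)

# A random joint pmf of n = 3 binary variables (arbitrary test data).
n = 3
keys = list(itertools.product([0, 1], repeat=n))
w = [random.random() for _ in keys]
P = {k: v / sum(w) for k, v in zip(keys, w)}

def H(probs):
    """Shannon entropy of a pmf given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def drop_coordinate(P, i):
    """Marginalize the i-th coordinate out of the joint pmf P."""
    M = {}
    for k, v in P.items():
        kk = k[:i] + k[i + 1:]
        M[kk] = M.get(kk, 0.0) + v
    return M

lhs = H(P.values())
rhs = sum(H(drop_coordinate(P, i).values()) for i in range(n)) / (n - 1)
assert lhs <= rhs + 1e-12    # Han's inequality
```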
We now generalize the notion of entropy of a probability measure. For a measure $\nu$ dominated by a measure $\mu$, define the entropy of $\nu$ relative to $\mu$ as
$$H_\mu(\nu) := -\int \frac{d\nu}{d\mu} \log \frac{d\nu}{d\mu} \, d\mu. \qquad (0)$$
Taking for $\mu$ the counting measure, we recover the usual discrete entropy, while the Lebesgue measure gives the differential entropy.
With this notation, we see that the identity
$$H(X,Y) = H(X) + H(Y \mid X)$$
holds solely due to the fact that both the Lebesgue and the counting measure are product measures, i.e. $\mu = \mu_1 \otimes \mu_2$. Given any probability measure $Q$ dominated by $\mu = \mu_1 \otimes \mu_2$, we may now define the entropy of $Q$ relative to $\mu$ by (1), and the conditional entropy by (2) — with sums replaced by integrals against $\mu$ and probabilities by densities — thus retaining the chain formula. Besides, Jensen’s inequality yields
$$H_\mu(Q_{Y \mid X}) \le H_\mu(Q_Y),$$
since $Q_{Y \mid X = x}$ is a probability measure, as is $Q_Y$. As a result, we may repeat the same chaining argument as in the proof of Han’s inequality to obtain
Theorem (Han’s inequality for relative entropies)
Given any probability measure $Q$ on the product space and a dominating product measure $\mu = \mu_1 \otimes \cdots \otimes \mu_n$, it holds
$$H_\mu(Q) \le \frac{1}{n-1} \sum_{i=1}^{n} H_{\mu^{(i)}}(Q^{(i)}),$$
where $Q^{(i)}$ and $\mu^{(i)} := \bigotimes_{j \ne i} \mu_j$ denote $Q$ and $\mu$ marginalized over $X_i$.
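Here is a numerical sketch of the theorem, with $Q$ a random probability measure on $\{0,1\}^3$ and $\mu$ a non-uniform product probability measure (both choices are hypothetical test data):

```python
import itertools
import math
import random

random.seed(1)

n = 3
keys = list(itertools.product([0, 1], repeat=n))

# A random probability measure Q on {0,1}^3 (arbitrary test data) ...
w = [random.random() for _ in keys]
Q = {k: v / sum(w) for k, v in zip(keys, w)}

# ... and a product probability measure mu = mu_1 x mu_2 x mu_3, each factor
# putting mass 0.3 on 0 and 0.7 on 1 (a hypothetical choice).
m = [0.3, 0.7]
mu = {k: math.prod(m[b] for b in k) for k in keys}
mu_i = {k: math.prod(m[b] for b in k)
        for k in itertools.product([0, 1], repeat=n - 1)}

def rel_H(q, mu):
    """Entropy of q relative to mu: -sum_x q(x) log(q(x) / mu(x))."""
    return -sum(q[k] * math.log(q[k] / mu[k]) for k in q if q[k] > 0)

def drop_coordinate(Q, i):
    """Marginalize the i-th coordinate out of Q."""
    M = {}
    for k, v in Q.items():
        kk = k[:i] + k[i + 1:]
        M[kk] = M.get(kk, 0.0) + v
    return M

lhs = rel_H(Q, mu)
rhs = sum(rel_H(drop_coordinate(Q, i), mu_i) for i in range(n)) / (n - 1)
assert lhs <= rhs + 1e-12    # Han's inequality for relative entropies
```

Note that for a probability measure $\mu$ the relative entropy is non-positive (it is minus the Kullback–Leibler divergence), so both sides above are $\le 0$.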
Return to the definition (0), and now suppose that $\nu$ is normalized (i.e. a probability measure), which means that
$$\mathbf{E}_\mu[Y] := \int Y \, d\mu = 1$$
for the random variable $Y := \frac{d\nu}{d\mu}$. Note that if $\mu$ is also a probability measure, then $Y$ is in fact the likelihood ratio having the unit expectation, hence
$$-H_\mu(\nu) = \mathbf{E}_\mu[Y \log Y] \ge \mathbf{E}_\mu[Y] \log \mathbf{E}_\mu[Y] = 0$$
due to the convexity of $\Phi(u) := u \log u$. If $Y$ isn’t normalized to one, then still $\mathbf{E}_\mu[\Phi(Y)] \ge \Phi(\mathbf{E}_\mu[Y])$, which motivates the following
The $\Phi$-entropy of a non-negative random variable $Y$ (‘improper likelihood ratio’) is
$$\mathrm{Ent}[Y] := \mathbf{E}[\Phi(Y)] - \Phi(\mathbf{E}[Y]) = \mathbf{E}[Y \log Y] - \mathbf{E}[Y] \log \mathbf{E}[Y].$$
Remark 1. Given a dominating measure $\mu$, any non-negative random variable $Y$ may be considered as the ‘improper likelihood ratio’ $Y = \frac{d\nu}{d\mu}$, where $\nu$ may be not normalized to 1, hence the definition.
Remark 2. $\mathrm{Ent}[\cdot]$ is a norm-like functional which captures the variation of random variables with respect to the common dominating measure. Indeed, as we have already demonstrated, $\mathrm{Ent}[Y] \ge 0$, with equality if and only if $Y$ is a fixed constant (almost surely). Second, it is positively homogeneous:
$$\mathrm{Ent}[cY] = \mathbf{E}[cY \log(cY)] - \mathbf{E}[cY] \log \mathbf{E}[cY] = c\,\mathrm{Ent}[Y],$$
where $c \ge 0$ is a constant, and the terms containing $c \log c$ cancel out. As a result, we may always rescale $Y$ to the unit expectation. Finally, convexity of $Y \mapsto \mathrm{Ent}[Y]$ is also easy to check.
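These properties can be checked numerically. A sketch with an arbitrary discrete distribution (the probabilities and values below are hypothetical):

```python
import math
import random

random.seed(2)

def ent(values, probs):
    """Ent[Y] = E[Y log Y] - E[Y] log E[Y] for a discrete Y >= 0."""
    Ey = sum(p * y for p, y in zip(probs, values))
    EPhi = sum(p * y * math.log(y) for p, y in zip(probs, values) if y > 0)
    return EPhi - Ey * math.log(Ey)

probs = [0.2, 0.3, 0.5]
y = [random.uniform(0.1, 2.0) for _ in probs]

# Non-negativity (Jensen, since Phi(u) = u log u is convex), vanishing on
# constants, and positive homogeneity Ent[cY] = c Ent[Y].
assert ent(y, probs) >= 0
assert abs(ent([1.0] * 3, probs)) < 1e-12
c = 3.7
assert abs(ent([c * v for v in y], probs) - c * ent(y, probs)) < 1e-9
```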
We are now ready to state the tensorization result. Consider independent random variables $X_1, \ldots, X_n$, and for a non-negative random variable $Y = f(X_1, \ldots, X_n)$ let
$$\mathrm{Ent}_i[Y] := \mathbf{E}_i[Y \log Y] - \mathbf{E}_i[Y] \log \mathbf{E}_i[Y]$$
be the $\Phi$-entropy of $Y$ with respect to $X_i$ alone, where $\mathbf{E}_i[\cdot]$ denotes the expectation over $X_i$ with $(X_j)_{j \ne i}$ held fixed.
Theorem (Tensorization of $\Phi$-entropy)
For any non-negative function $Y = f(X_1, \ldots, X_n)$ of independent $X_1, \ldots, X_n$, it holds
$$\mathrm{Ent}[Y] \le \mathbf{E}\left[\sum_{i=1}^{n} \mathrm{Ent}_i[Y]\right].$$
Proof. First, without limiting generality, assume that $\mathbf{E}[Y] = 1$, i.e.
$$Y = \frac{dQ}{dP}$$
for some probability measure $Q$, where $P := P_{X_1} \otimes \cdots \otimes P_{X_n}$ is a product probability measure since $X_1, \ldots, X_n$ are independent. Rearranging Han’s inequality for relative entropies, we have
$$\mathrm{Ent}[Y] = -H_P(Q) \le \sum_{i=1}^{n} \left( H_{P^{(i)}}(Q^{(i)}) - H_P(Q) \right).$$
On the other hand, through some simple algebra we make sure that
$$H_{P^{(i)}}(Q^{(i)}) - H_P(Q) = \mathbf{E}\left[\mathrm{Ent}_i[Y]\right],$$
which gives the claim. The general statement for $\mathbf{E}[Y] \ne 1$ follows by scaling, due to Remark 2. ♦
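To close, here is a sketch verifying the tensorization inequality for two independent binary variables and a random non-negative $f$ (the marginals and values of $f$ are hypothetical):

```python
import itertools
import math
import random

random.seed(3)

# Independent X1, X2 with hypothetical marginals, and a random f >= 0.
p1, p2 = [0.3, 0.7], [0.6, 0.4]
pairs01 = list(itertools.product([0, 1], repeat=2))
f = {ab: random.uniform(0.1, 3.0) for ab in pairs01}

def ent(weighted):
    """Ent of a non-negative r.v. given as (probability, value) pairs."""
    Ey = sum(p * v for p, v in weighted)
    return sum(p * v * math.log(v) for p, v in weighted) - Ey * math.log(Ey)

# Left-hand side: Ent[f(X1, X2)] under the product measure.
lhs = ent([(p1[a] * p2[b], f[a, b]) for a, b in pairs01])

# Right-hand side: E[Ent_1[f] + Ent_2[f]], each Ent_i taken over X_i alone
# with the other variable held fixed.
E_ent1 = sum(p2[b] * ent([(p1[a], f[a, b]) for a in (0, 1)]) for b in (0, 1))
E_ent2 = sum(p1[a] * ent([(p2[b], f[a, b]) for b in (0, 1)]) for a in (0, 1))

assert lhs <= E_ent1 + E_ent2 + 1e-12    # tensorization of Phi-entropy
```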