Conditional entropy example. Entropy of a source of continuous messages

In considering Shannon's formula (3.3) for calculating the entropy of a random variable and the amount of information, we assumed that information about the random variable X reaches the observer directly. As a rule, however, we receive information not about the random variable X that interests us, but about some other variable Y that is related to X stochastically. Such a connection between random variables differs from a functional connection, in which each value of one variable corresponds to a single, well-defined value of the other. A stochastic (probabilistic) connection between two random variables X and Y means that a change in one of them affects the value of the other, but in such a way that, knowing the value of X, it is impossible to state exactly which value Y will take; one can only indicate the tendency of Y to change.

Let B be a random event and p(B) the probability of its occurrence. Denote by X a random variable that takes N different values (x_1, x_2, …, x_N), and by A_k the event that the random variable X takes the value x_k:

A_k = {X = x_k},   k = 1, 2, …, N.

We denote the probability of the event A_k by p(A_k). The probability that an event occurs may change depending on whether or not some other event has occurred. The probability p_B(A_k) of the event A_k, calculated under the assumption that the event B has occurred, is called the conditional probability of the event A_k; in this case

p_B(A_k) = p(A_k B) / p(B).

The events A_k and B are called independent if the probability of occurrence of A_k does not depend on whether B has occurred or not. This means that the conditional probability p_B(A_k) is equal to the "ordinary" (unconditional) probability p(A_k).

Definition. The conditional entropy of a random variable X under condition B is the quantity

H_B(X) = -\sum_{k=1}^{N} p_B(A_k) \log_2 p_B(A_k).    (4.2)

The difference from Shannon's formula (3.3) is that the conditional probabilities p_B(A_k) are used instead of the probabilities p(A_k).
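As an illustration (not part of the original text), here is a minimal Python sketch that evaluates (4.2) for an invented set of conditional probabilities p_B(A_k); the probability values are purely hypothetical.

    import math

    def conditional_entropy_given_event(p_B_given, base=2.0):
        """H_B(X) by formula (4.2): entropy of X computed from the
        conditional probabilities p_B(A_k) of its values given that B occurred."""
        return -sum(p * math.log(p, base) for p in p_B_given if p > 0)

    # Hypothetical conditional distribution of X given that the event B occurred.
    p_B_A = [0.7, 0.2, 0.1]
    print(conditional_entropy_given_event(p_B_A))   # ~1.157 bits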

Now let Y be another random variable taking the values (y_1, y_2, …, y_M). Denote by B_j the event that the random variable Y takes the value y_j:

B_j = {Y = y_j},   j = 1, 2, …, M.

We denote the probability of the event B_j by p(B_j).

Definition. The conditional entropy of the random variable X for a given (observed) random variable Y is the quantity H_Y(X):

H_Y(X) = \sum_{j=1}^{M} p(B_j) H_{B_j}(X).    (4.3)

Let us transform formula (4.3) by substituting (4.2) with B = B_j:

H_Y(X) = \sum_{j=1}^{M} p(B_j) \left( -\sum_{k=1}^{N} p_{B_j}(A_k) \log_2 p_{B_j}(A_k) \right) = -\sum_{j=1}^{M} \sum_{k=1}^{N} p(B_j)\, p_{B_j}(A_k) \log_2 p_{B_j}(A_k).

Since p(B_j) p_{B_j}(A_k) = p(A_k B_j), formula (4.3) takes the form:

H_Y(X) = -\sum_{j=1}^{M} \sum_{k=1}^{N} p(A_k B_j) \log_2 p_{B_j}(A_k).    (4.4)
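A small Python sketch (with an invented joint distribution p(A_k B_j)) can be used to check that the weighted form (4.3) and the joint-probability form (4.4) give the same value of H_Y(X):

    import math

    # Hypothetical joint probabilities p(A_k B_j): rows = values of X, columns = values of Y.
    joint = [[0.3, 0.1],
             [0.1, 0.2],
             [0.1, 0.2]]

    p_Y = [sum(row[j] for row in joint) for j in range(2)]    # p(B_j) = column sums

    # Form (4.3): H_Y(X) = sum_j p(B_j) * H_{B_j}(X)
    h_via_43 = 0.0
    for j, pbj in enumerate(p_Y):
        cond = [row[j] / pbj for row in joint]                # p_{B_j}(A_k)
        h_via_43 += pbj * -sum(p * math.log2(p) for p in cond if p > 0)

    # Form (4.4): H_Y(X) = -sum_{j,k} p(A_k B_j) * log2 p_{B_j}(A_k)
    h_via_44 = -sum(joint[k][j] * math.log2(joint[k][j] / p_Y[j])
                    for k in range(3) for j in range(2) if joint[k][j] > 0)

    print(round(h_via_44, 6), abs(h_via_43 - h_via_44) < 1e-12)   # the two forms agree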

Let us calculate the amount of information about the random variable X obtained by observing the random variable Y. This amount of information I(X,Y) is equal to the decrease in the entropy of X when Y is observed:

I(X,Y) = H(X) - H_Y(X).    (4.5)

Let us substitute the expressions for H(X) and H_Y(X) into (4.5):

I(X,Y) = -\sum_{k=1}^{N} p(A_k) \log_2 p(A_k) + \sum_{j=1}^{M} \sum_{k=1}^{N} p(A_k B_j) \log_2 p_{B_j}(A_k).
In the first sum we replace p(A_k) = p(A_k B_1) + p(A_k B_2) + … + p(A_k B_M). This equality indeed holds, because the events A_k B_1, A_k B_2, …, A_k B_M are pairwise incompatible, one of them necessarily occurs whenever A_k occurs, and, conversely, if one of the events A_k B_j occurs, then A_k occurs. Continuing the transformations (and using p_{B_j}(A_k) = p(A_k B_j)/p(B_j)), we obtain the formula for calculating the amount of information about a random variable X when observing another random variable Y:

I(X,Y) = \sum_{j=1}^{M} \sum_{k=1}^{N} p(A_k B_j) \log_2 \frac{p(A_k B_j)}{p(A_k)\, p(B_j)}.    (4.6)
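For illustration, here is a Python sketch (the joint probabilities are invented) that evaluates (4.6) and checks two of the properties discussed just below: I(X,Y) = I(Y,X), and I(X,Y) = 0 for independent variables.

    import math

    def mutual_information(joint, base=2.0):
        """I(X,Y) by formula (4.6), computed from the matrix of joint probabilities p(A_k B_j)."""
        p_x = [sum(row) for row in joint]                  # p(A_k): row sums
        p_y = [sum(col) for col in zip(*joint)]            # p(B_j): column sums
        return sum(p * math.log(p / (p_x[k] * p_y[j]), base)
                   for k, row in enumerate(joint)
                   for j, p in enumerate(row) if p > 0)

    dependent   = [[0.3, 0.1],
                   [0.1, 0.5]]
    independent = [[0.2, 0.2],
                   [0.3, 0.3]]        # here p(A_k B_j) = p(A_k) * p(B_j)

    print(mutual_information(dependent))                   # > 0
    transposed = [list(col) for col in zip(*dependent)]
    print(mutual_information(transposed))                  # same value: I(X,Y) = I(Y,X)
    print(round(mutual_information(independent), 12))      # 0.0 for independent variables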

If the random variables (or events) are independent, then the relation p(A_k B_j) = p(A_k) p(B_j) holds for them: the probability of the joint occurrence of two events is equal to the product of their probabilities.

Regarding the value I(X,Y), the following statements are true.

For independent random variables we obtain I(X,Y) = 0, since in this case every logarithm in (4.6) equals \log_2 1 = 0. This means that observing the random variable Y provides no advantage in obtaining information about the random variable X.

In the other cases I(X,Y) > 0, and the following inequality holds:

I(X,Y) ≤ H(X).

Equality is achieved when the observed value of Y uniquely determines X, i.e., when there is an (invertible) functional connection Y = F(X). In this case observing Y gives complete information about X. If Y = X, then I(X,X) = H(X).

The quantity I(X,Y) is symmetric: I(X,Y) = I(Y,X). This means that observing the random variable Y provides the same amount of information about the random variable X as observing X provides about Y. If we consider two random variables that are stochastically dependent, then by means of information theory it is impossible to establish which of them is the cause and which is the effect.

For further presentation we will need some known information from probability theory.

1) Properties of probabilities for an ensemble of random events A and B:

P(A,B) = P(A)·P(B/A)  →  P(B/A) = P(A,B)/P(A);

P(A,B) = P(B)·P(A/B)  →  P(A/B) = P(A,B)/P(B);

P(A/B) = P(A)·P(B/A)/P(B);    P(B/A) = P(B)·P(A/B)/P(A);

If A and B are independent, then

P(A/B) = P(A);    P(B/A) = P(B);

P(A,B) = P(A)·P(B).
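A quick numeric check of these identities in Python (the event probabilities below are made up for the example):

    # Made-up probabilities for two events A and B.
    p_A, p_B_given_A, p_B_given_notA = 0.4, 0.75, 0.5

    p_AB = p_A * p_B_given_A                         # P(A,B) = P(A)*P(B/A)
    p_B  = p_AB + (1 - p_A) * p_B_given_notA         # total probability of B
    p_A_given_B = p_A * p_B_given_A / p_B            # Bayes: P(A/B) = P(A)*P(B/A)/P(B)

    print(round(p_AB, 6), round(p_B, 6), round(p_A_given_B, 6))   # 0.3, 0.6, 0.5
    print(abs(p_A_given_B - p_AB / p_B) < 1e-12)                  # P(A/B) = P(A,B)/P(B)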

Once again, the definition of Shannon entropy for a source of discrete messages:

H(A) = -\sum_{i=1}^{N} p(a_i) \log_2 p(a_i).

Its properties:

H ≥ 0;

H_max = \log_2 N (reached when all N symbols are equiprobable);

for independent sources, H(A,B) = H(A) + H(B).
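A short Python sketch (with invented probabilities) illustrating these properties:

    import math

    def H(probs):
        """Shannon entropy in bits of a discrete probability distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    p_A = [0.5, 0.25, 0.125, 0.125]
    print(H(p_A))                      # 1.75 bits, which is <= log2(4) = 2
    print(H([0.25] * 4))               # 2.0 bits: the maximum for N = 4

    # Additivity for independent sources: H(A,B) = H(A) + H(B)
    p_B = [0.6, 0.4]
    joint = [pa * pb for pa in p_A for pb in p_B]
    print(abs(H(joint) - (H(p_A) + H(p_B))) < 1e-12)   # True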

CONDITIONAL ENTROPY

If the states of the system elements do not depend on one another, or if the state of one system does not depend on the state of another system, then the uncertainty that some element of the system (or some system) will be in one of its possible states is completely determined by the probabilistic characteristics of the individual elements of the system. In this case the specific amount of information per state of a system element, or per message symbol, is called the average entropy, and it is calculated using the expression

H = -\sum_{i} p_i \log_2 p_i.

When calculating the average amount of information per message symbol, interdependence is taken into account through conditional probabilities of the occurrence of some events relative to others, and the resulting entropy is called conditional entropy.

Let us consider the transmission of messages from a source of random symbols A through an information transmission channel. With reliable (noise-free) transmission it is assumed that sending the symbol a_1 yields b_1, a_2 yields b_2, and so on. In a channel with interference the transmission is distorted, and when the symbol b_1 is received we can only speak of the probability that the symbol a_1 was transmitted; it may well be that a_2, a_3, etc. was transmitted instead.

The distortions are described by the matrix of conditional channel probabilities P(A/B) = {p(a_i / b_j)}.

Let's consider the process of transmitting signals over a communication channel with noise and use it to understand the mechanism for calculating conditional entropy.

If the message source produces the symbols

a_1, a_2, …, a_i, …, a_n

with probabilities, respectively,

p(a_1), p(a_2), …, p(a_i), …, p(a_n),

and at the output of the transmission channel we receive the symbols

b_1, b_2, …, b_j, …, b_n

with probabilities, respectively,

p(b_1), p(b_2), …, p(b_j), …, p(b_n),

then the conditional entropy H(B/a_i) expresses the uncertainty about which b_j will be received when a_i is sent, and H(A/b_j) expresses the uncertainty that remains, after b_j has been received, about exactly which a_i was sent. If there is interference in the communication channel, then any of the signals b_j can be received with varying probability and, conversely, a received signal b_j may result from the sending of any of the signals a_i. If there is no interference in the communication channel, the sent symbol a_1 always corresponds to the received symbol b_1, a_2 to b_2, …, a_n to b_n.

In this case, the entropy of the message source H(A) is equal to the entropy of the message receiver H(B). If there is interference in the communication channel, it destroys or distorts part of the transmitted information.

Information losses are completely described by the partial and total conditional entropies, and it is convenient to calculate them using channel matrices (for brevity, the term "channel matrix" means a matrix that statistically describes the given communication channel). If the communication channel is described from the side of the message source (i.e., the sent signal is known), then the probability that, when the signal a_i is transmitted through a channel with interference, the signal b_j will be received is the conditional probability p(b_j/a_i), and the channel matrix has the form

        b_1           b_2           …    b_j           …    b_n
a_1     p(b_1/a_1)    p(b_2/a_1)    …    p(b_j/a_1)    …    p(b_n/a_1)
a_2     p(b_1/a_2)    p(b_2/a_2)    …    p(b_j/a_2)    …    p(b_n/a_2)
…
a_i     p(b_1/a_i)    p(b_2/a_i)    …    p(b_j/a_i)    …    p(b_n/a_i)
…
a_n     p(b_1/a_n)    p(b_2/a_n)    …    p(b_j/a_n)    …    p(b_n/a_n)

The probabilities located along the main diagonal determine the probabilities of correct reception, the rest of erroneous reception. The values filling the columns of the channel matrix usually decrease with distance from the main diagonal, and in the complete absence of interference all entries except those on the main diagonal are equal to zero.

The passage of the symbol a_i through the given communication channel is described, from the side of the message source, by the distribution of conditional probabilities p(b_j/a_i), whose sum must always be equal to one. For example, for the signal a_1:

\sum_{j=1}^{n} p(b_j / a_1) = 1.

The information losses attributable to the signal a_i are described by the partial conditional entropy. For example, for the signal a_1:

H(B/a_1) = -\sum_{j=1}^{n} p(b_j / a_1) \log_2 p(b_j / a_1).

The summation is carried out over j, because the i-th state (in this case the first) remains fixed.

The transmission losses of all signals over the given communication channel are described by the total conditional entropy. To calculate it, all the partial conditional entropies should be summed, i.e., a double summation over i and over j should be performed.

If the message source symbols occur with unequal probabilities, the probability of the appearance of each symbol should be taken into account by multiplying the corresponding partial conditional entropy by it. In this case the total conditional entropy is

H(B/A) = -\sum_{i=1}^{n} p(a_i) \sum_{j=1}^{n} p(b_j / a_i) \log_2 p(b_j / a_i).
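A Python sketch of this calculation, using an invented 3×3 channel matrix p(b_j/a_i) (each row sums to one) and invented source probabilities:

    import math

    # Hypothetical channel matrix: row i holds p(b_j / a_i); each row sums to 1.
    channel = [[0.90, 0.05, 0.05],
               [0.10, 0.80, 0.10],
               [0.05, 0.15, 0.80]]
    p_a = [0.5, 0.3, 0.2]                    # source symbol probabilities

    def partial_conditional_entropy(row):
        """H(B/a_i): the uncertainty of the received symbol for one sent symbol a_i."""
        return -sum(p * math.log2(p) for p in row if p > 0)

    for i, row in enumerate(channel):
        print(f"H(B/a_{i+1}) =", round(partial_conditional_entropy(row), 4))

    # Total conditional entropy: the partial entropies averaged with the weights p(a_i).
    H_B_given_A = sum(pa * partial_conditional_entropy(row)
                      for pa, row in zip(p_a, channel))
    print("H(B/A) =", round(H_B_given_A, 4))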

If we examine the situation from the side of the message receiver (that is, when the received signal is known), then upon reception of the symbol b_j it is assumed that one of the symbols a_1, a_2, …, a_i, …, a_m was sent. In this case the channel matrix is composed of the conditional probabilities p(a_i/b_j) and has the analogous form.

In this case the sums of the conditional probabilities must equal one not over the rows but over the columns of the channel matrix:

\sum_{i=1}^{m} p(a_i / b_j) = 1.

The partial conditional entropy is

H(A/b_j) = -\sum_{i=1}^{m} p(a_i / b_j) \log_2 p(a_i / b_j),

and the total conditional entropy is

H(A/B) = -\sum_{j} p(b_j) \sum_{i=1}^{m} p(a_i / b_j) \log_2 p(a_i / b_j).

The total conditional entropy of system B relative to system A characterizes the amount of information contained in any symbol of the message source by means of which we represent the states of the elements of the systems under study.

The total conditional entropy is determined by averaging over all symbols, i.e., over all states a_i, taking into account the probability of occurrence of each of them. It is equal to the sum of the products of the probabilities of the appearance of the source symbols and the uncertainty that remains after the addressee has received the symbols:

H(B/A) = \sum_{i} p(a_i) H(B/a_i) = -\sum_{i} p(a_i) \sum_{j} p(b_j / a_i) \log_2 p(b_j / a_i).

If there is no interference in the communication channel, then all elements of the channel matrix except those on the main diagonal are equal to zero. This means that when a_1 is transmitted we will certainly receive b_1, when a_2 is transmitted we receive b_2, …, and when a_m is transmitted we receive b_m. The probability of receiving the correct signal becomes unconditional, and the conditional entropy is equal to zero.

Conditional entropy reaches its maximum when, upon transmission of the symbol a_i, any of the signals b_1, b_2, …, b_m can be received with equal probability.
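These two limiting cases are easy to check with a short Python sketch (hypothetical matrices again): an identity channel matrix gives zero conditional entropy, while rows of equal probabilities give the maximum, log2(m).

    import math

    def H_row(row):
        """Partial conditional entropy H(B/a_i) of one channel-matrix row."""
        return -sum(p * math.log2(p) for p in row if p > 0)

    noiseless = [[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]]          # noise-free channel
    worst     = [[1/3, 1/3, 1/3]] * 3      # every b_j equally likely for each a_i

    print([H_row(r) for r in noiseless])         # [0.0, 0.0, 0.0]
    print([round(H_row(r), 3) for r in worst])   # [1.585, 1.585, 1.585] = log2(3)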

Information entropy

Entropy (informational) is a measure of the randomness of information, the uncertainty of the appearance of any symbol of the primary alphabet. In the absence of information losses it is numerically equal to the amount of information per symbol of the transmitted message.

For example, in the sequence of letters that make up a sentence in Russian, different letters appear with different frequencies, so the uncertainty of appearance is smaller for some letters than for others. If one also takes into account that some combinations of letters (in this case one speaks of entropy of n-th order; see below) occur very rarely, the uncertainty is reduced even further.

To illustrate the concept of information entropy, one can also resort to an example from the field of thermodynamic entropy, called Maxwell's demon. The concepts of information and entropy have deep connections with each other, but despite this, the development of theories in statistical mechanics and information theory took many years to make them consistent with each other.

Formal definitions

Definition via self-information

The entropy of a random variable can also be defined by first introducing the distribution of a random variable X having a finite number of values,

P_X(x_i) = p_i,   p_i ≥ 0,   \sum_{i=1}^{n} p_i = 1,

and its self-information:

I(X) = -\log P_X(X).

The entropy is then defined as

H(X) = E[I(X)] = -\sum_{i=1}^{n} p_i \log p_i.

The unit of measurement of information and entropy depends on the base of the logarithm: bit, nat or hartley.
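A small Python sketch of this definition (the distribution is invented for the example), showing how the unit changes with the base of the logarithm:

    import math

    p = [0.5, 0.25, 0.25]                    # hypothetical distribution of X

    def self_information(p_x, base=2.0):
        """I(x) = -log p(x): the information carried by a single outcome."""
        return -math.log(p_x, base)

    def entropy(probs, base=2.0):
        """H(X) = E[I(X)]: the expectation of the self-information."""
        return sum(px * self_information(px, base) for px in probs if px > 0)

    print(entropy(p, 2))          # 1.5    bits
    print(entropy(p, math.e))     # ~1.04  nats
    print(entropy(p, 10))         # ~0.45  hartleys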

The information entropy of independent random events x with n possible states (numbered from 1 to n) is calculated by the formula

H(x) = -\sum_{i=1}^{n} p_i \log_2 p_i.

This quantity is also called the average message entropy. The quantity H_i = -\log_2 p_i is called the partial entropy, characterizing only the i-th state.

Thus, the entropy of the event x is the sum, taken with the opposite sign, of all the products of the relative frequencies of occurrence of the states i and their binary logarithms (base 2 is chosen only for the convenience of working with information presented in binary form). This definition for discrete random events can be extended to a probability distribution function.

In general, the b-ary entropy (where b = 2, 3, …) of a source with the original alphabet and discrete probability distribution, where p_i is the probability of a_i (p_i = p(a_i)), is determined by the formula

H_b(P) = -\sum_{i=1}^{n} p_i \log_b p_i.

The definition of Shannon entropy is related to the concept of thermodynamic entropy. Boltzmann and Gibbs did a great deal of work in statistical thermodynamics, which contributed to the adoption of the word "entropy" in information theory. There is a connection between thermodynamic and information entropy: for example, Maxwell's demon likewise contrasts thermodynamic entropy with information, and gaining any amount of information is equal to the entropy lost.

Alternative definition

Another way of defining the entropy function H is to prove that H is uniquely determined (as stated earlier) if and only if H satisfies a set of natural conditions (in Shannon's formulation: continuity in the probabilities, monotonic growth with n for the uniform distribution, and additivity when a choice is decomposed into successive choices).

Properties

It is important to remember that entropy is a quantity defined in the context of a probabilistic model of the data source. For example, tossing a fair coin has entropy -2(0.5 \log_2 0.5) = 1 bit per toss (assuming the tosses are independent). A source that generates a string consisting only of the letter "A" has zero entropy: -1·\log_2 1 = 0. It can be established experimentally that, for example, the entropy of English text is about 1.5 bits per character, which of course varies for different texts. The entropy of a data source is the average number of bits per data element required to encode it without loss of information, under optimal encoding.

  1. Some data bits may not carry information. For example, data structures often store redundant information, or have identical sections regardless of the information in the data structure.
  2. The amount of entropy is not always expressed as an integer number of bits.
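A quick Python check of the coin-toss and constant-source examples above, which also shows that entropy need not be an integer number of bits:

    import math

    def H(probs):
        """Shannon entropy in bits."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(H([0.5, 0.5]))   # 1.0 bit per toss of a fair coin
    print(H([1.0]))        # 0.0: a source that emits only the letter "A"
    print(H([0.9, 0.1]))   # ~0.469: not an integer number of bits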

Mathematical properties

Efficiency

An alphabet encountered in practice has a probability distribution that is far from optimal. If the original alphabet has n symbols, it can be compared with an "optimized alphabet" whose probability distribution is uniform. The ratio of the entropy of the original alphabet to the entropy of the optimized one is the efficiency of the original alphabet, and it can be expressed as a percentage.

It follows that the efficiency of an original alphabet with n symbols can be defined simply as its n-ary entropy.

Entropy limits the maximum possible lossless (or nearly lossless) compression, which can be realized theoretically using the typical set or, in practice, using Huffman coding, Lempel–Ziv–Welch coding, or arithmetic coding.
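A minimal sketch of the efficiency calculation in Python (the letter frequencies are invented):

    import math

    freqs = {"a": 0.45, "b": 0.30, "c": 0.15, "d": 0.10}   # hypothetical alphabet

    H = -sum(p * math.log2(p) for p in freqs.values())
    H_max = math.log2(len(freqs))                # entropy of the "optimized" uniform alphabet

    print(round(H, 3), round(H_max, 3))             # 1.782 vs 2.0
    print(f"efficiency = {100 * H / H_max:.1f}%")   # ~89.1%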

Variations and generalizations

Conditional entropy

If the characters of the alphabet are not independent (for example, in French the letter "q" is almost always followed by "u", and in Soviet newspapers the word "advanced" was usually followed by "production" or "labour"), the amount of information carried by a sequence of such symbols (and hence the entropy) is obviously smaller. Conditional entropy is used to account for such facts.

The first-order conditional entropy (similarly, for a first-order Markov model) is the entropy for an alphabet in which the probabilities of one letter appearing after another are known (that is, the probabilities of two-letter combinations):

H_1(S) = -\sum_{i} p_i \sum_{j} p_i(j) \log_2 p_i(j),

where i is a state that depends on the preceding character, and p_i(j) is the probability of j given that i was the preceding character.

So, for example, for the Russian language without the letter "ё" (32 letters) the first-order conditional entropy is noticeably lower than the maximum value \log_2 32 = 5 bits per letter.
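A Python sketch of the first-order conditional entropy estimated from bigram statistics of a toy string (the text is invented; realistic language estimates require a large corpus):

    import math
    from collections import Counter

    text = "abracadabra abracadabra"        # toy corpus, for illustration only
    pairs = list(zip(text, text[1:]))

    uni = Counter(text[:-1])                # counts of the preceding character i
    bi = Counter(pairs)                     # counts of two-letter combinations (i, j)

    H1 = 0.0
    total = len(pairs)
    for i, n_i in uni.items():
        p_i = n_i / total
        # conditional distribution p_i(j) of the next character given i
        cond = [n / n_i for (a, b), n in bi.items() if a == i]
        H1 += p_i * -sum(p * math.log2(p) for p in cond)

    print(round(H1, 3))                     # bits per character, given the previous one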

Information losses during data transmission over a noisy channel are fully described by the partial and total conditional entropies. So-called channel matrices are used for this purpose. To describe the losses from the side of the source (that is, when the sent signal is known), one considers the conditional probability p(b_j/a_i) that the receiver obtains the symbol b_j given that the symbol a_i was sent. In this case the channel matrix has the following form:

        b_1           b_2           …    b_j           …    b_m
a_1     p(b_1/a_1)    p(b_2/a_1)    …    p(b_j/a_1)    …    p(b_m/a_1)
a_2     p(b_1/a_2)    p(b_2/a_2)    …    p(b_j/a_2)    …    p(b_m/a_2)
…
a_i     p(b_1/a_i)    p(b_2/a_i)    …    p(b_j/a_i)    …    p(b_m/a_i)
…
a_m     p(b_1/a_m)    p(b_2/a_m)    …    p(b_j/a_m)    …    p(b_m/a_m)

Obviously, the probabilities located along the diagonal describe the probability of correct reception, and the sum of the elements of a column, weighted by the source probabilities, gives the probability of the corresponding symbol appearing on the receiver's side: p(b_j) = \sum_{i} p(a_i) p(b_j/a_i). The losses attributable to the transmitted signal a_i are described by the partial conditional entropy:

H(B/a_i) = -\sum_{j=1}^{m} p(b_j / a_i) \log_2 p(b_j / a_i).

To calculate the transmission losses of all signals, the total conditional entropy is used:

H(B/A) = \sum_{i=1}^{m} p(a_i) H(B/a_i) = -\sum_{i=1}^{m} p(a_i) \sum_{j=1}^{m} p(b_j / a_i) \log_2 p(b_j / a_i).

H(B/A) is the entropy from the side of the source; the entropy from the side of the receiver, H(A/B), is treated in a similar way: p(a_i/b_j) is written everywhere instead of p(b_j/a_i) (summing the elements of a row with the weights p(b_j) gives p(a_i), and the diagonal elements give the probability that exactly the symbol that was received had been sent, i.e., the probability of correct transmission).

Mutual entropy

Mutual entropy, or joint entropy, is intended for calculating the entropy of interconnected systems (the entropy of the joint occurrence of statistically dependent messages) and is denoted H(AB), where A, as before, characterizes the transmitter and B the receiver.

The relationship between the transmitted and received signals is described by the probabilities of the joint events p(a_i b_j), and only one matrix is required for a complete description of the channel characteristics:

p(a_1 b_1)   p(a_1 b_2)   …   p(a_1 b_j)   …   p(a_1 b_m)
p(a_2 b_1)   p(a_2 b_2)   …   p(a_2 b_j)   …   p(a_2 b_m)
…
p(a_i b_1)   p(a_i b_2)   …   p(a_i b_j)   …   p(a_i b_m)
…
p(a_m b_1)   p(a_m b_2)   …   p(a_m b_j)   …   p(a_m b_m)

For the more general case, when it is not a channel but simply interacting systems that are being described, the matrix need not be square. Obviously, the sum of all elements of the column with number j gives p(b_j), the sum of the elements of row i gives p(a_i), and the sum of all elements of the matrix is equal to 1. The joint probability p(a_i b_j) of the events a_i and b_j is calculated as the product of the original and the conditional probability:

p(a_i b_j) = p(a_i) p(b_j / a_i) = p(b_j) p(a_i / b_j).

The conditional probabilities are obtained by Bayes' formula. Thus, all the data are available for calculating the entropies of the source and the receiver:

H(A) = -\sum_{i} \Big( \sum_{j} p(a_i b_j) \Big) \log_2 \sum_{j} p(a_i b_j),

H(B) = -\sum_{j} \Big( \sum_{i} p(a_i b_j) \Big) \log_2 \sum_{i} p(a_i b_j).

Mutual entropy is calculated by successively summing over the rows (or over the columns) all the probabilities of the matrix multiplied by their logarithms:

H(AB) = -\sum_{i} \sum_{j} p(a_i b_j) \log_2 p(a_i b_j).

The unit of measurement is the bit per two symbols; this is explained by the fact that mutual entropy describes the uncertainty per pair of symbols: the sent one and the received one. By simple transformations we also obtain

H(AB) = H(A) + H(B/A) = H(B) + H(A/B).

Mutual entropy has the property of informational completeness: all the quantities under consideration can be obtained from it.
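A closing Python sketch (the joint probabilities are invented) illustrating this completeness: the marginal distributions, the conditional entropy and the identity H(AB) = H(A) + H(B/A) are all derived from the single joint matrix.

    import math

    def H(probs):
        """Shannon entropy in bits."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical joint matrix p(a_i b_j); every other quantity is derived from it.
    P = [[0.30, 0.05, 0.05],
         [0.05, 0.25, 0.05],
         [0.05, 0.05, 0.15]]

    p_a = [sum(row) for row in P]                   # p(a_i): row sums
    p_b = [sum(col) for col in zip(*P)]             # p(b_j): column sums

    H_AB = H([p for row in P for p in row])         # mutual (joint) entropy H(AB)

    # H(B/A) computed directly from the conditional probabilities p(b_j/a_i) = p(a_i b_j)/p(a_i)
    H_B_given_A = -sum(P[i][j] * math.log2(P[i][j] / p_a[i])
                       for i in range(3) for j in range(3) if P[i][j] > 0)

    print(round(H_AB, 4), round(H(p_a), 4), round(H_B_given_A, 4))
    print(abs(H_AB - (H(p_a) + H_B_given_A)) < 1e-12)   # H(AB) = H(A) + H(B/A)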


