Since P(x_j, y_k) = P(x_j)P(y_k|x_j), we have H(X|Y) = H(X,Y) - H(Y).
The information transmitted by translating the variable X
into the channel transmission variable Y - possibly with error
- and then retranslating without error the transmitted Y back
into X is defined as
I(X|Y) ≡ H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y).   (27)
See, for example, Ash (1990), Khinchin (1957) or Cover and
Thomas (1991) for details. The essential point is that if there
is no uncertainty in X given the channel Y , then there is no
loss of information through transmission. In general this will
not be true, and herein lies the essence of the theory.
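For concreteness, the short Python sketch below (the two 2x2 joint distributions are invented purely for illustration) evaluates the uncertainties appearing in (27). The noiseless case shows the point just made: there Y determines X exactly, so H(X|Y) = 0 and the full uncertainty H(X) is transmitted.

import numpy as np

def entropy(p):
    """Shannon uncertainty in bits; zero-probability cells are dropped."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def information(joint):
    """I(X|Y) = H(X) + H(Y) - H(X,Y) for a joint distribution P(x_j, y_k)."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(joint.ravel())

# Invented examples: a noisy channel and a noiseless one (Y determines X exactly).
noisy     = np.array([[0.30, 0.20], [0.10, 0.40]])
noiseless = np.array([[0.50, 0.00], [0.00, 0.50]])

for name, joint in (("noisy", noisy), ("noiseless", noiseless)):
    h_x = entropy(joint.sum(axis=1))
    h_x_given_y = entropy(joint.ravel()) - entropy(joint.sum(axis=0))  # H(X|Y) = H(X,Y) - H(Y)
    print(f"{name:9s}  H(X)={h_x:.3f}  H(X|Y)={h_x_given_y:.3f}  I(X|Y)={information(joint):.3f}")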
Given a fixed vocabulary for the transmitted variable X ,
and a fixed vocabulary and probability distribution for the
channel Y , we may vary the probability distribution of X in
such a way as to maximize the information sent. The capacity
of the channel is defined as
C ≡ max_{P(X)} I(X|Y),   (28)

subject to the subsidiary condition that ∑ P(X) = 1.
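As a rough numerical sketch of (28), assuming a binary symmetric channel with crossover probability 0.1 (an invented example, not one from the text), one can hold the channel fixed and scan over input distributions P(X) for the maximizing one. The result, about 0.531 bits at the uniform input, agrees with the familiar closed form 1 - H(0.1).

import numpy as np

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(px, channel):
    """I(X|Y) = H(X) + H(Y) - H(X,Y) for input P(X) and channel matrix P(y|x)."""
    joint = px[:, None] * channel          # P(x, y) = P(x) P(y|x)
    return H(px) + H(joint.sum(axis=0)) - H(joint.ravel())

channel = np.array([[0.9, 0.1],            # row i is P(y | x = i); assumed example
                    [0.1, 0.9]])

# Vary P(X), subject to the subsidiary condition that it sums to one.
grid = np.linspace(0.0, 1.0, 1001)
values = [mutual_information(np.array([p, 1.0 - p]), channel) for p in grid]
best = int(np.argmax(values))
print(f"C ≈ {values[best]:.4f} bits at P(X=0) ≈ {grid[best]:.2f}")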
The critical trick of the Shannon Coding Theorem for sending a message with arbitrarily small error along the channel Y at any rate R < C is to encode it in longer and longer ‘typical’ sequences of the variable X; that is, those sequences whose distribution of symbols approximates the probability distribution P(X) above which maximizes C.

If S(n) is the number of such ‘typical’ sequences of length n, then

log[S(n)] ≈ nH(X),

where H(X) is the uncertainty of the stochastic variable defined above. Some consideration shows that S(n) is much less than the total number of possible messages of length n. Thus, as n → ∞, only a vanishingly small fraction of all possible messages is meaningful in this sense. This observation, after some considerable development, is what allows the Coding Theorem to work so well. In sum, the prescription is to encode messages in typical sequences, which are sent at very nearly the capacity of the channel. As the encoded messages become longer and longer, their maximum possible rate of transmission without error approaches channel capacity as a limit. Again, Ash (1990), Khinchin (1957) and Cover and Thomas (1991) provide details.
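A back-of-the-envelope check of log[S(n)] ≈ nH(X), using an invented three-letter source distribution: the typical set grows exponentially, yet its share of all possible length-n messages vanishes as n increases.

import math

P = [0.7, 0.2, 0.1]                          # assumed source distribution P(X)
H_X = -sum(p * math.log2(p) for p in P)      # H(X) ≈ 1.157 bits

for n in (10, 100, 1000):
    log_typical = n * H_X                    # log2 S(n) ≈ n H(X)
    log_total = n * math.log2(len(P))        # log2 of the count of all length-n messages
    print(f"n={n:4d}  log2 S(n) ≈ {log_typical:7.1f}  log2 total = {log_total:7.1f}  "
          f"typical fraction ≈ 2^{log_typical - log_total:.0f}")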
This approach can be, in a sense, inverted to give a tuning theorem which parsimoniously describes the essence of the Rate Distortion Manifold.

Telephone lines, optical wave guides, and the tenuous plasma through which a planetary probe transmits data to earth may all be viewed in traditional information-theoretic terms as a noisy channel around which we must structure a message so as to attain an optimal error-free transmission rate.

Telephone lines, wave guides, and interplanetary plasmas are, relatively speaking, fixed on the timescale of most messages, as are most other signaling networks. Indeed, the capacity of a channel is defined by varying the probability distribution of the ‘message’ process X so as to maximize I(X|Y).

Suppose there is some message X so critical that its probability distribution must remain fixed. The trick is to fix the distribution P(X) but modify the channel - i.e., tune it - so as to maximize I(X|Y). The dual channel capacity C* can be defined as

C* ≡ max_{P(Y), P(Y|X)} I(X|Y).   (29)

But
C* = max_{P(Y), P(Y|X)} I(Y|X),

since

I(X|Y) = H(X) + H(Y) - H(X,Y) = I(Y|X).
Thus, in a purely formal mathematical sense, the message
transmits the channel, and there will indeed be, according
to the Coding Theorem, a channel distribution P(Y) which maximizes C*.
One may do better than this, however, by modifying the
channel matrix P(Y|X). Since
P(y_j) = ∑_{i=1}^{M} P(x_i) P(y_j|x_i),

P(Y) is entirely defined by the channel matrix P(Y|X) for fixed P(X), and

C* = max_{P(Y), P(Y|X)} I(Y|X) = max_{P(Y|X)} I(Y|X).
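A minimal sketch of this tuning step, assuming a toy one-parameter family of 2x2 channel matrices (the family and the fixed message distribution are invented, not taken from the text): P(X) stays fixed, P(Y) follows from the marginalization above, and the search runs over the channel matrix P(Y|X) itself.

import numpy as np

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def I(px, channel):
    joint = px[:, None] * channel     # P(x_i, y_j) = P(x_i) P(y_j|x_i)
    py = joint.sum(axis=0)            # P(y_j) = sum_i P(x_i) P(y_j|x_i)
    return H(px) + H(py) - H(joint.ravel())

px = np.array([0.8, 0.2])             # the critical message distribution, held fixed

best_eps, best_val = None, -1.0
for eps in np.linspace(0.0, 0.5, 501):            # tune the crossover probability
    channel = np.array([[1 - eps, eps],
                        [eps, 1 - eps]])
    val = I(px, channel)
    if val > best_val:
        best_eps, best_val = eps, val

# For this family the optimum is the noiseless channel (eps = 0), where I = H(X) ≈ 0.722.
print(f"C* ≈ {best_val:.3f} bits at crossover ≈ {best_eps:.3f}")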
Calculating C* requires maximizing the complicated expression