Last time we considered the information in a single event and determined that a decent measure of that information might be:
I(e) = − log P(e)
Suppose, however, that we are not interested in the information in a single event, but in a source of information. We can think of this as a source of events, or a source generating signals. All that really matters is that there are a number of possibilities, and they appear with different frequencies. Let's use π to denote a source and pᵢ to denote the probability of the ith outcome amongst the n possible outcomes π might generate.
What can we say about the information generated by this source? Well, suppose we want a measure of the average amount of information in any signal coming from this source. This could mean something like "How chaotic is this source?" Suppose we are looking at fluctuations in background radiation—are they the result of chance (i.e. chaotic) or are they remnants of communications from an alien civilization?
An equivalent interpretation is this: "How efficient is the coding of this source's signal?" If there is lots of redundancy, then there is less information in each event generated by the source. For example, if the source is generating letters of the alphabet and it generates "th", then we would not be surprised if the next symbol is "e" (if the source is generating English text, that is). This is because there is a lot of redundancy in written English. Whenever the distribution over the next symbol departs from a uniform 1/26 for each letter, i.e. whenever some symbols are more likely than others, we receive less information when the likely symbols occur.
A third interpretation here is this: "How predictable is the source?" Suppose the source's signal is generated by the toss of a fair die: the symbol "x" is sent if the outcome is 1, 2, 3, 4, or 5, and the symbol "y" is sent if the outcome is 6. Then "x" should appear five times more frequently than "y", and we receive much less information whenever "x" occurs. So, if all we do is predict "x", we'll be right five times out of six.
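To put numbers on this, here is a minimal Python sketch (my own illustration, using base-2 logarithms so the unit is bits; the function name is hypothetical) applying the earlier measure I(e) = − log P(e) to this die-toss source:

```python
import math

def information(p):
    """Self-information I(e) = -log2 P(e), measured in bits."""
    return -math.log2(p)

# The die-toss source above: "x" has probability 5/6, "y" has probability 1/6.
p_x, p_y = 5 / 6, 1 / 6

print(f"I('x') = {information(p_x):.3f} bits")  # ~0.263 bits
print(f"I('y') = {information(p_y):.3f} bits")  # ~2.585 bits
```

The common symbol "x" carries only about a quarter of a bit, while the rare "y" carries over two and a half bits, which is exactly why a source dominated by a few likely symbols is so predictable.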
So, a chaotic source is an efficiently coded source is an unpredictable source. But how can we put a measure on this property? In "A Mathematical Theory of Communication" (1948), Claude Shannon argues that we need a function H which will satisfy three intuitive properties:
1. H should be continuous over the event probabilities, i.e. small changes in the probability distribution should produce only small changes in H.
2. If the probability of all events generated by a source is equal (i.e. for n events, the probability of each is 1/n), then the value of H should increase as the number of possibilities increases (intuitively, there is more information in a die toss than a coin flip).
3. If the end results (the sequences of events) are generated by different probabilistic processes, but the ultimate probability of each symbol is the same, then H should return the same value in each case. For example, consider two sources, π1 and π2. π1 first flips a fair coin. If it comes up heads, it sends "a"; if it comes up tails, π1 rolls a fair die and sends "b" if the outcome is 1, 2, 3, or 4 and "c" if the outcome is 5 or 6. π2 just rolls a fair die and sends "a" if the outcome is 1, 2, or 3; "b" if the outcome is 4 or 5; and "c" if the outcome is 6. Although the signals are generated by different mechanisms, in both cases "a" will appear with probability 1/2, "b" with probability 1/3, and "c" with probability 1/6. So, H should assign the same value to both π1 and π2.
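A quick simulation makes the third property vivid. The sketch below is my own illustration (not anything from Shannon's paper): it implements both generating mechanisms and checks that their empirical symbol frequencies agree.

```python
import random
from collections import Counter

def pi1():
    """Flip a fair coin; on tails, roll a fair die to choose between "b" and "c"."""
    if random.random() < 0.5:                        # heads
        return "a"
    return "b" if random.randint(1, 6) <= 4 else "c"

def pi2():
    """Roll a single fair die and map the outcome straight to a symbol."""
    roll = random.randint(1, 6)
    return "a" if roll <= 3 else ("b" if roll <= 5 else "c")

n = 100_000
for name, source in [("pi1", pi1), ("pi2", pi2)]:
    counts = Counter(source() for _ in range(n))
    print(name, {s: round(counts[s] / n, 3) for s in "abc"})
    # both print roughly {'a': 0.5, 'b': 0.333, 'c': 0.167}
```

Both mechanisms settle on the same distribution (1/2, 1/3, 1/6), so any sensible measure of a source's information should not tell them apart.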
Shannon proves that all functions which satisfy these three desiderata will be of the form:

H = −K Σᵢ pᵢ log pᵢ

where the sum runs over the n possible outcomes, K is a positive constant (essentially, it amounts to a choice of scale), and the base of the logarithm represents a choice of units.
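Here is a minimal sketch of that formula in Python (taking K = 1 and base-2 logarithms, so H comes out in bits; these choices are mine, not required by anything above), applied to the sources discussed so far:

```python
import math

def entropy(probs, base=2):
    """H = -sum(p * log p); terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# pi1 and pi2 share the distribution (1/2, 1/3, 1/6), so they get the same H.
print(entropy([1/2, 1/3, 1/6]))  # ~1.459 bits

# Desideratum 2: a fair die is more informative than a fair coin.
print(entropy([1/6] * 6))        # ~2.585 bits
print(entropy([1/2, 1/2]))       # 1.0 bit

# The predictable "x"/"y" source from earlier falls well below one bit.
print(entropy([5/6, 1/6]))       # ~0.650 bits
```

The lopsided die source comes out far less chaotic, less efficiently coded, and more predictable than the uniform alternatives, which is just what the three interpretations above demand.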
Shannon called H the entropy of a source. This is because it happens to be the same equation as is used to calculate entropy in statistical mechanics (note, however, that the derivation of the equation is totally different). Is there any deep connection between entropy in information theory and entropy in statistical mechanics? This is an open question about which opinions differ wildly.
Thursday, July 30, 2009
Thursday, July 23, 2009
information i
Suppose we observe an event e — how much information do we acquire?
We'd like a function I such that for any e, I(e) tells us how much information is contained in the occurrence of e.
Intuitively, if the probability of e occurring is very low, then I(e) should be very high. We learn a lot when something unlikely occurs. If your math teacher stands on one leg, barks like a dog, turns around three times, then leaps through the window, you learn a lot about him.
If the probability of e occurring is very high, though, I(e) should be very small. In the limit, as the probability of e approaches 1, i.e. certainty, I(e) should approach zero — we don't learn very much when we see the sun rise in the morning (at least, not if we've been paying attention!).
In the limit in the other direction, as the probability of e approaches zero, the information we gain from observing e approaches infinity.
Finally, if two unconnected (i.e. independent) events occur together, then the information we receive about them is just the sum of the information we would have received had each of them occurred separately. (Remember that two events are independent if the probability of both of them occurring equals the probability of one times the probability of the other.) If my math teacher puts a banana on his head, and your English teacher, in a different state, smothers himself in jello, then we learn everything I would have learned separately about my math teacher, plus everything you would have learned separately about your English teacher.
So, we now have several constraints on the function I:
1. P(e1) < P(e2) implies I(e1) > I(e2)
2. P(e) = 1 implies I(e) = 0
3. P(e) = 0 implies I(e) = ∞
4. P(e1 ∧ e2) = P(e1) × P(e2) implies I(e1 ∧ e2) = I(e1) + I(e2)
One convenient analysis of information which satisfies these four constraints is
I(e) = − log P(e)
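As a sanity check, here is a small Python sketch (my own illustration, with base-2 logarithms so the unit is bits) confirming that I(e) = − log P(e) behaves the way the four constraints demand:

```python
import math

def I(p):
    """Information in an event of probability p: I = -log2 p, in bits."""
    return -math.log2(p)

# 1. Less probable events carry more information.
assert I(0.1) > I(0.9)

# 2. A certain event carries no information at all.
assert I(1.0) == 0.0

# 3. As P(e) approaches 0, I(e) grows without bound.
print([round(I(10 ** -k), 2) for k in (1, 3, 6)])  # [3.32, 9.97, 19.93]

# 4. For independent events, P(e1 and e2) = P(e1) * P(e2),
#    so the information adds: I(e1 and e2) = I(e1) + I(e2).
p1, p2 = 0.25, 0.5
assert math.isclose(I(p1 * p2), I(p1) + I(p2))
```

Constraint 3 shows up here as unbounded growth rather than a literal infinity, since − log p diverges only in the limit as p goes to 0.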