Suppose we observe an event e — how much information do we acquire?
We'd like a function I such that for any e, I(e) tells us how much information is contained in the occurrence of e.
Intuitively, if the probability of e occurring is very small, then I(e) should be very high. We learn a lot when something unlikely occurs. If your math teacher stands on one leg, barks like a dog, turns around three times, then leaps through the window, you learn a lot about him.
If the probability of e occurring is very high, though, I(e) should be very small. In the limit, as the probability of e approaches 1, i.e. certainty, I(e) should approach zero — we don't learn very much when we see the sun rise in the morning (at least, not if we've been paying attention!).
In the limit in the other direction, as the probability of e approaches zero, the information we gain from observing e approaches infinity.
Finally, if two unconnected (i.e. independent) events occur together, then the information we receive about them is just the sum of the information we would have received had each of them occurred separately. (Remember that two events are independent if the probability of both of them occurring equals the probability of one times the probability of the other.) If my math teacher puts a banana on his head, and your English teacher, in a different state, smothers himself in jello, then we learn everything I would have learned separately about my math teacher, plus everything you would have learned separately about your English teacher.
So, we now have several constraints on the function I:
1. P(e1) < P(e2) implies I(e1) > I(e2)
2. P(e) = 1 implies I(e) = 0
3. P(e) = 0 implies I(e) = ∞
4. P(e1 ∧ e2) = P(e1) × P(e2) implies I(e1 ∧ e2) = I(e1) + I(e2)
One convenient analysis of information which satisfies these four constraints is
I(e) = − log P(e)
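We can check that this function behaves as required with a minimal sketch in Python (not part of the original text), using base-2 logarithms so that information is measured in bits:

```python
import math

def information(p):
    """Self-information I(e) = -log2 P(e), in bits, for an event with probability p."""
    return -math.log2(p)

# Constraint 2: a certain event carries no information.
print(information(1.0))    # 0.0

# Constraint 1: the rarer the event, the more information.
print(information(0.5))    # 1.0 bit
print(information(0.125))  # 3.0 bits

# Constraint 4: for independent events, P(e1 ∧ e2) = P(e1) × P(e2),
# and the log turns that product into a sum of informations.
p1, p2 = 0.5, 0.25
print(information(p1 * p2))                  # 3.0
print(information(p1) + information(p2))     # 1.0 + 2.0 = 3.0
```

Constraint 3 also holds in the limit: as p shrinks toward 0, -log2(p) grows without bound (and `math.log2(0)` raises an error, matching the idea that a probability-zero event would carry infinite information).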