Meetup summary
2026-04-17 - Intro to information theory
Recommended reading:
None. However, here are some useful references if you’d like additional background or further reading. I’m not planning to drill into information theory beyond what we cover in today’s agenda (though the actual coverage may bleed into other sessions).
- The Mathematical Theory of Communication by Claude Shannon. This was the seminal introduction of information theory.
- Elements of Information Theory by Thomas M. Cover and Joy A. Thomas. I pulled McMillan’s generalization of Kraft’s inequality from this and also borrowed some notation. The basic Kraft inequality is “obvious” and I don’t recall where I learned it.
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. I borrowed the notion of bootstrapped “self-information” from here.
Agenda:
I’ll be covering the seeds of information theory today. Terminology and notation are somewhat non-standardized, but I’m planning to bootstrap this using a view that is mostly self-contained and clean, while leaning on some of the basic notions (and notations) of probability that we developed in our intro sessions. I’m not planning to expand on this content in more depth, as it should give us enough to get started with machine learning applications, but I do plan to eventually cover the whole agenda if we can’t get through it today.
- Self-information
- 3 axioms
- Brainstorming session (what functions could fit?)
- Shannon’s function
- Warning: not all resources/papers use the term “self-information” in the same way as we do here. It sometimes refers to entropy or related quantities.
- Entropy (of a distribution)
- Expectation of self-information over the entire (unconditional) sample space
- Intuitions about entropy (what “high” or “low” entropy implies; this will be useful for tying in conditional entropy and mutual information later)
- Extension: differential self-information and entropy.
- Caveats (unnormalized, unbounded, etc.): the differential versions lose some of the nice properties we got when deriving entropy in the first place.
- Conditional entropy (definition)
- Cross entropy (definition)
- Kullback-Leibler divergence (definition, as a difference between true distributional entropy and cross entropy of a target or “model” distribution)
- Claim: entropy represents the smallest possible average code length under some given distribution (assuming iid input messages)!
- Proof sketch of the code optimality claim:
- Kraft inequality (for “immediate” or prefix codes; reflects how you would code in practice)
- Kraft-McMillan inequality. This is a generalization of the above for all non-singular codes. It’s interesting in that it rules out the possibility of improving on optimality by using some convoluted coding strategy (e.g., at the expense of runtime complexity).
- Gibbs’ inequality (will include a self-contained and simple proof)
- Tying it all together: combining Gibbs’ inequality with Kraft-McMillan to constrain targets to a probability distribution.
- Proof sketch of optimality for differential (continuous) entropy: same as above, but substitute Jensen’s inequality for Gibbs’. I will not be proving Jensen’s inequality here as it requires more machinery, but I can give a sketch of how you would apply it as above.
- Mutual information (definition and interpretation in terms of other operations we know)
- Joint entropy. This is not a particularly useful notion, but I’m calling it out to avoid confusion around terminology. This should be distinguished from conditional entropy, cross entropy, and “relative entropy” (a little-used term for KL divergence).
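To make the self-information and entropy items above concrete, here’s a minimal Python sketch (function names and the toy distributions are my own; base-2 logs, so units are bits):

```python
import math

def self_information(p: float) -> float:
    """Shannon's self-information of an outcome with probability p, in bits.

    Rarer outcomes are more "surprising" and carry more information.
    """
    return -math.log2(p)

def entropy(dist: list[float]) -> float:
    """Entropy of a distribution: the expected self-information over the sample space."""
    return sum(p * self_information(p) for p in dist if p > 0)

# A fair coin has maximal entropy for two outcomes; a biased coin has less.
fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(entropy(fair))    # 1.0 bit
print(entropy(biased))  # ~0.469 bits
```

The `p > 0` guard reflects the usual convention that 0 · log 0 = 0 (the limit as p → 0).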
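The cross entropy and KL divergence definitions also reduce to a few lines, with KL expressed exactly as framed in the agenda: the difference between the cross entropy and the true distribution’s entropy (names and distributions here are made up for illustration):

```python
import math

def cross_entropy(p: list[float], q: list[float]) -> float:
    """Average bits needed to code samples from p using a code optimized for q."""
    return sum(pi * -math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) = H(p, q) - H(p): the extra bits paid for modeling p with q.

    Nonnegative by Gibbs' inequality, and zero iff p == q.
    """
    return cross_entropy(p, q) - cross_entropy(p, p)

p = [0.5, 0.5]
q = [0.8, 0.2]
print(kl_divergence(p, q))  # positive, since q != p
print(kl_divergence(p, p))  # 0.0
```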
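For the code-optimality claim, a quick numeric check of the Kraft inequality, using the binary prefix code {0, 10, 110, 111} as the example (the helper name is mine):

```python
def kraft_sum(code_lengths: list[int], alphabet_size: int = 2) -> float:
    """Sum of D^(-l_i) over codeword lengths l_i.

    A prefix code with these lengths exists iff the sum is <= 1 (Kraft);
    McMillan extends the same bound to all uniquely decodable codes.
    """
    return sum(alphabet_size ** -length for length in code_lengths)

# Lengths of the valid binary prefix code {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))  # 1.0 -- a "complete" code
# No binary prefix code can have four codewords of length 1:
print(kraft_sum([1, 1, 1, 1]))  # 2.0 > 1

# For the dyadic distribution p = (1/2, 1/4, 1/8, 1/8), those lengths are
# optimal: the average length 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75
# exactly equals the entropy, illustrating the claim.
```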
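Finally, a sketch of mutual information computed from a small joint distribution table (the function name and tables are made up; this follows the standard definition I(X; Y) = Σ p(x,y) log₂ p(x,y)/(p(x)p(y))):

```python
import math

def mutual_information(joint: list[list[float]]) -> float:
    """I(X; Y) from a joint probability table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]           # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]     # marginal of Y (column sums)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Independent variables share no information:
independent = [[0.25, 0.25], [0.25, 0.25]]
# Perfectly correlated binary variables share a full bit:
correlated = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))  # 0.0
print(mutual_information(correlated))   # 1.0
```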
(Apologies if I stumble through any of this. It’s been a long time since I’ve reviewed this material, but I’m confident enough about my grasp that I’m willing to give it a shot cold. 😅 This stuff is widely available online and I noted the primary resources I recall learning from.)
tags: