Meetup summary
2026-04-17 - Intro to information theory
Recommended reading:
None. However, here are some useful references if you’d like additional background or further reading. I’m not planning to drill into information theory beyond what we cover in today’s agenda (though the actual coverage may bleed into other sessions).
- The Mathematical Theory of Communication by Claude Shannon. This was the seminal introduction of information theory.
- Elements of Information Theory by Thomas M. Cover and Joy A. Thomas. I pulled McMillan’s generalization of Kraft’s inequality from this and also borrowed some notation. The basic Kraft inequality is “obvious” and I don’t recall where I learned it.
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. I borrowed the notion of bootstrapped “self-information” from here.
Agenda:
I’ll be covering the seeds of information theory today. Terminology and notation are somewhat non-standardized, but I’m planning to bootstrap this using a view that is mostly self-contained and clean, while leaning on some of the basic notions (and notations) of probability that we developed in our intro sessions. I’m not planning to expand on this content in more depth, as it should give us enough to get started with machine learning applications, but I do plan to eventually cover the whole agenda if we can’t get through it today.
- Self-information
- 3 axioms
- Brainstorming session (what functions could fit?)
- Shannon’s function
- Warning: not all resources/papers use the term “self-information” in the same way as we do here. It sometimes refers to entropy or related quantities.
- Entropy (of a distribution)
- Expectation of self-information over the entire (unconditional) sample space
- Intuitions about entropy (what “high” or “low” entropy implies; this will be useful for tying in conditional entropy and mutual information later)
- Extension: differential self-information and entropy.
- Caveats (unnormalized, unbounded, etc.): the differential versions lose some of the nice properties we got when deriving entropy in the first place.
- Conditional entropy (definition)
- Cross entropy (definition)
- Kullback-Leibler divergence (definition, as a difference between true distributional entropy and cross entropy of a target or “model” distribution)
- Claim: entropy represents the smallest possible average code length under some given distribution (assuming iid input messages)!
- Proof sketch of the code optimality claim:
- Kraft inequality (for “immediate” or prefix codes; reflects how you would code in practice)
- Kraft-McMillan inequality. This is a generalization of the above for all non-singular codes. It’s interesting in that it rules out the possibility of improving on optimality by using some convoluted coding strategy (e.g., at the expense of runtime complexity).
- Gibbs’ inequality (will include a self-contained and simple proof)
- Tying it all together: combining Gibbs’ inequality with Kraft-McMillan to constrain targets to a probability distribution.
- Proof sketch of optimality for differential (continuous) entropy: same as above, but substitute Jensen’s inequality for Gibbs’. I will not be proving Jensen’s inequality here as it requires more machinery, but I can give a sketch of how you would apply it as above.
- Mutual information (definition and interpretation in terms of other operations we know)
- Joint entropy. This is not a particularly useful notion, but I’m calling it out to avoid confusion around terminology. This should be distinguished from conditional entropy, cross entropy, and “relative entropy” (a little-used term for KL divergence).
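To make the self-information and entropy items above concrete, here’s a minimal Python sketch (function names and the toy distributions are my own; base-2 logs, so units are bits):

```python
import math

def self_information(p: float) -> float:
    """Shannon's self-information of an outcome with probability p, in bits.

    Rarer outcomes are more "surprising" and carry more information.
    """
    return -math.log2(p)

def entropy(dist: list[float]) -> float:
    """Entropy of a distribution: the expected self-information over the sample space."""
    return sum(p * self_information(p) for p in dist if p > 0)

# A fair coin has maximal entropy for two outcomes; a biased coin has less.
fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(entropy(fair))    # 1.0 bit
print(entropy(biased))  # ~0.469 bits
```

The `p > 0` guard reflects the usual convention that 0 · log 0 = 0 (the limit as p → 0).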
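The cross entropy and KL divergence definitions also reduce to a few lines, with KL expressed exactly as framed in the agenda: the difference between the cross entropy and the true distribution’s entropy (names and distributions here are made up for illustration):

```python
import math

def cross_entropy(p: list[float], q: list[float]) -> float:
    """Average bits needed to code samples from p using a code optimized for q."""
    return sum(pi * -math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) = H(p, q) - H(p): the extra bits paid for modeling p with q.

    Nonnegative by Gibbs' inequality, and zero iff p == q.
    """
    return cross_entropy(p, q) - cross_entropy(p, p)

p = [0.5, 0.5]
q = [0.8, 0.2]
print(kl_divergence(p, q))  # positive, since q != p
print(kl_divergence(p, p))  # 0.0
```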
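For the code-optimality claim, a quick numeric check of the Kraft inequality, using the binary prefix code {0, 10, 110, 111} as the example (the helper name is mine):

```python
def kraft_sum(code_lengths: list[int], alphabet_size: int = 2) -> float:
    """Sum of D^(-l_i) over codeword lengths l_i.

    A prefix code with these lengths exists iff the sum is <= 1 (Kraft);
    McMillan extends the same bound to all uniquely decodable codes.
    """
    return sum(alphabet_size ** -length for length in code_lengths)

# Lengths of the valid binary prefix code {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))  # 1.0 -- a "complete" code
# No binary prefix code can have four codewords of length 1:
print(kraft_sum([1, 1, 1, 1]))  # 2.0 > 1

# For the dyadic distribution p = (1/2, 1/4, 1/8, 1/8), those lengths are
# optimal: the average length 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75
# exactly equals the entropy, illustrating the claim.
```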
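Finally, a sketch of mutual information computed from a small joint distribution table (the function name and tables are made up; this follows the standard definition I(X; Y) = Σ p(x,y) log₂ p(x,y)/(p(x)p(y))):

```python
import math

def mutual_information(joint: list[list[float]]) -> float:
    """I(X; Y) from a joint probability table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]           # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]     # marginal of Y (column sums)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Independent variables share no information:
independent = [[0.25, 0.25], [0.25, 0.25]]
# Perfectly correlated binary variables share a full bit:
correlated = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))  # 0.0
print(mutual_information(correlated))   # 1.0
```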
(Apologies if I stumble through any of this. It’s been a long time since I’ve reviewed this material, but I’m confident enough about my grasp that I’m willing to give it a shot cold. 😅 This stuff is widely available online and I noted the primary resources I recall learning from.)
tags: