Meetup summary

2026-04-24 - Information theory - optimal coding

Same as last time.

Agenda:

This picks up where we left off last time, focusing on the theme of optimal coding. We’ll first define what a code is and what the assumptions are within the standard framework of information theory, then we’ll prove some useful properties and wrap up with related quantities.

  • Define a code and related terms
    • Prefix codes (also known as instantaneous codes)
    • The extension of a code
    • Singular and non-singular codes. (Mathematically minded people might call the latter injective codes, but I haven’t seen that term used in this context.)
    • When discussing properties, compressibility, and average code length, we always assume input symbols are iid.¹ ²
  • Claim: entropy represents the smallest possible average code length under some given distribution (assuming iid input messages)!
  • Proof sketch of the code optimality claim:
    • Kraft inequality (for prefix, or “instantaneous,” codes; reflects how you would code in practice)
    • Kraft-McMillan inequality. This is a generalization of the above for all non-singular codes. It’s interesting in that it rules out the possibility of improving on optimality by using some convoluted coding strategy (e.g., at the expense of runtime complexity).
    • Gibbs’ inequality (will include a self-contained and simple proof)
    • Tying it all together: combining Gibbs’ inequality with Kraft-McMillan, by normalizing the codeword lengths into a probability distribution.
  • Proof sketch of optimality for differential (continuous) entropy: same as above, but substitute Jensen’s inequality for Gibbs. I will not be proving Jensen’s inequality here as it requires more machinery, but I can give a sketch of how you would apply it as above.
  • Mutual information (definition and interpretation in terms of other operations we know)
  • Joint entropy. This is not a particularly useful notion, but I’m calling it out to avoid confusion around terminology. This should be distinguished from conditional entropy, cross entropy, and “relative entropy” (a little-used term for KL divergence).
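Several of the items above (prefix codes, the Kraft inequality, and entropy as a lower bound on average code length) can be spot-checked numerically. In this sketch, the alphabet, probabilities, and codewords are illustrative assumptions, with a dyadic distribution chosen so the bound is met exactly:

```python
import math

# Hypothetical toy source: symbol -> probability (dyadic, chosen for illustration).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# A prefix code for the same alphabet: no codeword is a prefix of another.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Kraft inequality: sum of 2^(-length) over codewords is at most 1 for any prefix code.
kraft_sum = sum(2 ** -len(w) for w in code.values())
assert kraft_sum <= 1

# Average code length under p, and the entropy lower bound H(p).
avg_len = sum(p[s] * len(code[s]) for s in p)
entropy = -sum(q * math.log2(q) for q in p.values())

print(avg_len, entropy)  # both 1.75: equality holds because p is dyadic
```

For a non-dyadic distribution the average length of any prefix code strictly exceeds the entropy, which is what the proof sketch above establishes in general.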
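Gibbs’ inequality itself (cross entropy is never below entropy, with equality iff the two distributions match) is also easy to spot-check numerically. Using random distributions here is an illustrative choice, not part of the proof:

```python
import math
import random

random.seed(0)

def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]

# Gibbs' inequality: -sum(p log q) >= -sum(p log p) for any distributions p, q.
for _ in range(1000):
    # Small offset keeps every weight strictly positive so log2 is defined.
    p = normalize([random.random() + 1e-9 for _ in range(5)])
    q = normalize([random.random() + 1e-9 for _ in range(5)])
    cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
    ent = -sum(pi * math.log2(pi) for pi in p)
    assert cross >= ent
```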
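For the last two bullets, the identities relating mutual information, joint entropy, and conditional entropy can be checked on a small joint distribution (the numbers below are made up for illustration):

```python
import math

# Hypothetical joint distribution over (X, Y), chosen for illustration.
joint = {("0", "0"): 0.4, ("0", "1"): 0.1, ("1", "0"): 0.1, ("1", "1"): 0.4}

def H(dist):
    """Shannon entropy (in bits) of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions of X and Y.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

# Joint entropy H(X,Y), and mutual information I(X;Y) = H(X) + H(Y) - H(X,Y).
h_joint = H(joint)
mi = H(px) + H(py) - h_joint

# Conditional entropy H(Y|X) = H(X,Y) - H(X); note this is distinct from H(X,Y).
h_y_given_x = h_joint - H(px)

print(h_joint, mi, h_y_given_x)
```

Here X and Y are correlated, so the mutual information comes out positive and the joint entropy is strictly less than H(X) + H(Y).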

Footnotes

  1. “Independent and identically distributed”, in the same sense as a probabilistic sample. Note that this does not imply a uniform distribution over symbols of your source alphabet, but that every “character” (or symbol) sampled from that alphabet has some well-known probability mass which does not change with its context. You can generalize the range of codes covered by this abstraction by taking n-grams instead of individual symbols.
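  A minimal sketch of the n-gram generalization, using a deliberately structured toy string: a unigram model sees maximum entropy, while a bigram model reveals that the source is in fact deterministic.

```python
import math
from collections import Counter

def entropy_per_symbol(text, n=1):
    """Empirical entropy per character, treating non-overlapping n-grams as symbols."""
    grams = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    counts = Counter(grams)
    total = sum(counts.values())
    h = -sum(c / total * math.log2(c / total) for c in counts.values())
    return h / n  # divide by n to get a per-character figure

text = "ab" * 1000  # hypothetical, highly structured source
print(entropy_per_symbol(text, 1))  # 1 bit/symbol under the unigram model
print(entropy_per_symbol(text, 2))  # ~0: bigrams expose the determinism
```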

  2. This is a limitation worth noting because it is generally not true of “real” codes, which is why you can sometimes get better compression than raw entropy would suggest using domain-specific compression techniques or lossy compression. For example, human language is particularly compressible, especially if you allow lossy encoding.
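  As a quick illustration of this footnote (the input phrase is a made-up example, so the exact numbers are illustrative): a general-purpose compressor such as zlib easily beats the per-symbol (unigram) entropy bound here, precisely because the source is far from iid.

```python
import math
import zlib
from collections import Counter

text = ("the quick brown fox " * 200).encode()

# Naive iid (unigram) entropy estimate, converted to a size in bytes.
counts = Counter(text)
total = len(text)
h = -sum(c / total * math.log2(c / total) for c in counts.values())
iid_bound = h * total / 8  # bytes implied by the per-symbol entropy

compressed = len(zlib.compress(text))
print(compressed, iid_bound)  # compressed size is far below the iid bound
```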