H(X) = −∑ᵢ p(xᵢ) log p(xᵢ)

What it is:

This formula calculates the Shannon entropy, often just called entropy, of a discrete random variable X. In simple terms, entropy measures the average amount of uncertainty or surprise associated with the possible outcomes of that random variable.

Breaking down the components:

  1. H(X): This represents the entropy of the random variable X. It's the value we are calculating.
  2. X: This is a random variable, which means it can take on different possible values or outcomes. Think of tossing a coin (X can be Heads or Tails) or rolling a die (X can be 1, 2, 3, 4, 5, or 6).
  3. xᵢ: This represents a specific possible outcome or value that the random variable X can take. For a coin toss, x₁ could be Heads and x₂ could be Tails.
  4. p(xᵢ): This is the probability that the random variable X takes on the specific value xᵢ. For a fair coin, p(Heads)=0.5 and p(Tails)=0.5.
  5. log p(xᵢ): This is the logarithm of the probability p(xᵢ).
    • The base of the logarithm determines the units of entropy.
      • Base 2 (log₂): Units are bits (most common in information theory).
      • Base e (ln): Units are nats.
      • Base 10 (log₁₀): Units are hartleys or dits.
    • Since probabilities p(xᵢ) are between 0 and 1, their logarithms (for bases > 1) will be negative or zero.
  6. ∑ᵢ: This is the summation symbol. It means we calculate the term p(xᵢ) log p(xᵢ) for every possible outcome xᵢ of the random variable X, and then add all those terms together.
  7. −: The negative sign at the beginning ensures that the final entropy value H(X) is non-negative: each log p(xᵢ) term is non-positive, so the weighted sum ∑ᵢ p(xᵢ) log p(xᵢ) is non-positive, and the leading minus sign flips it.
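
As a quick numeric check of the point about logarithm bases, here is a minimal Python sketch (standard library only; p = 0.5 is just the fair-coin probability used as an example) that prints the same surprise value −log p in the three unit systems:

```python
import math

p = 0.5  # probability of one outcome of a fair coin toss

# The surprise -log p expressed in the three common units:
print(-math.log2(p))   # 1.0     -> bits     (base 2)
print(-math.log(p))    # ~0.6931 -> nats     (base e)
print(-math.log10(p))  # ~0.3010 -> hartleys (base 10)
```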

Putting it together:

The term −p(xᵢ) log p(xᵢ) quantifies the "surprise", or information content, associated with outcome xᵢ, weighted by how likely that outcome is. Less likely events (small p(xᵢ)) carry more surprise: their information content −log p(xᵢ) is large. The formula sums these weighted surprise values across all possible outcomes to give the average surprise, or uncertainty, of the random variable X.
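
To make the summation concrete, here is a minimal Python sketch of the formula; the function name `shannon_entropy` and the `base` parameter are illustrative choices rather than anything prescribed by the text above:

```python
import math

def shannon_entropy(probs, base=2):
    """Average surprise H(X) = -sum_i p(x_i) * log p(x_i).

    `probs` is a sequence of probabilities that should sum to 1.
    Zero-probability terms are skipped, using the convention 0 * log 0 = 0.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)
```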

  • High Entropy: Means high uncertainty. The outcomes are more evenly spread out in probability (like a fair coin).
  • Low Entropy: Means low uncertainty. One or a few outcomes are much more likely than others (like a biased coin that almost always lands heads). The minimum entropy is 0, which occurs when one outcome has a probability of 1 (no uncertainty at all).
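
Using the same kind of sketch (the one-line definition is re-stated so the snippet runs on its own, and the example probabilities are chosen arbitrarily), a uniform distribution over four outcomes reaches the maximum of 2 bits, while a heavily skewed one scores far lower:

```python
import math

def shannon_entropy(probs, base=2):
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits   -- high entropy: all outcomes equally likely
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits -- low entropy: one outcome dominates
```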

Example: Coin Toss

Let X be the outcome of a coin toss. Possible outcomes are Heads (H) and Tails (T). We'll use log₂, so entropy is measured in bits.

Case 1: Fair Coin

  • Probabilities: p(H)=0.5, p(T)=0.5
  • Calculation:
    H(X) = −[p(H) log₂ p(H) + p(T) log₂ p(T)]
         = −[0.5 log₂(0.5) + 0.5 log₂(0.5)]
         = −[0.5 × (−1) + 0.5 × (−1)]    (since log₂(0.5) = log₂(2⁻¹) = −1)
         = −[−0.5 − 0.5]
         = −[−1]
         = 1 bit
  • Interpretation: There is 1 bit of uncertainty associated with a fair coin toss. This is the maximum possible entropy for a variable with two outcomes.

Case 2: Biased Coin (Always Lands Heads)

  • Probabilities: p(H)=1, p(T)=0
  • Calculation: We use the convention that 0 × log 0 = 0, justified by the fact that p log p → 0 as p → 0.
    H(X) = −[p(H) log₂ p(H) + p(T) log₂ p(T)]
         = −[1 × log₂(1) + 0 × log₂(0)]
         = −[1 × 0 + 0]    (since log₂(1) = 0, and the second term is 0 by the limit convention)
         = 0 bits
  • Interpretation: There are 0 bits of uncertainty. We know the outcome before the toss, so there is no surprise.

Case 3: Slightly Biased Coin

  • Probabilities: p(H)=0.8, p(T)=0.2
  • Calculation:
    H(X) = −[p(H) log₂ p(H) + p(T) log₂ p(T)]
         = −[0.8 log₂(0.8) + 0.2 log₂(0.2)]
         ≈ −[0.8 × (−0.3219) + 0.2 × (−2.3219)]
         ≈ −[−0.2575 − 0.4644]
         ≈ −[−0.7219]
         ≈ 0.7219 bits
  • Interpretation: There is less uncertainty than the fair coin (1 bit) but more than the completely predictable coin (0 bits).
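
The three coin cases can be reproduced with the same kind of sketch (the fair coin gives exactly 1 bit, the certain coin 0 bits, and the 0.8/0.2 coin about 0.7219 bits):

```python
import math

def shannon_entropy(probs, base=2):
    # Zero-probability terms are skipped: by convention, 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # 1.0     -- Case 1: fair coin
print(shannon_entropy([1.0, 0.0]))  # 0.0     -- Case 2: always lands heads
print(shannon_entropy([0.8, 0.2]))  # ~0.7219 -- Case 3: slightly biased coin
```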

In essence, the formula provides a way to quantify the average unpredictability of a system or a source of information based on the probabilities of its different states or symbols.
