Maximum Entropy (Maxent) Models

Maximum Entropy (Maxent) Models

Maximum Entropy (Maxent) Models

Maxent Model - Example

let's consider a discrete random variable 𝐶 with 2 outcomes: ℎ and 𝑡

  • 𝐏(𝐶=ℎ) = probability of seeing heads
  • 𝐏(𝐶=𝑡) = probability of seeing tails

below is the formula for univariate entropy, in which we want to maximize 𝐻𝐏(𝐏) with respect to the constraints of the model

  • 𝐻𝐏(𝐏) = 𝛴𝑥∊𝐶 [ - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) ]

below are 3 different models

Model With No ConstraintsModel With 1 ConstraintModel With 2 Constraints
NONE
here 𝐏(𝐶) is allowed to be an un-normalized distribution
i.e. 𝐏(𝐶) does not have to be a probability distribution
𝐏(𝐶=ℎ) + 𝐏(𝐶=𝑡) = 1
this constrains 𝐏(𝐶) to be a normalized distribution
i.e. 𝐏(𝐶) is a probability distribution
𝐏(𝐶=ℎ) + 𝐏(𝐶=𝑡) = 1
𝐏(𝐶=ℎ) = 0.3
thus there is a 2D plane of possible candidatesthus there is a 1D line of possible candidatesthus there is a single 1D point as the possible candidate

𝐻𝐏(𝐏) is maximized when:
𝐏(𝐶=ℎ) = 1/𝑒
𝐏(𝐶=𝑡) = 1/𝑒

this is because the max of -𝐏(𝐶=𝑥)𝑙𝑛𝐏(𝐶=𝑥) is 1/𝑒

𝐻𝐏(𝐏) is maximized when:
𝐏(𝐶=ℎ) = 1/2
𝐏(𝐶=𝑡) = 1/2

𝐻𝐏(𝐏) is maximized when:
𝐏(𝐶=ℎ) = 0.3
𝐏(𝐶=𝑡) = 0.7

which is the only candidate point

Why Find Maximum Entropy Model?

maximizing entropy in effect helps us find an estimated distribution model 𝐏ˆ that:

this is what we want in the estimated distribution model 𝐏ˆ

Solution

is to maximize entropy 𝐻, subject to feature-based constraints:

  • 𝐄𝐏[𝑓𝑖] = 𝐄𝐏ˆ[𝑓𝑖] ↔ 𝛴𝑥∊𝑓𝑖𝐏𝑥 = 𝐶𝑖

adding constraints/features:

  • lowers maximum entropy
  • raises the maximum likelihood of data
  • brings the distribution model further from the uniform distribution
  • brings the distribution model closer to the empirical distribution

Maxent - Properties

 maximum entropy models are convex

a model 𝐹 is convex when:

  • 𝐹(𝛴𝑖𝑤𝑖𝑥𝑖) ≥  𝛴𝑖𝑤𝑖𝐹(𝑥𝑖) where 𝛴𝑖𝑤𝑖 = 1

convexity guarantees a single, global maximum because any higher points are greedily reachable

maximum entropy models 𝐻𝐏(𝐏) = 𝛴𝑥∊𝐶 [ - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) ] are convex

  • - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) is convex
  • 𝛴𝑥∊𝐶 [ - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) ] is convex (sum of convex functions is convex)
  • the feasible-region of constrained 𝐻𝐏(𝐏) is a linear subspace that is convex
  • the constrained entropy surface is therefore convex

the Maximum Likelihood Estimation (MLE) exponential model formulation is also convex (dual)

Subpages

Resources