ML - Generative/Joint vs Discriminative/Conditional Models

Discriminative Model (Conditional Model) vs Generative Model (Joint Model)

Common to Both

Either kind of model can be:

  • a classifier - 𝑌 is a categorical variable
  • a regressor - 𝑌 is a scalar variable

Goal

  • discriminative: directly estimates/learns the conditional probability distribution 𝐏(𝑌|𝑋)
  • generative: estimates/learns the joint probability distribution 𝐏(𝑋,𝑌)

What It's Used For

discriminative - the estimated distribution 𝐏(𝑌|𝑋) can be used for:

  • estimating/predicting the value of 𝑌 when given 𝑋

generative - the estimated distribution 𝐏(𝑋,𝑌) can be used for:

  • estimating/predicting the value of 𝑌 when given 𝑋, by transforming 𝐏(𝑋,𝑌) into 𝐏(𝑌|𝑋) via Bayes' rule
  • generating likely (𝑥,𝑦) pairs

Other

discriminative:

  • does not encode the distribution over 𝑋
  • since 𝑋 is not encoded, no independence assumptions are made within 𝑋; discriminative models therefore have lower bias than generative models that make such assumptions
  • cannot be used to reach any conclusions about 𝑋

generative:

  • encodes the distribution over 𝑋
  • tends to encode independence assumptions within 𝑋 in order to lower the model complexity; these generative models therefore have higher bias than discriminative models
  • better able to deal with missing values and unlabeled data

In Relation to Amount of Data

  • the additional independence-assumption bias in a generative model can help regularize and constrain the model, reducing its ability to overfit the data; generative training therefore often works better when learning from limited amounts of data
  • as the amount of data grows, the bias imposed by these constraints starts to dominate the model's error; discriminative models (which make fewer assumptions) are therefore less affected by incorrect model assumptions and will often outperform generatively trained models on larger data sets

Classification Problems

  • a discriminative classifier model learns the decision boundary between the classes
  • a generative classifier model learns the probability distribution of each class
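
A minimal sketch of this contrast (assuming scikit-learn and toy synthetic data, not anything from these notes): LogisticRegression plays the discriminative role and models 𝐏(𝑌|𝑋) directly, while GaussianNB plays the generative role, modelling 𝐏(𝑋,𝑌) = 𝐏(𝑋|𝑌)𝐏(𝑌) and recovering 𝐏(𝑌|𝑋) via Bayes' rule at prediction time; only the generative model can also be sampled to produce likely (𝑥,𝑦) pairs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# toy 2-class data: each class is a Gaussian blob in a 2-D feature space
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# discriminative: learns the decision boundary / P(Y|X) directly
disc = LogisticRegression().fit(X, y)
print(disc.predict_proba([[1.0, 1.0]]))   # P(Y|X=x)

# generative: learns P(Y) and P(X|Y), i.e. the joint distribution
gen = GaussianNB().fit(X, y)
print(gen.predict_proba([[1.0, 1.0]]))    # P(Y|X=x), obtained internally via Bayes' rule

# only the generative model can produce likely (x, y) pairs:
# sample a label from P(Y), then sample features from the fitted P(X|Y=y)
label = rng.choice(gen.classes_, p=gen.class_prior_)
x_new = rng.normal(loc=gen.theta_[label], scale=np.sqrt(gen.var_[label]))
print(label, x_new)
```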

Example Models

  • discriminative: e.g. (multinomial) logistic regression
  • generative: e.g. the Naive Bayes model


Example 1

Consider the problem of identifying letters from handwritten images:

  • the target variable 𝑌 is a categorical variable with cardinality of 26
  • the input variables {𝑋1, ..., 𝑋𝑛} are the 𝑛 pixels of the image

We can then either train a generative Naive Bayes model or a discriminative multinomial logistic regression (multinomial logit) model.

The Naive Bayes model separately learns the distribution over each of the 256 pixel values (here 𝑛 = 256) given each of the 26 labels; each of these is estimated independently, giving rise to a set of fairly low-dimensional estimation problems. Conversely, the discriminative model jointly optimizes all of the approximately 26×256 parameters of the multinomial logit distribution, a much higher-dimensional estimation problem. Thus, for sparse data, the Naive Bayes model may often perform better.

However, even in this simple setting, the independence assumption made by the Naive Bayes model (that pixels are independent given the image label) is clearly false. As a consequence, the Naive Bayes model may be counting as independent features that are actually correlated, leading to errors in the estimation. The discriminative model makes no such assumptions; by fitting the parameters jointly, it can compensate for redundancy and other correlations between the features. Thus, once we have enough data to fit the logistic model reasonably well, we would expect it to perform better.
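
A rough, runnable sketch of this comparison (on assumed synthetic data, not real handwriting images): BernoulliNB stands in for the generative Naive Bayes model and LogisticRegression for the discriminative multinomial logit. In this synthetic data the pixels really are independent given the label, so the sketch mainly illustrates the sparse-data behaviour and the 26×256 parameter count, not the violated-assumption effect.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, n_pixels = 26, 256

# each class gets its own per-pixel "ink" probabilities; pixels are then sampled
# independently given the label (so the Naive Bayes assumption holds exactly here)
proto = rng.uniform(0.05, 0.95, size=(n_classes, n_pixels))

def sample(n_per_class):
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = (rng.uniform(size=(y.size, n_pixels)) < proto[y]).astype(int)
    return X, y

X_test, y_test = sample(100)

for n_train_per_class in (5, 500):                            # sparse vs plentiful data
    X_tr, y_tr = sample(n_train_per_class)
    nb = BernoulliNB().fit(X_tr, y_tr)                        # learns P(X|Y) and P(Y)
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)    # learns P(Y|X) directly
    print(n_train_per_class,
          "NB:", round(nb.score(X_test, y_test), 3),
          "LR:", round(lr.score(X_test, y_test), 3))

print("logit weight count:", lr.coef_.size)                   # 26 × 256 = 6656
```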


Example 2

suppose you have the following data in the form (𝑥,𝑦):

  • (1,0)
  • (1,0)
  • (2,0)
  • (2,1)

Generative 𝐏(𝑋,𝑌):

        𝑌=0    𝑌=1
  𝑋=1   1/2    0
  𝑋=2   1/4    1/4

Discriminative 𝐏(𝑌|𝑋):

        𝑌=0    𝑌=1
  𝑋=1   1      0
  𝑋=2   1/2    1/2
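
A small sketch (assumed helper code, not from these notes) that computes both tables directly from the four data points above: the joint table normalizes each count by the total number of points, while the conditional table normalizes by the count of the corresponding 𝑥.

```python
from collections import Counter
from fractions import Fraction

data = [(1, 0), (1, 0), (2, 0), (2, 1)]

# generative view: joint distribution P(X=x, Y=y) = count(x, y) / N
joint = {xy: Fraction(c, len(data)) for xy, c in Counter(data).items()}

# discriminative view: conditional distribution P(Y=y | X=x) = count(x, y) / count(x)
x_counts = Counter(x for x, _ in data)
conditional = {(x, y): Fraction(c, x_counts[x]) for (x, y), c in Counter(data).items()}

for (x, y), p in sorted(joint.items()):
    print(f"P(X={x}, Y={y}) = {p}")       # e.g. P(X=1, Y=0) = 1/2
for (x, y), p in sorted(conditional.items()):
    print(f"P(Y={y} | X={x}) = {p}")      # e.g. P(Y=0 | X=1) = 1
```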

What do joint models 𝐏(𝑋,𝑌) have to do with conditional models 𝐏(𝑌|𝑋)?

think of the space 𝑋×𝑌 as a Cartesian product, where:

  • 𝑋 is generally huge (e.g. space of documents)
  • 𝑌 is generally small (e.g. 2-100 topic classes)

we can build models over 𝐏(𝑋,𝑌) by calculating expectations of features over 𝑋×𝑌:

  • 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]

However, this is impractical as we cannot enumerate 𝑋 efficiently

𝑋 may be huge or infinite, but only a few 𝑥 occur in our data

Therefore, we can add one feature for each 𝑥 and constrain its expectation to match our empirical data:

  • ∀𝑥∊𝑋: 𝐏(𝑋=𝑥) = 𝐏ˆ(𝑋=𝑥)

now most entries of 𝐏(𝑋=𝑥,𝑌=𝑦) will be zero, since 𝐏(𝑋=𝑥) = 𝐏ˆ(𝑋=𝑥) = 0 for every 𝑥 that does not occur in our data

we can therefore use the much easier sum:

  • 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]
  • 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌, 𝐏ˆ(𝑋=𝑥)>0[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]

since we've constrained the 𝑋 marginals, the only thing that can still vary is the conditional distribution:

  • 𝐏(𝑋=𝑥,𝑌=𝑦) = 𝐏(𝑌=𝑦|𝑋=𝑥)𝐏(𝑋=𝑥) # via the chain rule
  • 𝐏(𝑋=𝑥,𝑌=𝑦) = 𝐏(𝑌=𝑦|𝑋=𝑥)𝐏ˆ(𝑋=𝑥) # applying our constraint

substituting this back into the expectation gives:

  • 𝐄[𝑓𝑖] = 𝛴𝑥∊𝑋, 𝐏ˆ(𝑋=𝑥)>0[ 𝐏ˆ(𝑋=𝑥) 𝛴𝑦∊𝑌[ 𝐏(𝑌=𝑦|𝑋=𝑥)𝑓𝑖(𝑥,𝑦) ] ]

so the joint model's expectations only ever require the conditional model 𝐏(𝑌|𝑋), evaluated at the 𝑥 values seen in the data
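
A toy sketch of this reduced computation (the inputs, label set, feature function, and stand-in conditional model below are all assumed for illustration, not taken from these notes): with the 𝑋 marginals pinned to their empirical values, 𝐄[𝑓𝑖] only needs a sum over the 𝑥 values that actually occur in the data, weighted by 𝐏ˆ(𝑋=𝑥) and the conditional model 𝐏(𝑌|𝑋).

```python
from collections import Counter

data_x = ["doc_a", "doc_a", "doc_b", "doc_c"]    # observed inputs (tiny "corpus")
labels = [0, 1]                                  # the small label space Y

# empirical marginal P^(X=x)
p_hat = {x: c / len(data_x) for x, c in Counter(data_x).items()}

def p_y_given_x(y, x):
    # stand-in conditional model P(Y=y|X=x); a real model would be learned
    return 0.5

def f_i(x, y):
    # stand-in indicator feature, e.g. "input is doc_a and label is 1"
    return 1.0 if (x == "doc_a" and y == 1) else 0.0

# E[f_i] = sum over observed x of  P^(x) * sum over y of  P(y|x) * f_i(x, y)
expectation = sum(
    p_hat[x] * sum(p_y_given_x(y, x) * f_i(x, y) for y in labels)
    for x in p_hat
)
print(expectation)  # 0.5 * 0.5 = 0.25 for this toy setup
```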