ML - Generative/Joint vs Discriminative/Conditional Models
- Andrew Ng's paper: discriminative-vs-generative-classifiers.pdf
A probabilistic model is either a discriminative (conditional) model or a generative (joint) model:

| | Discriminative Model (Conditional Model) | Generative Model (Joint Model) |
|---|---|---|
| Goal | directly estimates/learns the conditional probability distribution 𝐏(𝑌∣𝑋) | estimates/learns the joint probability distribution 𝐏(𝑋,𝑌) |
| What It's Used For | the estimated distribution 𝐏(𝑌∣𝑋) can be used to predict 𝑌 for a given 𝑋 (classification/regression) | the estimated distribution 𝐏(𝑋,𝑌) can be used to derive 𝐏(𝑌∣𝑋) via Bayes' rule for prediction, and to sample/generate new (𝑥,𝑦) pairs |
| In Relation to Amount of Data | typically has lower asymptotic error, so it tends to win once there is enough training data | typically reaches its (higher) asymptotic error faster, so it can do better when training data is scarce |
| Classification Problems | discriminative classifier model learns the decision boundary | generative classifier model learns the probability distributions of the classes |
| Example Models | e.g. Logistic Regression (Example 1 below) | e.g. Naive Bayes (Example 1 below) |
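To make the table concrete, here is a minimal scikit-learn sketch (my own illustration, not from the paper; it assumes scikit-learn is installed) contrasting what each kind of model actually estimates: `GaussianNB` is trained generatively and stores parameters for 𝐏(𝑌) and 𝐏(𝑋|𝑌), while `LogisticRegression` is trained discriminatively and stores only the weights of a decision boundary for 𝐏(𝑌|𝑋).

```python
# Minimal sketch (assumes scikit-learn is installed): contrast what a
# generative and a discriminative classifier actually estimate.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

gen = GaussianNB().fit(X, y)                       # generative: models P(Y) and P(X|Y)
dis = LogisticRegression(max_iter=1000).fit(X, y)  # discriminative: models P(Y|X) directly

# The generative model keeps class priors P(Y) and per-class feature
# distributions (here Gaussian means), i.e. enough to reconstruct P(X,Y).
print(gen.class_prior_)   # P(Y)
print(gen.theta_)         # per-class feature means for P(X|Y)

# The discriminative model keeps only decision-boundary weights for P(Y|X).
print(dis.coef_, dis.intercept_)

# Both can be used for classification via the conditional P(Y|X).
print(gen.predict_proba(X[:2]))
print(dis.predict_proba(X[:2]))
```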
Example 1
Consider the problem of identifying letters from handwritten images:
- the target variable 𝑌 is a categorical variable with cardinality of 26
- the input variables {𝑋1, ..., 𝑋𝑛} are the 𝑛 = 256 pixels of the image
We can then either:
- generatively train a Naive Bayes Model
- discriminatively train a Logistic Regression Model
The Naive Bayes Model separately learns, for each of the 26 labels, the distribution over each of the 256 pixel values; each of these is estimated independently, giving rise to a set of fairly low-dimensional estimation problems. Conversely, the discriminative model jointly optimizes all of the approximately 26×256 parameters of the multinomial logit distribution, a much higher-dimensional estimation problem. Thus, for sparse data, the Naive Bayes Model may often perform better.
However, even in this simple setting, the independence assumption made by the Naive Bayes Model (that pixels are independent given the image label) is clearly false. As a consequence, the Naive Bayes Model may effectively double-count correlated features as if they were independent pieces of evidence, leading to errors in the estimation. The discriminative model does not make these assumptions; by fitting the parameters jointly, it can compensate for redundancy and other correlations between the features. Thus, once we have enough data to fit the logistic model reasonably well, we would expect it to perform better (see the sketch below).
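This data-vs-accuracy claim can be sanity-checked empirically. The sketch below is my own and uses scikit-learn's 8×8 digits dataset (10 classes, 64 pixels) as a stand-in for the 26-letter, 256-pixel setting, with Gaussian Naive Bayes standing in for the per-class pixel distributions described above; typically Naive Bayes is competitive on small training sets while logistic regression pulls ahead as the training set grows, though the exact numbers will vary from run to run.

```python
# Rough empirical check of the "amount of data" claim (assumes scikit-learn).
# The 8x8 digits dataset (10 classes, 64 pixels) stands in for the
# 26-letter, 256-pixel example in the text.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for n in (30, 100, 300, len(X_train)):   # increasing amounts of training data
    nb = GaussianNB().fit(X_train[:n], y_train[:n])
    lr = LogisticRegression(max_iter=5000).fit(X_train[:n], y_train[:n])
    print(f"n={n:4d}  naive Bayes: {nb.score(X_test, y_test):.3f}"
          f"  logistic regression: {lr.score(X_test, y_test):.3f}")
```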
Generative & Discriminative
Example 2
suppose you have the following data in the form (𝑥,𝑦):
- (1,0)
- (1,0)
- (2,0)
- (2,1)
Generative model 𝐏(𝑋,𝑌):

| | 𝑌=0 | 𝑌=1 |
|---|---|---|
| 𝑋=1 | 1/2 | 0 |
| 𝑋=2 | 1/4 | 1/4 |

Discriminative model 𝐏(𝑌∣𝑋):

| | 𝑌=0 | 𝑌=1 |
|---|---|---|
| 𝑋=1 | 1 | 0 |
| 𝑋=2 | 1/2 | 1/2 |
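Both tables follow from simple counting over the four data points; a quick plain-Python check (my own sketch):

```python
# Reproduce the generative and discriminative tables from the four data points.
from collections import Counter

data = [(1, 0), (1, 0), (2, 0), (2, 1)]
n = len(data)

joint = Counter(data)                   # counts of (x, y) pairs
x_counts = Counter(x for x, _ in data)  # counts of x alone

for x in (1, 2):
    for y in (0, 1):
        p_joint = joint[(x, y)] / n             # generative: P(X=x, Y=y)
        p_cond = joint[(x, y)] / x_counts[x]    # discriminative: P(Y=y | X=x)
        print(f"x={x} y={y}  P(x,y)={p_joint:.2f}  P(y|x)={p_cond:.2f}")

# Output matches the tables:
#   x=1 y=0  P(x,y)=0.50  P(y|x)=1.00
#   x=1 y=1  P(x,y)=0.00  P(y|x)=0.00
#   x=2 y=0  P(x,y)=0.25  P(y|x)=0.50
#   x=2 y=1  P(x,y)=0.25  P(y|x)=0.50
```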
What do joint models 𝐏(𝑋,𝑌) have to do with conditional models 𝐏(𝑌|𝑋)?
think of the space 𝑋×𝑌 as a Cartesian product, where:
- 𝑋 is generally huge (e.g. space of documents)
- 𝑌 is generally small (e.g. 2-100 topic classes)
we can build models over 𝐏(𝑋,𝑌) by calculating expectations of features over 𝑋×𝑌:
- 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]
However, this is impractical as we cannot enumerate 𝑋 efficiently
𝑋 may be huge or infinite, but only a few 𝑥 occur in our data
Therefore, we can add one indicator feature for each 𝑥 and constrain its expectation (i.e. the marginal 𝐏(𝑋=𝑥)) to match our empirical data:
- ∀𝑥∊𝑋: 𝐏(𝑋=𝑥) = 𝐏ˆ(𝑋=𝑥)
now most entries of 𝐏(𝑥,𝑦) will be zero, since any 𝑥 that does not occur in the data has 𝐏ˆ(𝑋=𝑥) = 0, and the constraint forces 𝐏(𝑋=𝑥) = 0 for it
we can therefore use the much easier sum:
- 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]
- 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌 & 𝐏ˆ(𝑋=𝑥)>0[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]
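A small numeric sanity check of this restriction, using the Example 2 joint distribution padded with an unseen value 𝑥 = 3 and an arbitrary indicator feature (both the extra value and the feature are illustrative assumptions of mine):

```python
# Check that restricting the sum to x with empirical support changes nothing,
# because the constrained model puts zero mass on unseen x.
X_vals = [1, 2, 3]          # 3 never occurs in the data, so P^(X=3) = 0
Y_vals = [0, 1]

# Joint from Example 2, extended with zero mass on the unseen x = 3.
P = {(1, 0): 0.5, (1, 1): 0.0,
     (2, 0): 0.25, (2, 1): 0.25,
     (3, 0): 0.0, (3, 1): 0.0}

P_hat_x = {1: 0.5, 2: 0.5, 3: 0.0}   # empirical marginal over x

def f(x, y):                          # an arbitrary indicator feature
    return 1.0 if (x == 2 and y == 1) else 0.0

full = sum(P[(x, y)] * f(x, y) for x in X_vals for y in Y_vals)
restricted = sum(P[(x, y)] * f(x, y)
                 for x in X_vals for y in Y_vals if P_hat_x[x] > 0)

print(full, restricted)   # both 0.25: the restricted sum is enough
```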
since we've constrained the 𝑋 marginals, the only thing that can vary is the conditional distributions:
- 𝐏(𝑋=𝑥,𝑌=𝑦) = 𝐏(𝑌=𝑦|𝑋=𝑥)𝐏(𝑋=𝑥) # via chain rule
- 𝐏(𝑋=𝑥,𝑌=𝑦) = 𝐏(𝑌=𝑦|𝑋=𝑥)𝐏ˆ(𝑋=𝑥) # our constraint
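In other words, once the 𝑋 marginals are pinned to the empirical ones, choosing the conditionals 𝐏(𝑌=𝑦|𝑋=𝑥) for the observed 𝑥 fully determines the joint model. A small check of this using the Example 2 numbers (my own sketch):

```python
# With the X marginals pinned to the empirical P^(x), the joint is just
# P(y|x) * P^(x): choosing the conditionals determines the whole joint model.
P_hat_x = {1: 0.5, 2: 0.5}                     # empirical marginal from Example 2
P_y_given_x = {(1, 0): 1.0, (1, 1): 0.0,       # key (x, y) -> P(Y=y | X=x)
               (2, 0): 0.5, (2, 1): 0.5}

# Build the joint via the chain rule under the marginal constraint.
P_joint = {(x, y): P_y_given_x[(x, y)] * P_hat_x[x]
           for x in P_hat_x for y in (0, 1)}
print(P_joint)   # {(1, 0): 0.5, (1, 1): 0.0, (2, 0): 0.25, (2, 1): 0.25}

# The constraint is satisfied: the joint's X marginal equals P^(x).
for x in P_hat_x:
    assert abs(sum(P_joint[(x, y)] for y in (0, 1)) - P_hat_x[x]) < 1e-12
```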