ML - Generative/Joint vs Discriminative/Conditional Models

Discriminative Model (Conditional Model) vs Generative Model (Joint Model)

Common to Both

Either kind of model can be:

  • a classifier - 𝑌 is a categorical variable
  • a regressor - 𝑌 is a scalar variable

Goal

  • discriminative: directly estimates/learns the conditional probability distribution 𝐏(𝑌|𝑋)
  • generative: estimates/learns the joint probability distribution 𝐏(𝑋,𝑌)

What It's Used For

discriminative - the estimated distribution 𝐏(𝑌|𝑋) can be used for:

  • estimating/predicting the value of 𝑌 when given 𝑋

generative - the estimated distribution 𝐏(𝑋,𝑌) can be used for:

  • estimating/predicting the value of 𝑌 when given 𝑋, by transforming 𝐏(𝑋,𝑌) into 𝐏(𝑌|𝑋) via Bayes' rule
  • generating likely (𝑥,𝑦) pairs

Other

discriminative:

  • does not encode the distribution over 𝑋
  • since 𝑋 is not encoded, no independence assumptions are made within 𝑋; discriminative models therefore have lower bias than generative models that make such assumptions
  • cannot be used to reach any conclusions about 𝑋

generative:

  • encodes the distribution over 𝑋
  • tends to encode independence assumptions within 𝑋 in order to lower the model complexity; these generative models therefore have higher bias than discriminative models
  • better able to deal with missing values and unlabeled data

In Relation to Amount of Data

  • the additional independence-assumption bias in a generative model can help regularize and constrain the model, reducing its ability to overfit the data; generative training therefore often works better when learning from limited amounts of data
  • as the amount of data grows, the bias imposed by these constraints starts to dominate the model's error; discriminative models (which make fewer assumptions) are therefore less affected by incorrect model assumptions and will often outperform generatively trained models on larger data sets

Classification Problems

  • a discriminative classifier model learns the decision boundary between the classes
  • a generative classifier model learns the probability distribution of each class
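
A minimal sketch of this contrast (assuming scikit-learn and toy synthetic data, not anything from these notes): LogisticRegression plays the discriminative role and models 𝐏(𝑌|𝑋) directly, while GaussianNB plays the generative role, modelling 𝐏(𝑋,𝑌) = 𝐏(𝑋|𝑌)𝐏(𝑌) and recovering 𝐏(𝑌|𝑋) via Bayes' rule at prediction time; only the generative model can also be sampled to produce likely (𝑥,𝑦) pairs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# toy 2-class data: each class is a Gaussian blob in a 2-D feature space
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# discriminative: learns the decision boundary / P(Y|X) directly
disc = LogisticRegression().fit(X, y)
print(disc.predict_proba([[1.0, 1.0]]))   # P(Y|X=x)

# generative: learns P(Y) and P(X|Y), i.e. the joint distribution
gen = GaussianNB().fit(X, y)
print(gen.predict_proba([[1.0, 1.0]]))    # P(Y|X=x), obtained internally via Bayes' rule

# only the generative model can produce likely (x, y) pairs:
# sample a label from P(Y), then sample features from the fitted P(X|Y=y)
label = rng.choice(gen.classes_, p=gen.class_prior_)
x_new = rng.normal(loc=gen.theta_[label], scale=np.sqrt(gen.var_[label]))
print(label, x_new)
```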

Example Models

  • discriminative: e.g. (multinomial) logistic regression
  • generative: e.g. the Naive Bayes model


Example 1

Consider the problem of identifying letters from handwritten images:

  • the target variable 𝑌 is a categorical variable with cardinality of 26
  • the input variables {𝑋1, ..., 𝑋𝑛} are the 𝑛 pixels of the image

We can then either train a generative Naive Bayes model or a discriminative multinomial logistic regression (multinomial logit) model.

The Naive Bayes model separately learns the distribution over each of the 256 pixel values (here 𝑛 = 256) given each of the 26 labels; each of these is estimated independently, giving rise to a set of fairly low-dimensional estimation problems. Conversely, the discriminative model jointly optimizes all of the approximately 26×256 parameters of the multinomial logit distribution, a much higher-dimensional estimation problem. Thus, for sparse data, the Naive Bayes model may often perform better.

However, even in this simple setting, the independence assumption made by the Naive Bayes model (that pixels are independent given the image label) is clearly false. As a consequence, the Naive Bayes model may be counting as independent features that are actually correlated, leading to errors in the estimation. The discriminative model makes no such assumptions; by fitting the parameters jointly, it can compensate for redundancy and other correlations between the features. Thus, once we have enough data to fit the logistic model reasonably well, we would expect it to perform better.
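
A rough, runnable sketch of this comparison (on assumed synthetic data, not real handwriting images): BernoulliNB stands in for the generative Naive Bayes model and LogisticRegression for the discriminative multinomial logit. In this synthetic data the pixels really are independent given the label, so the sketch mainly illustrates the sparse-data behaviour and the 26×256 parameter count, not the violated-assumption effect.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, n_pixels = 26, 256

# each class gets its own per-pixel "ink" probabilities; pixels are then sampled
# independently given the label (so the Naive Bayes assumption holds exactly here)
proto = rng.uniform(0.05, 0.95, size=(n_classes, n_pixels))

def sample(n_per_class):
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = (rng.uniform(size=(y.size, n_pixels)) < proto[y]).astype(int)
    return X, y

X_test, y_test = sample(100)

for n_train_per_class in (5, 500):                            # sparse vs plentiful data
    X_tr, y_tr = sample(n_train_per_class)
    nb = BernoulliNB().fit(X_tr, y_tr)                        # learns P(X|Y) and P(Y)
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)    # learns P(Y|X) directly
    print(n_train_per_class,
          "NB:", round(nb.score(X_test, y_test), 3),
          "LR:", round(lr.score(X_test, y_test), 3))

print("logit weight count:", lr.coef_.size)                   # 26 × 256 = 6656
```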


Example 2

suppose you have the following data in the form (𝑥,𝑦):

  • (1,0)
  • (1,0)
  • (2,0)
  • (2,1)

Generative 𝐏(𝑋,𝑌):

        𝑌=0    𝑌=1
  𝑋=1   1/2    0
  𝑋=2   1/4    1/4

Discriminative 𝐏(𝑌|𝑋):

        𝑌=0    𝑌=1
  𝑋=1   1      0
  𝑋=2   1/2    1/2
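
A small sketch (assumed helper code, not from these notes) that computes both tables directly from the four data points above: the joint table normalizes each count by the total number of points, while the conditional table normalizes by the count of the corresponding 𝑥.

```python
from collections import Counter
from fractions import Fraction

data = [(1, 0), (1, 0), (2, 0), (2, 1)]

# generative view: joint distribution P(X=x, Y=y) = count(x, y) / N
joint = {xy: Fraction(c, len(data)) for xy, c in Counter(data).items()}

# discriminative view: conditional distribution P(Y=y | X=x) = count(x, y) / count(x)
x_counts = Counter(x for x, _ in data)
conditional = {(x, y): Fraction(c, x_counts[x]) for (x, y), c in Counter(data).items()}

for (x, y), p in sorted(joint.items()):
    print(f"P(X={x}, Y={y}) = {p}")       # e.g. P(X=1, Y=0) = 1/2
for (x, y), p in sorted(conditional.items()):
    print(f"P(Y={y} | X={x}) = {p}")      # e.g. P(Y=0 | X=1) = 1
```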

What do joint models 𝐏(𝑋,𝑌) have to do with conditional models 𝐏(𝑌|𝑋)?

think of the space 𝑋×𝑌 as a Cartesian product, where:

  • 𝑋 is generally huge (e.g. space of documents)
  • 𝑌 is generally small (e.g. 2-100 topic classes)

we can build models over 𝐏(𝑋,𝑌) by calculating expectations of features over 𝑋×𝑌:

  • 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]

However, this is impractical as we cannot enumerate 𝑋 efficiently

𝑋 may be huge or infinite, but only a few 𝑥 occur in our data

Therefore, we can add one feature for each 𝑥 and constrain its expectation to match our empirical data:

  • ∀𝑥∊𝑋: 𝐏(𝑋=𝑥) = 𝐏ˆ(𝑋=𝑥)

now most entries of 𝐏(𝑋=𝑥,𝑌=𝑦) will be zero, since 𝐏(𝑋=𝑥) = 𝐏ˆ(𝑋=𝑥) = 0 for every 𝑥 that does not occur in our data

we can therefore use the much easier sum:

  • 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]
  • 𝐄[𝑓𝑖] = 𝛴(𝑥,𝑦)∊𝑋×𝑌, 𝐏ˆ(𝑋=𝑥)>0[ 𝐏(𝑋=𝑥,𝑌=𝑦)𝑓𝑖(𝑥,𝑦) ]

since we've constrained the 𝑋 marginals, the only thing that can still vary is the conditional distribution:

  • 𝐏(𝑋=𝑥,𝑌=𝑦) = 𝐏(𝑌=𝑦|𝑋=𝑥)𝐏(𝑋=𝑥) # via the chain rule
  • 𝐏(𝑋=𝑥,𝑌=𝑦) = 𝐏(𝑌=𝑦|𝑋=𝑥)𝐏ˆ(𝑋=𝑥) # applying our constraint

substituting this back into the expectation gives:

  • 𝐄[𝑓𝑖] = 𝛴𝑥∊𝑋, 𝐏ˆ(𝑋=𝑥)>0[ 𝐏ˆ(𝑋=𝑥) 𝛴𝑦∊𝑌[ 𝐏(𝑌=𝑦|𝑋=𝑥)𝑓𝑖(𝑥,𝑦) ] ]

so the joint model's expectations only ever require the conditional model 𝐏(𝑌|𝑋), evaluated at the 𝑥 values seen in the data
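
A toy sketch of this reduced computation (the inputs, label set, feature function, and stand-in conditional model below are all assumed for illustration, not taken from these notes): with the 𝑋 marginals pinned to their empirical values, 𝐄[𝑓𝑖] only needs a sum over the 𝑥 values that actually occur in the data, weighted by 𝐏ˆ(𝑋=𝑥) and the conditional model 𝐏(𝑌|𝑋).

```python
from collections import Counter

data_x = ["doc_a", "doc_a", "doc_b", "doc_c"]    # observed inputs (tiny "corpus")
labels = [0, 1]                                  # the small label space Y

# empirical marginal P^(X=x)
p_hat = {x: c / len(data_x) for x, c in Counter(data_x).items()}

def p_y_given_x(y, x):
    # stand-in conditional model P(Y=y|X=x); a real model would be learned
    return 0.5

def f_i(x, y):
    # stand-in indicator feature, e.g. "input is doc_a and label is 1"
    return 1.0 if (x == "doc_a" and y == 1) else 0.0

# E[f_i] = sum over observed x of  P^(x) * sum over y of  P(y|x) * f_i(x, y)
expectation = sum(
    p_hat[x] * sum(p_y_given_x(y, x) * f_i(x, y) for y in labels)
    for x in p_hat
)
print(expectation)  # 0.5 * 0.5 = 0.25 for this toy setup
```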