Dropout - Dilution - DropConnect (ANN)
- is a regularization method for artificial neural networks that helps prevent overfitting
- the key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.
- introduced by Hinton et al. in 2012
Dropout Introduction
Deep neural networks are likely to quickly overfit a training dataset with few examples. Ensembles of neural networks with different model configurations are known to reduce overfitting, but they require the additional computational expense of training and maintaining multiple models. A single model can instead be used to simulate a large number of different network architectures by randomly dropping out nodes during training. This is called dropout, and it offers a computationally cheap and remarkably effective regularization method for reducing overfitting and improving generalization in deep neural networks of all kinds.

Dropout approximates training a large number of neural networks with different architectures in parallel. In the simplest case, each unit is retained with a fixed probability 𝑝 independent of the other units, where 𝑝 can be chosen using a validation set or simply set to 0.5, which appears to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal retention probability is usually closer to 1 than to 0.5.

Dropout is not used after training, when predicting with the fit network. Instead, if a unit was retained with probability 𝑝 during training, the outgoing weights of that unit are multiplied by 𝑝 at test time.
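A minimal NumPy sketch of this train/test behaviour; the layer sizes, the value 𝑝 = 0.5, and the variable names are illustrative assumptions rather than settings from the original papers.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                            # retention probability (0.5 is the commonly cited default)
W = rng.normal(size=(256, 128))    # weights from a 128-unit layer into a 256-unit layer (illustrative sizes)
b = np.zeros(256)
y = rng.normal(size=128)           # activations of the 128-unit layer for one example

# Training time: each unit is kept independently with probability p.
mask = rng.random(128) < p         # Bernoulli(p) retention mask
z_train = W @ (mask * y) + b       # dropped units contribute nothing to the next layer

# Test time: no units are dropped; the outgoing weights are scaled by p instead,
# so the expected input to the next layer matches what it saw during training.
z_test = (p * W) @ y + b
```

Many modern implementations use the equivalent "inverted dropout" formulation instead, dividing the retained activations by 𝑝 during training so that no weight rescaling is needed at test time.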
Dropout Neural Network Model Description
Consider a neural network with 𝐿 hidden layers. Let:
- 𝑙∈{1, ..., 𝐿} index the hidden layers of the network
- 𝐳(𝑙) is the vector input into layer 𝑙
- 𝐲(𝑙) is the vector output from layer 𝑙 (𝐲(0) = 𝐱 is the input)
- 𝑊(𝑙) and 𝑏(𝑙) are the weights and biases at layer 𝑙
Standard Feed-Forward Operation | Feed-Forward Operation with Dropout |
---|---|
𝐳(𝑙+1) = 𝑊(𝑙+1) 𝐲(𝑙) + 𝑏(𝑙+1) | 𝐫(𝑙) ∼ Bernoulli(𝑝) |
𝐲(𝑙+1) = 𝑓(𝐳(𝑙+1)) | 𝐲̃(𝑙) = 𝐫(𝑙) ∗ 𝐲(𝑙) |
 | 𝐳(𝑙+1) = 𝑊(𝑙+1) 𝐲̃(𝑙) + 𝑏(𝑙+1) |
 | 𝐲(𝑙+1) = 𝑓(𝐳(𝑙+1)) |

Here 𝐫(𝑙) is a vector of independent Bernoulli(𝑝) random variables, ∗ denotes element-wise multiplication, 𝑓 is the activation function, and 𝐲̃(𝑙) is the thinned output of layer 𝑙 that is fed forward to layer 𝑙+1.
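The table maps directly onto code. Below is a small NumPy sketch of one layer under both operations; the choice of ReLU for 𝑓 and the layer widths are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(z):
    """Activation function; ReLU is an illustrative choice."""
    return np.maximum(z, 0.0)

p = 0.5                                  # retention probability
W_next = rng.normal(size=(64, 128))      # W(l+1), illustrative shape
b_next = np.zeros(64)                    # b(l+1)
y_l = rng.normal(size=128)               # y(l), output of layer l

# Standard feed-forward operation.
z_next = W_next @ y_l + b_next           # z(l+1) = W(l+1) y(l) + b(l+1)
y_next = f(z_next)                       # y(l+1) = f(z(l+1))

# Feed-forward operation with dropout.
r_l = rng.random(128) < p                # r(l) ~ Bernoulli(p), one mask entry per unit
y_l_thinned = r_l * y_l                  # ỹ(l) = r(l) ∗ y(l), element-wise product
z_next_drop = W_next @ y_l_thinned + b_next   # z(l+1) = W(l+1) ỹ(l) + b(l+1)
y_next_drop = f(z_next_drop)             # y(l+1) = f(z(l+1))
```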
Dropout Effect on Features
In a standard network, hidden units can co-adapt, relying on specific other units to correct their mistakes. Because dropout makes the presence of any particular unit unreliable, each hidden unit is pushed to learn features that are useful on their own.
Dropout Effect on Sparsity
As a side effect, dropout tends to make hidden unit activations sparse, even when no sparsity-inducing regularizers are used. In a good sparse model, there should only be a few highly activated units for any data case. Moreover, the average activation of any unit across data cases should be low.
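As a rough illustration of how these two criteria could be checked, the sketch below computes them for a batch of hidden activations; the array shapes, the random activations, and the threshold for "highly activated" are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hidden-layer activations for a batch: rows are data cases, columns are units
# (illustrative shapes and values; in practice these come from a trained network).
activations = np.maximum(rng.normal(size=(1000, 256)), 0.0)

# Criterion 1: for any data case, only a few units should be highly activated.
threshold = 1.0                                     # assumed cutoff for "highly activated"
highly_active_per_case = (activations > threshold).sum(axis=1)
print("mean # of highly activated units per case:", highly_active_per_case.mean())

# Criterion 2: the average activation of any unit across data cases should be low.
mean_activation_per_unit = activations.mean(axis=0)
print("average of per-unit mean activations:", mean_activation_per_unit.mean())
```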
Comparisons with Other Regularization Methods
- Dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints, and sparse activity regularization
- Dropout may also be combined with other forms of regularization to yield further improvement
- Although dropout alone gives significant improvements, combining dropout with max-norm regularization, large decaying learning rates, and high momentum provides a significant boost over using dropout alone (see the configuration sketch after this list)
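As a sketch of such a combination, the Keras model below pairs dropout layers with a max-norm constraint on the weights, an exponentially decaying learning rate, and high-momentum SGD. All hyperparameter values (dropout rates, the max-norm value, the decay schedule, the momentum) are illustrative assumptions in the spirit of these recommendations, not reported settings.

```python
import tensorflow as tf

# Cap the norm of each unit's incoming weight vector (max-norm regularization).
max_norm = tf.keras.constraints.MaxNorm(max_value=3.0)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),           # e.g. flattened 28x28 images (assumed input size)
    tf.keras.layers.Dropout(0.2),                  # Keras' rate is the drop probability: keep inputs with p = 0.8
    tf.keras.layers.Dense(1024, activation="relu", kernel_constraint=max_norm),
    tf.keras.layers.Dropout(0.5),                  # hidden units kept with p = 0.5
    tf.keras.layers.Dense(1024, activation="relu", kernel_constraint=max_norm),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Large initial learning rate that decays over training, plus high momentum.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=10_000, decay_rate=0.96)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.95)

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```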