Dropout - Dilution - DropConnect (ANN)

Dropout Introduction

Deep neural networks are likely to quickly overfit a training dataset with few examples.

Ensembles of neural networks with different model configurations are known to reduce overfitting but require the additional computational expense of training and maintaining multiple models.

A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout and offers a very computationally cheap and remarkably effective regularization method to reduce overfitting and improve generalization in deep neural networks of all kinds.

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel.

In the simplest case, each unit is retained with a fixed probability 𝑝 independent of other units, where 𝑝 can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal probability of retention is usually closer to 1 than to 0.5.

Dropout is not used after training, when making predictions with the fitted network.

If a unit is retained with probability 𝑝 during training, the outgoing weights of that unit are multiplied by 𝑝 at test time.
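
To make this test-time rule concrete, here is a minimal NumPy sketch (illustrative only; names such as hidden_output_train and retain_prob are assumptions, and ReLU is an arbitrary choice of activation). It shows that averaging many dropped-out passes through a layer approaches the pass in which activations, and hence the outgoing weights, are scaled by 𝑝:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_output_train(x, W, b, retain_prob=0.5):
    # Training-time pass: each hidden unit is retained with probability retain_prob.
    h = np.maximum(0.0, W @ x + b)                     # ReLU activations (any nonlinearity works)
    mask = rng.binomial(1, retain_prob, size=h.shape)  # 1 = keep the unit, 0 = drop it
    return h * mask                                    # dropped units contribute nothing downstream

def hidden_output_test(x, W, b, retain_prob=0.5):
    # Test-time pass: nothing is dropped; scaling the activations by retain_prob is
    # equivalent to multiplying the unit's outgoing weights by retain_prob.
    h = np.maximum(0.0, W @ x + b)
    return h * retain_prob

# Averaging many dropped-out passes approaches the scaled test-time output.
x = rng.normal(size=10)
W = rng.normal(size=(5, 10))
b = np.zeros(5)
avg_train = np.mean([hidden_output_train(x, W, b) for _ in range(10_000)], axis=0)
print(np.allclose(avg_train, hidden_output_test(x, W, b), atol=0.1))  # expected: True
```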

Dropout Neural Network Model Description

Consider a neural network with 𝐿 hidden layers. Let:

  • 𝑙 ∈ {1, ..., 𝐿} index the hidden layers of the network
  • 𝐳(𝑙) denote the vector of inputs into layer 𝑙
  • 𝐲(𝑙) denote the vector of outputs from layer 𝑙 (𝐲(0) = 𝐱 is the input)
  • 𝑊(𝑙) and 𝑏(𝑙) denote the weights and biases at layer 𝑙
Standard Feed-Forward Operation

  • 𝐳(𝑙+1) = 𝑊(𝑙+1) 𝐲(𝑙) + 𝑏(𝑙+1)
  • 𝐲(𝑙+1) = 𝑓(𝐳(𝑙+1))

Feed-Forward Operation with Dropout

  • 𝐫(𝑙) ~ Bernoulli(𝑝)
  • ỹ(𝑙) = 𝐫(𝑙) ∗ 𝐲(𝑙)
  • 𝐳(𝑙+1) = 𝑊(𝑙+1) ỹ(𝑙) + 𝑏(𝑙+1)
  • 𝐲(𝑙+1) = 𝑓(𝐳(𝑙+1))

where:

  • 𝑓 is the activation function (e.g. sigmoid or ReLU)
  • 𝐫(𝑙) is a vector of independent Bernoulli random variables, each of which is 1 with probability 𝑝 and 0 otherwise
  • ỹ(𝑙) is the thinned output of layer 𝑙, obtained by element-wise multiplication with the mask 𝐫(𝑙)
  • ∗ denotes element-wise multiplication
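
As a direct translation of the two operations above, here is a minimal NumPy sketch; it is illustrative rather than code from the original source, and names such as layer_forward_dropout and the choice of sigmoid activation are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(y_prev, W, b, f=sigmoid):
    # Standard feed-forward operation: z = W y + b, then y = f(z).
    z = W @ y_prev + b
    return f(z)

def layer_forward_dropout(y_prev, W, b, f=sigmoid, p=0.5):
    # Feed-forward operation with dropout:
    # r ~ Bernoulli(p), y_tilde = r * y, z = W y_tilde + b, y = f(z).
    r = rng.binomial(1, p, size=y_prev.shape)   # independent Bernoulli mask for the previous layer
    y_tilde = r * y_prev                        # thinned outputs of the previous layer
    z = W @ y_tilde + b
    return f(z)

x = rng.normal(size=8)                          # y(0) = x, the network input
W1, b1 = rng.normal(size=(4, 8)), np.zeros(4)   # weights and biases of the first hidden layer
print(layer_forward(x, W1, b1))                 # deterministic output
print(layer_forward_dropout(x, W1, b1, p=0.5))  # stochastic output; varies per forward pass
```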

Dropout Effect on Features

Dropout breaks up co-adaptations between hidden units: because any unit may be dropped on a given training case, each unit must learn features that are useful on their own rather than only in combination with specific other units.

Dropout Effect on Sparsity

A side effect of dropout training is that hidden-unit activations tend to become sparser, even without a sparsity-inducing penalty. In a good sparse model, there should only be a few highly activated units for any data case, and the average activation of any unit across data cases should be low.
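
As a rough illustration of how these two properties can be measured, the snippet below computes both statistics for a matrix of hidden activations; it is a sketch, and the 0.9 threshold used to call a unit "highly activated" is an assumption rather than a value from the source:

```python
import numpy as np

def sparsity_stats(activations, threshold=0.9):
    # activations: array of shape (num_cases, num_units) with values in [0, 1].
    mean_activation = activations.mean(axis=0)          # per-unit average activation (should be low)
    frac_high = (activations > threshold).mean(axis=1)  # per-case fraction of highly activated units
    return mean_activation, frac_high.mean()            # the latter should be small in a sparse model
```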

Comparisons with Other Regularization Methods

  • Dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints, and sparse activity regularization
  • Dropout may also be combined with other forms of regularization to yield further improvement
  • Although dropout alone gives significant improvements, using dropout along with max-norm regularization, large decaying learning rates, and high momentum provides a further boost over dropout alone (see the sketch after this list).
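
The following sketch illustrates the max-norm constraint referred to above; the function name max_norm_project and the constant c = 3.0 are illustrative assumptions, not values from the source:

```python
import numpy as np

def max_norm_project(W, c=3.0):
    # Max-norm constraint: if the incoming weight vector of any unit (a row of W)
    # has L2 norm greater than c, scale it back so that its norm equals c.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

# Sketch of its place in a training loop with dropout:
#   W -= learning_rate * grad_W      # SGD step (large, decaying learning rate, high momentum)
#   W = max_norm_project(W, c=3.0)   # clamp each unit's incoming weight norm after the update
```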

Resources