Dropout - Dilution - DropConnect (ANN)

Dropout Introduction

Deep neural networks are likely to quickly overfit a training dataset with few examples.

Ensembles of neural networks with different model configurations are known to reduce overfitting but require the additional computational expense of training and maintaining multiple models.

A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout and offers a very computationally cheap and remarkably effective regularization method to reduce overfitting and improve generalization in deep neural networks of all kinds.

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel.

In the simplest case, each unit is retained with a fixed probability 𝑝 independent of other units, where 𝑝 can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal probability of retention is usually closer to 1 than to 0.5.

Dropout is not used after training, when making predictions with the fitted network.

If a unit is retained with probability 𝑝 during training, the outgoing weights of that unit are multiplied by 𝑝 at test time.
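
To make this test-time rule concrete, here is a minimal NumPy sketch (illustrative only; names such as hidden_output_train and retain_prob are assumptions, and ReLU is an arbitrary choice of activation). It shows that averaging many dropped-out passes through a layer approaches the pass in which activations, and hence the outgoing weights, are scaled by 𝑝:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_output_train(x, W, b, retain_prob=0.5):
    # Training-time pass: each hidden unit is retained with probability retain_prob.
    h = np.maximum(0.0, W @ x + b)                     # ReLU activations (any nonlinearity works)
    mask = rng.binomial(1, retain_prob, size=h.shape)  # 1 = keep the unit, 0 = drop it
    return h * mask                                    # dropped units contribute nothing downstream

def hidden_output_test(x, W, b, retain_prob=0.5):
    # Test-time pass: nothing is dropped; scaling the activations by retain_prob is
    # equivalent to multiplying the unit's outgoing weights by retain_prob.
    h = np.maximum(0.0, W @ x + b)
    return h * retain_prob

# Averaging many dropped-out passes approaches the scaled test-time output.
x = rng.normal(size=10)
W = rng.normal(size=(5, 10))
b = np.zeros(5)
avg_train = np.mean([hidden_output_train(x, W, b) for _ in range(10_000)], axis=0)
print(np.allclose(avg_train, hidden_output_test(x, W, b), atol=0.1))  # expected: True
```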

Dropout Neural Network Model Description

Consider a neural network with 𝐿 hidden layers. Let:

  • 𝑙 ∈ {1, ..., 𝐿} index the hidden layers of the network
  • 𝐳(𝑙) denote the vector of inputs into layer 𝑙
  • 𝐲(𝑙) denote the vector of outputs from layer 𝑙 (𝐲(0) = 𝐱 is the input)
  • 𝑊(𝑙) and 𝑏(𝑙) denote the weights and biases at layer 𝑙
Standard Feed-Forward Operation

  • 𝐳(𝑙+1) = 𝑊(𝑙+1) 𝐲(𝑙) + 𝑏(𝑙+1)
  • 𝐲(𝑙+1) = 𝑓(𝐳(𝑙+1))

Feed-Forward Operation with Dropout

  • 𝐫(𝑙) ~ Bernoulli(𝑝)
  • ỹ(𝑙) = 𝐫(𝑙) ∗ 𝐲(𝑙)
  • 𝐳(𝑙+1) = 𝑊(𝑙+1) ỹ(𝑙) + 𝑏(𝑙+1)
  • 𝐲(𝑙+1) = 𝑓(𝐳(𝑙+1))

where:

  • 𝑓 is the activation function (e.g. sigmoid or ReLU)
  • 𝐫(𝑙) is a vector of independent Bernoulli random variables, each of which is 1 with probability 𝑝 and 0 otherwise
  • ỹ(𝑙) is the thinned output of layer 𝑙, obtained by element-wise multiplication with the mask 𝐫(𝑙)
  • ∗ denotes element-wise multiplication
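
As a direct translation of the two operations above, here is a minimal NumPy sketch; it is illustrative rather than code from the original source, and names such as layer_forward_dropout and the choice of sigmoid activation are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(y_prev, W, b, f=sigmoid):
    # Standard feed-forward operation: z = W y + b, then y = f(z).
    z = W @ y_prev + b
    return f(z)

def layer_forward_dropout(y_prev, W, b, f=sigmoid, p=0.5):
    # Feed-forward operation with dropout:
    # r ~ Bernoulli(p), y_tilde = r * y, z = W y_tilde + b, y = f(z).
    r = rng.binomial(1, p, size=y_prev.shape)   # independent Bernoulli mask for the previous layer
    y_tilde = r * y_prev                        # thinned outputs of the previous layer
    z = W @ y_tilde + b
    return f(z)

x = rng.normal(size=8)                          # y(0) = x, the network input
W1, b1 = rng.normal(size=(4, 8)), np.zeros(4)   # weights and biases of the first hidden layer
print(layer_forward(x, W1, b1))                 # deterministic output
print(layer_forward_dropout(x, W1, b1, p=0.5))  # stochastic output; varies per forward pass
```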

Dropout Effect on Features

Dropout breaks up co-adaptations between hidden units: because any unit may be dropped on a given training case, each unit must learn features that are useful on their own rather than only in combination with specific other units.

Dropout Effect on Sparsity

A side effect of dropout training is that hidden-unit activations tend to become sparser, even without a sparsity-inducing penalty. In a good sparse model, there should only be a few highly activated units for any data case, and the average activation of any unit across data cases should be low.
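
As a rough illustration of how these two properties can be measured, the snippet below computes both statistics for a matrix of hidden activations; it is a sketch, and the 0.9 threshold used to call a unit "highly activated" is an assumption rather than a value from the source:

```python
import numpy as np

def sparsity_stats(activations, threshold=0.9):
    # activations: array of shape (num_cases, num_units) with values in [0, 1].
    mean_activation = activations.mean(axis=0)          # per-unit average activation (should be low)
    frac_high = (activations > threshold).mean(axis=1)  # per-case fraction of highly activated units
    return mean_activation, frac_high.mean()            # the latter should be small in a sparse model
```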

Comparisons with Other Regularization Methods

  • Dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints, and sparse activity regularization
  • Dropout may also be combined with other forms of regularization to yield further improvement
  • Although dropout alone gives significant improvements, using dropout along with max-norm regularization, large decaying learning rates, and high momentum provides a further boost over dropout alone (see the sketch after this list).
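
The following sketch illustrates the max-norm constraint referred to above; the function name max_norm_project and the constant c = 3.0 are illustrative assumptions, not values from the source:

```python
import numpy as np

def max_norm_project(W, c=3.0):
    # Max-norm constraint: if the incoming weight vector of any unit (a row of W)
    # has L2 norm greater than c, scale it back so that its norm equals c.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

# Sketch of its place in a training loop with dropout:
#   W -= learning_rate * grad_W      # SGD step (large, decaying learning rate, high momentum)
#   W = max_norm_project(W, c=3.0)   # clamp each unit's incoming weight norm after the update
```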

Resources