Artificial intelligence, cats, and space.

For a while now I have been advocating for tuning ε in various parts of the modern deep learning stack, and in this post I’ll explain why.

What do I mean by ε?

The fifth letter in the Greek alphabet, ε is usually a nuisance parameter in many modern deep learning techniques. Often inserted in a denominator to avoid division by zero, it is a very small positive constant that is almost always left unchanged, or if it is changed then this fact is not widely appreciated.

This post was motivated by a recent Twitter thread where many people were (understandably) surprised at the large change in performance when changing ε from 10-7 to 10-5 in RMSProp in an A2C RL experiment, but this wasn't the first social media post to wonder about this often forgotten parameter.

A comparison of preconditioners, from [15].

In each the optimizers listed above, the preconditioner being applied is only an approximation to the true curvature of the loss surface we are optimizing (and are usually derived via second-order Taylor expansions around the current loss). As expected, the approximation will become less accurate the further away we move from the current value of the parameters, and if the approximation is very bad then we may not have to move very far for it to break down. Ideally, the preconditioning would be weaker when we are less confident in its accuracy; luckily, ε gives us a way to control this! Simply put, the larger ε is, the smaller the effect of the preconditioner on our optimizer's update. However, we want to also keep ε, sometimes referred to as damping in this context, as small as possible, because as mentioned above it will get rid of low curvature (small eigenvalue) directions in the preconditioning that are useful to consider when speeding up optimization.

One way to strike a balance between keeping ε small enough to keep low curavture directions around, but also large when we are not confident in our preconditioner approximation, is to change ε during optimization. One way to achieve this is via a heuristic used in the Levenberg-Marquardt algorithm. Put simply, we calculate how much our loss function changed between two updates, and compare this to how much a second-order Taylor expansion of the loss predicted it should change, and if the two are in agreement then we can have confidence in our preconditioner (note that we can use Hessian-vector products to compute the second order term in our Taylor expansion). Each time we calculate this, we increase or decrease the damping (ε) term by a multiplicative factor; see section 6 of the KFAC paper [13] for a more detailed explanation. In the beginning of training, our approximations to the loss surface curvature will likely be very bad: we have not yet warmed up our gradient exponential moving averages used in calculating it, and the loss surface characteristic are changing more rapidly then they will be when we are closer to the optimum, which we usually assume is a well behaved quadratic. Conversely, we want to have the preconditioner have access to as many low curvature directions as possible towards the end of optimization, so we can more accurately pinpoint the bottom of the optimum. While it is not guaranteed that ε will monotonically decrease throughout training with the scheme described above, in my experience on common benchmark problems (ResNets on CIFAR/ImageNet) this rule almost always decreases ε. Therefore I would actually recommend using an exponential decay schedule for ε instead of a fixed value, where the final value is extremely close to zero.

An analysis of the effect of damping strength in KFAC on an MNIST autoencoder, from [13].

One reason why the effects of ε may not have been noticed much in popular optimizers such as Adam and RMSProp is because they use a square root around their second moment estimates of the gradients, which can be viewed as a non-linear version of damping; once the numerics of dividing by zero have been solved by ε, the square root can handle diminishinig large dimensions and increasing small ones, which will have a similar normalizing effect on the eigenvalues. In fact, if you are feeling adventurous, tuning the exponent in Adam and RMSProp to be something other than 1/2 could possibly yield additional improvements.

Finally, for some experiments there exists an interesting relationship betweeen ε and weight decay, as thoroughly outlined in [14]. Put simply, under various conditions discussed in [14], the scale of the weights affects the scale of the gradients which affects our preconditioning matrix. Thus, the scale of the weights will affect the balance between our preconditioner matrix and our damping term ε, and so weight decay can interact with ε through this process. Therefore, it may be useful to jointly tune of ε and weight decay, beyond the effect that they both have on the effective learning rate of an update.

Batch Norm

Demonstrating the effect of ε in Batch Norm on model calibration, from [17].

Batch Normalization is an ever-present and ever-annoying improvement to many deep learning pipelines. It too has an ε parameter, which is used in the denominator of its normalization to avoid dividing by a zero-valued variance. One key thing to note is that Batch Norm is oftentimes applied channelwise to convolutional layers, and so by increasing the ε in the denominator we are effectively temperature scaling/smoothing out the differences in magnitude across activation channels. In addition to these potential normalization benefits which are similar to Local Response Normalization[16], this could also have an effect on the entropy of the model's output distribution. In many architectures, the Batch Norm normalized channel activations are averaged into a final vector that is put through a linear layer to obtain the model logits; therefore, if the activations going into the final average pooling layer are closer together in value, the final linear layer will have to explicitly learn to make them separate in value again. While this could happen, we did see in [17] that increasing ε in Batch Norm improved model calibration, as measured by Expected Calibration Error, especially on CIFAR10 (before destroying model accuracy.)

Implementation differences

It is also interesting to look at how different implementations of (supposedly) the same algorithms choose different default values for ε.

The TensorFlow documentation for Adam is the only one to mention that using default ε values may not always be the best choice: "The default value of 1e-7 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.". However, we should note that the default value is 10-7 for tf.keras.optimizers.Adam and 10-8 for tf.train.AdamOptimizer, PyTorch and Flax.

Regarding Batch Normalization, the TensorFlow layer uses a default ε = 10-3 whereas Flax and PyTorch both use default ε = 10-5. Note that often times, ImageNet/ResNet-50 pipelines will explicitly specify a Batch Norm ε = 10-5.

Another potential framework difference is whether or not epsilon is added before or after taking the square root of the diagonal Fisher. This has been previously pointed out for RMSProp in TF vs PyTorch, where PyTorch adds ε after the square root and TF adds it inside the square root, but it should also be noted that if momentum is not used in RMSProp in TF then ε is added outside the square root again! Hinton's course slides that introduced RMSProp did not specify anything about ε.

There also exist differences in ε defaults for AdaGrad, as originally pointed out in Appendix B of [11]; tf.keras.optimizers.Adagrad defaults to ε = 10-7, while tf.train.AdagradOptimizer defaults to ε = 0.1. In PyTorch the default ε is 10-10, and Flax uses a default ε = 10-8.

Previous works

It is actually more common to use non-default values than many realize (many references taken from our paper[1]), as seen in the following previous works:

• [2] considers ε = 10-3 and note that it can be viewed as an adaptivity parameter to diminish the effect of the velocity in Adam.
• [3] (which introduces RAdam) considers ε = 10-4 to reduce the variance of the adaptive learning rate of Adam.
• [4] uses ε = 10-6 in Adam for pretraining their RoBERTa models.
• [5] includes an analysis of Adam and RMSProp hyperparameters and finds that they can both be sensitive to ε.
• [6] (which introduces Inception-v2) uses an ε = 1.0 for RMSProp to train their models.
• [7] (which introduces the Rainbow RL technique) uses an ε = 1.5 x 10-4 for Adam to train their models.
• [8, 9] (MnasNet, EfficientNet) use ε = 10-3 in RMSProp but do not mention so in the papers, but can be found in their code.
• the original DQN code from DeepMind uses ε = 0.01 in RMSProp.
• [10] (learning rate grafting) sweeps over ε values in [0, 1] for AdaGrad and finds that it can have a noticeable impact on training performance.

An analysis of the effect of ε in AdaGrad on a Transformer, from [10].