Zack Nado

For a while now I have been advocating for tuning ε in various parts of the modern deep learning stack, and in this post I’ll explain why.

What do I mean by ε?

The fifth letter of the Greek alphabet, ε usually shows up as a nuisance parameter in modern deep learning techniques. Most often it is inserted in a denominator to avoid division by zero: a very small positive constant that is almost always left at its default value, and when it is changed, that change is rarely highlighted.

This post was motivated by a recent Twitter thread where many people were (understandably) surprised at the large change in performance when changing ε from 10⁻⁷ to 10⁻⁵ in RMSProp in an A2C RL experiment, but this wasn't the first social media post to wonder about this often forgotten parameter.

Adam, RMSProp, Ada*

In many preconditioned optimizers (Adam, RMSProp, AdaGrad, K-FAC, Shampoo, etc.), a multiple of the identity matrix is added to the preconditioning matrix before inverting it. In Adam and RMSProp (and the too-many variations of them that have recently come up, see Appendix A of [12]), this preconditioning matrix is (approximately) the diagonal of the empirical Fisher matrix (see [15] for a great analysis of the empirical vs. true Fisher), so this ε hyperparameter is simply added elementwise to the second moment estimates. Oftentimes this is done to avoid numerical issues when inverting (dividing by) the preconditioner. This works because the minimum eigenvalue of (P + εI) is the minimum eigenvalue of P plus ε, so as long as ε is larger than any negative eigenvalues of P we will have a positive definite matrix which we can invert (the same logic applies with diagonal preconditioners, just with division by zero instead of matrix inversion). See [19] for a discussion of the numerical challenges involved in preconditioned optimization. In addition to this very useful feature, ε also serves another purpose as part of a trust region radius on the optimizer updates (the actual trust region radius depends on both ε and our preconditioner).
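
As a concrete illustration of the eigenvalue argument above, here is a minimal NumPy sketch (my own toy example, not any library's implementation) showing that adding εI shifts every eigenvalue of a symmetric preconditioner up by ε, which is what makes the damped matrix safe to invert, and that the diagonal case is the same trick applied to a division:

```python
import numpy as np

# A symmetric "preconditioner" with a slightly negative eigenvalue, as can
# happen with noisy curvature estimates.
P = np.array([[2.0, 0.0],
              [0.0, -1e-4]])
eps = 1e-3

print(np.linalg.eigvalsh(P))                    # [-1e-4, 2.0]: not positive definite
print(np.linalg.eigvalsh(P + eps * np.eye(2)))  # every eigenvalue shifted up by eps

# The diagonal-preconditioner case is the same idea with division instead of
# matrix inversion: eps keeps coordinates with zero second moment finite.
v = np.array([4.0, 0.0])   # second moment estimates (one is exactly zero)
g = np.array([1.0, 1.0])   # gradient
print(g / (np.sqrt(v) + eps))
```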

A comparison of preconditioners, from [15].

In each of the optimizers listed above, the preconditioner being applied is only an approximation to the true curvature of the loss surface we are optimizing (usually derived via a second-order Taylor expansion of the loss around the current parameters). As expected, the approximation becomes less accurate the further away we move from the current value of the parameters, and if the approximation is very bad then we may not have to move very far for it to break down. Ideally, the preconditioning would be weaker when we are less confident in its accuracy; luckily, ε gives us a way to control this! Simply put, the larger ε is, the smaller the effect of the preconditioner on the optimizer's update. However, we also want to keep ε, sometimes referred to as damping in this context, as small as possible, because as mentioned above it washes out the low curvature (small eigenvalue) directions in the preconditioner that are useful for speeding up optimization.
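
To make the "larger ε means weaker preconditioning" point concrete, here is a small sketch (a toy diagonal preconditioner with made-up numbers, not any specific optimizer's code) showing how the spread of per-coordinate step sizes collapses as ε grows:

```python
import numpy as np

v = np.array([1e-8, 1e-4, 1e0])  # second moments: low- to high-curvature directions
g = np.ones_like(v)              # unit gradient in every coordinate

for eps in [1e-8, 1e-4, 1e-1, 1e1]:
    step = g / (np.sqrt(v) + eps)
    # With tiny eps, low-curvature (small v) coordinates get steps orders of
    # magnitude larger than high-curvature ones; as eps grows, the relative
    # step sizes compress toward 1 and the update looks more like plain SGD.
    print(eps, step / step.max())
```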

One way to strike a balance between keeping ε small enough to retain low curvature directions, while making it large when we are not confident in our preconditioner approximation, is to change ε during optimization. One way to achieve this is via a heuristic used in the Levenberg-Marquardt algorithm. Put simply, we calculate how much our loss function changed between two updates and compare this to how much a second-order Taylor expansion of the loss predicted it should change; if the two are in agreement then we can have confidence in our preconditioner (note that we can use Hessian-vector products to compute the second-order term in the Taylor expansion). Each time we calculate this, we increase or decrease the damping (ε) term by a multiplicative factor; see section 6 of the K-FAC paper [13] for a more detailed explanation. At the beginning of training, our approximations to the loss surface curvature will likely be very bad: we have not yet warmed up the gradient exponential moving averages used to calculate them, and the loss surface characteristics are changing more rapidly than they will be when we are closer to the optimum, which we usually assume is a well-behaved quadratic. Conversely, we want the preconditioner to have access to as many low curvature directions as possible towards the end of optimization, so we can more accurately pinpoint the bottom of the optimum. While it is not guaranteed that ε will monotonically decrease throughout training with the scheme described above, in my experience on common benchmark problems (ResNets on CIFAR/ImageNet) this rule almost always decreases ε. Therefore I would actually recommend using an exponential decay schedule for ε instead of a fixed value, where the final value is extremely close to zero.
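
Below is a sketch of that Levenberg-Marquardt-style rule; the thresholds and multiplicative factor are illustrative placeholders (section 6 of [13] gives the exact schedule K-FAC uses):

```python
def lm_adjust_damping(damping, actual_reduction, predicted_reduction,
                      factor=1.5, low=0.25, high=0.75):
    """Adjust the damping (eps) based on how well the quadratic model predicted
    the change in loss. The constants here are illustrative, not from [13]."""
    rho = actual_reduction / predicted_reduction  # ~1 means the model was accurate
    if rho > high:
        damping /= factor   # model is trustworthy: damp less, trust the curvature
    elif rho < low:
        damping *= factor   # model is poor: damp more, shrink the trust region
    return damping

# The quadratic model predicted a loss decrease of 0.10 but the loss only
# dropped by 0.01, so the damping is increased.
print(lm_adjust_damping(1e-3, actual_reduction=0.01, predicted_reduction=0.10))
```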

An analysis of the effect of damping strength in K-FAC on an MNIST autoencoder, from [13].

One reason why the effects of ε may not have been noticed much in popular optimizers such as Adam and RMSProp is that they take a square root of their second moment estimates of the gradients, which can be viewed as a non-linear form of damping; once the numerics of dividing by zero have been handled by ε, the square root takes care of diminishing large dimensions and boosting small ones, which has a similar normalizing effect on the eigenvalues. In fact, if you are feeling adventurous, tuning the exponent in Adam and RMSProp to be something other than 1/2 could possibly yield additional improvements.
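
If you do want to experiment with that exponent, the change is a one-liner; the function below is a hypothetical generalization of an RMSProp-style update (the name and the parameter p are mine, not a standard API):

```python
import numpy as np

def generalized_rmsprop_step(w, g, v, lr=1e-3, decay=0.9, eps=1e-7, p=0.5):
    """RMSProp-style update where the usual square root (p = 0.5) is replaced
    by a tunable exponent p on the second moment estimate."""
    v = decay * v + (1.0 - decay) * g ** 2   # second moment EMA
    w = w - lr * g / (v ** p + eps)          # p = 0.5 recovers standard RMSProp
    return w, v

# p = 0 effectively turns the preconditioner off (plain SGD up to eps), while
# p = 1 preconditions with the full diagonal estimate.
w, v = np.zeros(3), np.zeros(3)
w, v = generalized_rmsprop_step(w, g=np.array([0.1, -0.2, 0.3]), v=v, p=0.6)
print(w)
```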

Finally, for some experiments there exists an interesting relationship between ε and weight decay, as thoroughly outlined in [14]. Put simply, under various conditions discussed in [14], the scale of the weights affects the scale of the gradients, which in turn affects our preconditioning matrix. Thus, the scale of the weights affects the balance between the preconditioner matrix and the damping term ε, and so weight decay can interact with ε through this process. It may therefore be useful to jointly tune ε and weight decay, beyond the effect that they both have on the effective learning rate of an update.
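
As a toy illustration of that coupling, assume a layer whose output is invariant to the scale of its weights (e.g. one followed by a normalization layer), so that scaling the weights by c scales its gradients by 1/c; this is a simplification of the conditions analyzed in [14], not a reproduction of them:

```python
import numpy as np

eps = 1e-7
g_base = 1e-3   # gradient magnitude at the original weight scale

for c in [1.0, 10.0, 100.0, 1000.0]:
    g = g_base / c            # gradients shrink as the weights grow
    v = g ** 2                # so the second moment estimate shrinks as well
    step = g / (np.sqrt(v) + eps)
    # Without eps this preconditioned step would be exactly 1 for every c; with
    # a fixed eps it shrinks as the weights grow, so weight decay (which
    # controls the weight scale) ends up interacting with eps.
    print(c, step)
```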

Batch Norm

Demonstrating the effect of ε in Batch Norm on model calibration, from [17].

Batch Normalization is an ever-present and ever-annoying improvement to many deep learning pipelines. It too has an ε parameter, which is used in the denominator of its normalization to avoid dividing by a zero-valued variance. One key thing to note is that Batch Norm is oftentimes applied channelwise to convolutional layers, so by increasing the ε in the denominator we are effectively temperature scaling/smoothing out the differences in magnitude across activation channels. In addition to these potential normalization benefits, which are similar to Local Response Normalization [16], this can also have an effect on the entropy of the model's output distribution. In many architectures, the Batch Norm normalized channel activations are averaged into a final vector that is put through a linear layer to obtain the model logits; therefore, if the activations going into the final average pooling layer are closer together in value, the final linear layer will have to explicitly learn to separate them again. While this could happen, we did see in [17] that increasing ε in Batch Norm improved model calibration, as measured by Expected Calibration Error, especially on CIFAR10 (before eventually destroying model accuracy).
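
Here is a toy sketch of that smoothing effect (ignoring Batch Norm's learned scale/shift and running statistics, with made-up channel variances):

```python
import numpy as np

rng = np.random.default_rng(0)
# A batch of 256 activations for 3 channels with very different variances.
x = rng.normal(size=(256, 3)) * np.array([1e-1, 1.0, 10.0])

def batch_norm(x, eps):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

for eps in [1e-5, 1e-1, 1e1]:
    y = batch_norm(x, eps)
    # With a tiny eps every channel comes out with roughly unit std; once eps is
    # comparable to a channel's variance that channel is shrunk toward zero,
    # lowering the magnitude of the features fed to the final layers (and hence
    # acting like a temperature on the logits).
    print(eps, y.std(axis=0).round(3))
```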

Implementation differences

It is also interesting to look at how different implementations of (supposedly) the same algorithms choose different default values for ε.

The TensorFlow documentation for Adam is the only one to mention that using the default ε value may not always be the best choice: "The default value of 1e-7 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1." However, we should note that the default value is 10⁻⁷ for tf.keras.optimizers.Adam and 10⁻⁸ for tf.train.AdamOptimizer, PyTorch, and Flax.

Regarding Batch Normalization, the TensorFlow layer uses a default ε = 10⁻³ whereas Flax and PyTorch both use a default ε = 10⁻⁵. Note that oftentimes, ImageNet/ResNet-50 pipelines will explicitly specify a Batch Norm ε = 10⁻⁵.

Another potential framework difference is whether ε is added before or after taking the square root of the diagonal Fisher estimate. This has been previously pointed out for RMSProp in TF vs. PyTorch, where PyTorch adds ε after the square root and TF adds it inside the square root; it should also be noted that if momentum is not used in RMSProp in TF, then ε is added outside the square root again! Hinton's course slides that introduced RMSProp did not specify anything about ε.
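
The difference between the two placements is easy to see in a couple of lines (a sketch of the two conventions, not a reproduction of either framework's optimizer code):

```python
import numpy as np

v = np.array([0.0, 1e-8, 1e-2])   # second moment estimates
g = np.ones_like(v)
eps = 1e-7

step_outside = g / (np.sqrt(v) + eps)   # eps added after the square root
step_inside = g / np.sqrt(v + eps)      # eps added inside the square root

# The conventions differ most when v is comparable to or smaller than eps:
# sqrt(v + eps) is at least sqrt(eps) ~ 3e-4 here, while sqrt(v) + eps can be
# as small as eps = 1e-7, so the same nominal eps gives very different steps.
print(step_outside)  # roughly [1e7, 1e4, 10]
print(step_inside)   # roughly [3.2e3, 3.0e3, 10]
```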

There also exist differences in ε defaults for AdaGrad, as originally pointed out in Appendix B of [11]; tf.keras.optimizers.Adagrad defaults to ε = 10⁻⁷, while tf.train.AdagradOptimizer defaults to ε = 0.1. In PyTorch the default ε is 10⁻¹⁰, and Flax uses a default ε = 10⁻⁸.

Previous works

It is actually more common to use non-default values than many realize (many references taken from our paper [1]), as seen in the following previous works:

An analysis of the effect of ε in AdaGrad on a Transformer, from [10].

Parting advice

As models are getting larger and larger, hyperparameter tuning is getting more and more expensive. While I agree with a lot of the existing literature [18] on tuning the learning rate and learning rate schedule first, I also believe that, if you can afford it, trying one or two values of ε that are a few orders of magnitude larger than the default can be useful when tuning optimizers. For example, I would try both the default value and ε = 10⁻³ for Adam or RMSProp in your searches. If possible, an exponential decay from the initial value to some value very, very close to zero could provide even more benefits!

If you can afford even more trials, I would also recommend tuning ε on a log scale, either via a grid search or by sampling values. Even better, jointly tuning the learning rate and/or weight decay with ε, as we did in [1], can help take into account the ways in which these hyperparameters affect one another.
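
For what such a search might look like, here is a minimal sketch of sampling ε jointly with the learning rate and weight decay on a log scale; the ranges are illustrative placeholders, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trial():
    # Sample each hyperparameter uniformly in log space (illustrative ranges).
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "epsilon": 10 ** rng.uniform(-8, -1),    # spans many orders of magnitude
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }

# Jointly sampling all three lets the search capture their interactions, e.g.
# the coupling between eps and weight decay discussed above.
for trial in [sample_trial() for _ in range(5)]:
    print(trial)
```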

While not as impactful, one can still improve model performance by tuning the Batch Norm ε; namely, increasing it can lead to slightly improved model calibration (although I would recommend many other calibration improvement techniques first).

In the end, I believe all of this evidence goes to show that optimizer hyperparameter tuning can have dramatic effects on model performance, and so it should always be done with care and deliberation (and more importantly, always detailed in any resulting papers!)

References