
NeurIPS 2022 | MIT & Meta Enable Gradient Descent Optimizers to Automatically Tune Their Own Hyperparameters

Most deep neural network training relies on gradient descent, but choosing a good step size for an optimizer is challenging: it typically requires tedious and error-prone manual tuning.

In the NeurIPS 2022 Outstanding Paper Gradient Descent: The Ultimate Optimizer, researchers at MIT CSAIL and Meta present a new technique that allows gradient descent optimizers such as SGD and Adam to tune their own hyperparameters automatically. The method requires no manual differentiation and can be stacked recursively to many levels.

The team addresses the limitations of previous approaches by computing hypergradients with automatic differentiation (AD), which offers three key benefits:

  1. AD automatically calculates correct derivatives without any extra human effort.
  2. It generalizes to other hyperparameters (e.g., the momentum coefficient) for free.
  3. AD can be applied to optimize not only the hyperparameters, but also the hyper-hyperparameters, the hyper-hyper-hyperparameters, and so on.

To enable the automatic calculation of hypergradients, the team first detaches the weights from the computation graph before running the next iteration of the gradient-descent algorithm, which turns the weights into graph leaves by removing their incoming edges. This prevents the computation graph from growing with each step, which would otherwise result in quadratic time complexity and intractable training.
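The growth argument can be illustrated with a toy graph structure (hypothetical classes in plain Python, not the paper's PyTorch code): without detaching, each SGD step appends new nodes whose history reaches all the way back to initialization, so backpropagating at step t touches O(t) nodes and training over T steps costs O(T²) overall.

```python
# Toy illustration of why weights are detached after each step.
class Node:
    """A value in a computation graph; parents are its incoming edges."""
    def __init__(self, parents=()):
        self.parents = list(parents)

    def detach(self):
        self.parents = []  # cut incoming edges: the node becomes a leaf

    def size(self):
        """Count nodes reachable backwards (what backprop must traverse)."""
        seen, stack = set(), [self]
        while stack:
            n = stack.pop()
            if id(n) not in seen:
                seen.add(id(n))
                stack.extend(n.parents)
        return len(seen)

def sgd_step(w, alpha, detach=True):
    grad = Node(parents=[w])                 # gradient depends on w
    w_next = Node(parents=[w, alpha, grad])  # update depends on w, alpha, grad
    if detach:
        w_next.detach()                      # stop the graph from growing
    return w_next

alpha = Node()
w = Node()
for _ in range(10):
    w = sgd_step(w, alpha, detach=True)
print(w.size())   # constant: the new weight is always a leaf

w2 = Node()
for _ in range(10):
    w2 = sgd_step(w2, alpha, detach=False)
print(w2.size())  # grows linearly with the number of steps taken
```

With detaching, the history reachable from the weights stays constant-size per step; without it, the reachable history grows linearly, yielding the quadratic total cost described above.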

The team also allows the backpropagation process to deposit gradients with respect to both the weights and the step size: rather than detaching the step size from the graph, they detach only its parents. This yields a fully automated hyperoptimization algorithm.
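A minimal scalar sketch of the resulting update, in plain Python rather than the paper's PyTorch autograd (the toy loss f(w) = w² and the rate names are illustrative assumptions): for the SGD update w' = w − α·f'(w), the chain rule gives the hypergradient ∂f(w')/∂α = −f'(w')·f'(w), so the step size α can itself be updated by gradient descent with its own rate κ.

```python
# Hypergradient SGD on the toy loss f(w) = w^2 (so f'(w) = 2w).
def grad(w):
    return 2.0 * w

def hyper_sgd_step(w, alpha, kappa):
    """One SGD step whose step size is also tuned by gradient descent."""
    g = grad(w)                    # f'(w)
    w_next = w - alpha * g         # ordinary SGD update
    # Chain rule: d f(w_next) / d alpha = f'(w_next) * (-f'(w))
    hypergrad = -grad(w_next) * g
    alpha_next = alpha - kappa * hypergrad
    return w_next, alpha_next

w, alpha = 5.0, 0.01               # deliberately poor (tiny) initial step size
for _ in range(100):
    w, alpha = hyper_sgd_step(w, alpha, kappa=1e-4)
print(w, alpha)                    # alpha grows away from its bad initial value
```

While the hypergradient for SGD can be derived by hand as above, the paper's point is that AD produces it automatically, including for optimizers like Adam where the derivation is far messier.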

Because the hypergradients are computed automatically via AD, the researchers can recursively feed a hyperoptimizer such as HyperSGD to itself as its own optimizer, obtaining a next-level hyperoptimizer, and so on for the hyper-hyperparameters, hyper-hyper-hyperparameters, etc. As these optimizer towers grow taller, they become less sensitive to the initial choice of hyperparameters.
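Structurally, such a tower can be sketched as optimizers that own an optimizer for their own step size (hypothetical classes, not the paper's implementation; the hypergradient is assumed to be supplied by AD, as in the paper):

```python
# Structural sketch of a hyperoptimizer tower.
class SGD:
    def __init__(self, alpha, optimizer=None):
        self.alpha = alpha
        self.optimizer = optimizer  # optimizer for alpha itself, or None

    def step(self, param, grad, hypergrad=None):
        if self.optimizer is not None and hypergrad is not None:
            # The level above treats alpha as its parameter and the
            # hypergradient (d loss / d alpha) as that parameter's gradient.
            # Its own step size stays fixed here (its hypergrad is None).
            self.alpha = self.optimizer.step(self.alpha, hypergrad)
        return param - self.alpha * grad

# A three-level tower: SGD tuned by SGD tuned by SGD.
tower = SGD(0.01, optimizer=SGD(0.001, optimizer=SGD(0.0001)))
w = tower.step(5.0, 10.0, hypergrad=2.0)  # alpha shrinks, then w is updated
```

Each level's hyperparameter only needs an initial value; the level above adjusts it during training, which is why taller towers depend less on those initial choices.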


In their empirical study, the team applied SGD as a hyperoptimizer on top of popular optimizers such as Adam, AdaGrad, and RMSProp. The results show that hyperoptimization significantly improves the base optimizers' performance.

This work introduces an efficient technique that enables gradient descent optimizers to automatically tune their own hyperparameters and to be stacked recursively to many levels. A PyTorch implementation of the paper's AD algorithm is available on the project GitHub.

The paper Gradient Descent: The Ultimate Optimizer is on OpenReview.

Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.


