GPipe: Google open-sources a library for training huge deep neural networks
Google’s AI research division yesterday open-sourced GPipe, a library for “efficiently” training large-scale neural network models.
For those unaware, GPipe is a scalable pipeline parallelism library that enables learning of giant deep neural networks. It partitions network layers across accelerators and pipelines execution to achieve high hardware utilization. It leverages recomputation to minimize activation memory usage.
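To make the recomputation idea concrete, here is a minimal sketch in plain Python. It is not GPipe's or Lingvo's API: the scalar tanh "layers", the function names, and the segment size are all illustrative assumptions. It keeps only the input of each segment during the forward pass and rebuilds the activations inside a segment when the backward pass reaches it, trading extra compute for lower memory.

```python
import math

def layer(w, x):
    """One toy layer: a scalar tanh unit (an assumption for illustration)."""
    return math.tanh(w * x)

def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def backprop_segment(seg_weights, x, grad):
    """Run a segment forward from its input, then pull grad back through it.

    Returns the updated gradient and the number of layer inputs held."""
    inputs = []
    for w in seg_weights:
        inputs.append(x)
        x = layer(w, x)
    for w, x_in in zip(reversed(seg_weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)

def grad_store_all(weights, x):
    """Baseline: hold every layer input until the backward pass."""
    return backprop_segment(weights, x, 1.0)

def grad_recompute(weights, x, segment):
    """Hold only segment inputs; recompute the rest during the backward pass."""
    checkpoints = []
    for start in range(0, len(weights), segment):
        checkpoints.append(x)
        for w in weights[start:start + segment]:
            x = layer(w, x)
    grad, peak = 1.0, 0
    for idx in reversed(range(len(checkpoints))):
        seg = weights[idx * segment:(idx + 1) * segment]
        grad, held = backprop_segment(seg, checkpoints[idx], grad)
        peak = max(peak, len(checkpoints) + held)
    return grad, peak

WEIGHTS = [0.5, -1.2, 0.8, 1.5, -0.7, 0.9, 1.1, -0.4,
           0.6, 1.3, -0.9, 0.7, 1.2, -0.5, 0.8, 1.0]
full_grad, full_stored = grad_store_all(WEIGHTS, 0.3)
ckpt_grad, ckpt_stored = grad_recompute(WEIGHTS, 0.3, segment=4)
print(full_stored, ckpt_stored)
```

In this toy setup both functions return the same gradient of the output with respect to the input, but the recomputing version holds 8 values at its peak instead of 16, at the cost of running each layer forward twice.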
The core GPipe library has been open-sourced under Lingvo, a TensorFlow framework for sequence modeling. GPipe can be applied to any network consisting of multiple sequential layers, and it lets researchers deploy more accelerators to train larger models and scale performance without tuning hyperparameters.
“Deep neural networks (DNNs) have advanced many machine learning tasks, including speech recognition, visual recognition, and language processing. [E]ver-larger DNN models lead to better task performance and past progress in visual recognition tasks has also shown a strong correlation between the model size and classification accuracy,” Google AI software engineer Yanping Huang said in a blog post. “[In] GPipe … we demonstrate the use of pipeline parallelism to scale up DNN training to overcome this limitation.”
In the accompanying paper, “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism,” Huang and his colleagues explain that GPipe applies two nifty AI training methods.
The first method is synchronous stochastic gradient descent, an optimization algorithm used to update a given AI model’s parameters. The second method is pipeline parallelism, a task execution system in which one step’s output is streamed as input to the next step.
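The streaming behavior of a pipeline can be sketched with a small schedule simulator. This is an idealized, forward-only model written for illustration, not code from GPipe: it assumes every stage takes one time step per item, and the function names are hypothetical. It shows which stage works on which item at each step and how much of the hardware is busy overall.

```python
def pipeline_schedule(num_stages, num_items):
    """Return one dict per time step mapping stage index -> item index."""
    steps = []
    for t in range(num_stages + num_items - 1):
        active = {}
        for stage in range(num_stages):
            item = t - stage  # item i reaches stage s at time step i + s
            if 0 <= item < num_items:
                active[stage] = item
        steps.append(active)
    return steps

def utilization(num_stages, num_items):
    """Fraction of stage-time slots that are doing useful work."""
    steps = pipeline_schedule(num_stages, num_items)
    busy = sum(len(active) for active in steps)
    return busy / (num_stages * len(steps))

if __name__ == "__main__":
    for t, active in enumerate(pipeline_schedule(4, 8)):
        print(t, active)
    for items in (1, 8, 32):
        print(items, round(utilization(4, items), 3))
```

Under these assumptions, a single item keeps four stages only 25% busy, because three of them always wait. Feeding more items through the same stages shrinks the idle ramp-up and ramp-down relative to the total run.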
Several of GPipe’s performance gains come from better memory allocation for AI models. On second-generation Google Cloud Tensor Processing Units (TPUs), each of which contains eight accelerator cores and 64GB of memory (8GB per core), GPipe reduced intermediate memory usage from 6.26GB to 3.46GB, allowing 318 million parameters on a single accelerator core. Without GPipe, Huang says, a single core can train only up to 82 million model parameters.
Beyond the memory savings, GPipe also partitions models across different accelerators and automatically splits a “mini-batch” of training examples into smaller “micro-batches.” By pipelining execution across the micro-batches, accelerators can operate in parallel. Further, gradients are consistently accumulated across micro-batches, so the number of partitions does not affect model quality.
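Why accumulating gradients across micro-batches leaves the result unchanged can be shown with a toy example. The sketch below is an illustration under stated assumptions, not GPipe code: it uses a one-parameter linear model with squared-error loss, made-up data, and hypothetical function names. The gradient sums from each micro-batch are added up and normalized once, so the single update applied at the end matches the one computed on the whole mini-batch.

```python
def example_grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

def minibatch_grad(w, batch):
    """Reference: mean gradient over the whole mini-batch at once."""
    return sum(example_grad(w, x, y) for x, y in batch) / len(batch)

def accumulated_grad(w, batch, num_micro):
    """Split the mini-batch into micro-batches and accumulate gradient sums."""
    size = len(batch) // num_micro
    if size * num_micro != len(batch):
        raise ValueError("mini-batch must split evenly into micro-batches")
    total = 0.0
    for i in range(num_micro):
        micro = batch[i * size:(i + 1) * size]
        total += sum(example_grad(w, x, y) for x, y in micro)
    return total / len(batch)  # normalize once over the full mini-batch

def sgd_step(w, batch, num_micro, lr=0.01):
    """One synchronous update, applied only after all micro-batches are done."""
    return w - lr * accumulated_grad(w, batch, num_micro)

# Made-up (x, y) pairs, roughly following y = 2x.
BATCH = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
         (0.5, 1.1), (1.5, 2.9), (2.5, 5.2), (3.5, 6.8)]

if __name__ == "__main__":
    for num_micro in (1, 2, 4, 8):
        print(num_micro, sgd_step(0.0, BATCH, num_micro))
```

Up to floating-point rounding, the update is the same whether the eight examples are processed as one batch or as eight micro-batches, which is the property that lets the partition count change without retuning.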
In one experiment, Google trained AmoebaNet, a deep learning model, with 557 million parameters on sample images using TPUs. With GPipe it was also able to incorporate 1.8 billion parameters on the 8 accelerators of a Cloud TPUv2, 25 times more than is possible without GPipe. The model obtained results competitive with state-of-the-art models and advanced visual recognition performance on multiple datasets, pushing single-crop ImageNet accuracy to 84.3%, CIFAR-10 accuracy to 99.0%, and CIFAR-100 accuracy to 91.3%.
Further, a separate experiment involving the AmoebaNet-D model showed that training speed had also improved: distributing the model across four times the number of second-generation TPU accelerators achieved a 3.5 times speedup. Google researchers recorded a speedup of 11 times when they tested Transformer language models with 8 billion parameters on third-generation TPUs (the newest available), each of which has 16 cores and 256GB of memory (16GB per core).
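One way to read the AmoebaNet-D figure is as a scaling efficiency: the measured speedup divided by the factor by which the hardware grew. The helper below is a hypothetical convenience function, not part of GPipe, and it uses only the numbers quoted above. The Transformer result is left out because the accelerator scale factor for that run is not given here.

```python
def scaling_efficiency(speedup, scale_factor):
    """Fraction of ideal linear speedup actually achieved."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return speedup / scale_factor

# 3.5 times speedup on four times the accelerators (figures from the article).
amoebanet_d = scaling_efficiency(speedup=3.5, scale_factor=4)
print(f"AmoebaNet-D: {amoebanet_d:.1%} of linear scaling")
```

By this measure the AmoebaNet-D run reaches 87.5% of perfectly linear scaling.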
“The ongoing development and success of many practical machine learning applications, such as autonomous driving and medical imaging, depend on achieving the highest accuracy possible,” Huang wrote. “As this often requires building larger and even more complex models, we are happy to provide GPipe to the broader research community, and hope it is a useful infrastructure for efficient training of large-scale DNNs.”