Memory is one of the biggest challenges in deep neural networks (DNNs) today. Researchers are struggling with the limited memory bandwidth of the DRAM devices that have to be used by today’s systems to store the huge amounts of weights and activations in DNNs. DRAM capacity appears to be a limitation too. But these challenges are not quite as they seem.
Computer architectures have developed with processor chips specialised for serial processing and DRAMs optimised for high density memory. The interface between these two devices is a major bottleneck that introduces latency and bandwidth limitations and adds a considerable overhead in power consumption.
Although we do not yet have a complete understanding of human brains and how they work, it is generally understood that there is no large, separate memory store. The long- and short-term memory function in human brains is thought to be embedded in the neuron/synapse structure. Even simple organisms such as the C. elegans worm, with a neural structure made up of just over 300 neurons, have some basic memory functions of this sort.
Building memory into conventional processors is one way of getting around the memory bottleneck, opening up huge memory bandwidth at much lower power consumption. However, on-chip memory is expensive in silicon area, and it would not be possible to put on-chip the large amounts of memory attached to the CPU and GPU processors currently used to train and deploy DNNs.
Why do we need such large attached memory storage with CPU and GPU-powered deep learning systems when our brains appear to work well without it?
WHY DO DEEP NEURAL NETWORKS NEED SO MUCH MEMORY?
Memory in neural networks is required to store input data, weight parameters and activations as an input propagates through the network. In training, activations from a forward pass must be retained until they can be used to calculate the error gradients in the backward pass. As an example, the 50-layer ResNet network has ~26 million weight parameters and computes ~16 million activations in the forward pass. If you use a 32-bit floating-point value to store each weight and activation, this gives a total storage requirement of 168 MB (42 million values at 4 bytes each). We could halve or even quarter this storage requirement by using a lower-precision value to store these weights and activations.
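The arithmetic behind the 168 MB figure can be sketched in a few lines. This is only a back-of-envelope check using the approximate counts quoted above; the function name `storage_bytes` is a hypothetical helper, and "MB" is taken to mean 10^6 bytes.

```python
# Rough storage estimate for one ResNet-50 forward pass (single sample).
# Counts are the approximate figures quoted in the text, not measurements.
WEIGHTS = 26_000_000        # ~26 million weight parameters
ACTIVATIONS = 16_000_000    # ~16 million activations per forward pass


def storage_bytes(num_weights, num_activations, bytes_per_value):
    """Bytes needed to hold every weight and activation at one precision."""
    return (num_weights + num_activations) * bytes_per_value


fp32_bytes = storage_bytes(WEIGHTS, ACTIVATIONS, 4)  # 32-bit floats
fp16_bytes = storage_bytes(WEIGHTS, ACTIVATIONS, 2)  # half the storage
int8_bytes = storage_bytes(WEIGHTS, ACTIVATIONS, 1)  # a quarter of the storage

for label, size in [("fp32", fp32_bytes), ("fp16", fp16_bytes), ("int8", int8_bytes)]:
    print(f"{label}: {size / 1e6:.0f} MB")
```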
A greater memory challenge arises from GPUs’ reliance on data being laid out as dense vectors so they can fill very wide single instruction multiple data (SIMD) compute engines, which they use to achieve high compute density. CPUs use similar wide vector units to deliver high-performance arithmetic. In GPUs the vector paths are typically 1024 bits wide, so GPUs using 32-bit floating-point data typically group the training data into mini-batches of 32 samples to create 1024-bit-wide data vectors. This mini-batch approach to synthesising vector parallelism multiplies the number of activations by a factor of 32, growing the local storage requirement to over 2 GB.
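As a rough illustration of where "over 2 GB" comes from, the sketch below derives the mini-batch size from the vector width and scales the activation count accordingly. It assumes the approximate ResNet-50 counts quoted earlier and that the weights are stored once rather than per sample; all constant names are hypothetical.

```python
# Mini-batch size needed to fill a SIMD vector, and the resulting storage.
VECTOR_BITS = 1024                  # typical GPU vector path width (from the text)
VALUE_BITS = 32                     # 32-bit floating point
WEIGHTS = 26_000_000                # ~26 million weights (stored once)
ACTIVATIONS_PER_SAMPLE = 16_000_000 # ~16 million activations per sample

mini_batch = VECTOR_BITS // VALUE_BITS   # samples needed to fill one vector
bytes_per_value = VALUE_BITS // 8

activation_bytes = ACTIVATIONS_PER_SAMPLE * mini_batch * bytes_per_value
weight_bytes = WEIGHTS * bytes_per_value
total_bytes = activation_bytes + weight_bytes

print(f"mini-batch size: {mini_batch}")
print(f"activations:     {activation_bytes / 1e9:.3f} GB")
print(f"weights:         {weight_bytes / 1e9:.3f} GB")
print(f"total:           {total_bytes / 1e9:.3f} GB")
```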
GPUs and other machines designed for matrix algebra also suffer another memory multiplier on either the weights or the activations of a neural network. GPUs cannot directly and efficiently execute the small convolutions used in deep neural networks, so a transformation called ‘lowering’ is used to convert those convolutions into matrix-matrix multiplications (GEMMs), which GPUs can execute efficiently. Lowering cures execution inefficiency, but at the cost of multiplying either the activation storage or the weight storage by the number of elements in the convolution mask, typically a factor of 9 (3×3 convolution masks). Finally, additional memory is also required to store the input data, temporary values and the program’s instructions. Measuring the memory use of ResNet-50 training with a mini-batch of 32 on a typical high performance GPU shows that it needs over 7.5 GB of local DRAM.
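To make the factor of 9 concrete, here is a minimal pure-Python sketch of lowering applied to the activations (often called im2col). It assumes a single-channel 2-D input, zero padding and stride 1, which is far simpler than a real framework's implementation; `lower_im2col` and `conv_via_gemm` are hypothetical names chosen for illustration.

```python
def lower_im2col(image, k):
    """Lower a 2-D image into a patch matrix: one row per output pixel,
    k*k columns. Each input value is copied into up to k*k rows."""
    h, w = len(image), len(image[0])
    pad = k // 2
    # Zero-pad so the output has the same spatial size as the input.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]
    rows = []
    for r in range(h):
        for c in range(w):
            rows.append([padded[r + dr][c + dc]
                         for dr in range(k) for dc in range(k)])
    return rows


def conv_via_gemm(image, kernel):
    """Apply a k x k convolution mask as a matrix-vector product over
    the lowered patch matrix (the GEMM form, with one output channel)."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    patches = lower_im2col(image, k)
    flat_out = [sum(p * q for p, q in zip(patch, flat_kernel))
                for patch in patches]
    w = len(image[0])
    return [flat_out[i:i + w] for i in range(0, len(flat_out), w)]


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 16 values
patches = lower_im2col(image, 3)
lowered_values = sum(len(row) for row in patches)
print(f"input values: {16}, lowered values: {lowered_values}")
```

The lowered matrix holds nine copies' worth of the input, which is the storage multiplier described above, in exchange for turning the convolution into a dense product.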
You might think that by using lower-precision compute you could reduce this large memory requirement, but that is not the case for a SIMD machine like a GPU. If you switch to half-precision data values for weights and activations, with a mini-batch of 32, you would only fill half of the SIMD vector width, wasting half of the available compute. To compensate, when you switch from full precision to half precision on a GPU, you also need to double the mini-batch size to induce enough data parallelism to use all the available compute. So switching to lower-precision weights and activations on a GPU still requires over 7.5 GB of local DRAM storage.
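The cancellation can be checked with a small calculation. The sketch below assumes the 1024-bit vector width and the ~16 million activations per sample quoted earlier, and considers activation storage only; it does not attempt to reproduce the measured 7.5 GB figure, which includes lowering and other overheads. The helper name `activation_footprint` is hypothetical.

```python
# Activation storage when the mini-batch is sized to fill the SIMD vector.
VECTOR_BITS = 1024
ACTIVATIONS_PER_SAMPLE = 16_000_000


def activation_footprint(bits_per_value):
    """Return (mini_batch, bytes) when the batch exactly fills the vector."""
    mini_batch = VECTOR_BITS // bits_per_value
    total_bits = ACTIVATIONS_PER_SAMPLE * mini_batch * bits_per_value
    return mini_batch, total_bits // 8


for bits in (32, 16, 8):
    batch, size = activation_footprint(bits)
    print(f"{bits:>2}-bit values: mini-batch {batch:>3}, "
          f"activations {size / 1e9:.3f} GB")
```

Because the mini-batch size is the vector width divided by the value width, the product of batch size and value width is constant, so the activation footprint does not shrink as precision drops.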
You cannot keep such large amounts of data on the GPU processor. In fact, many high performance GPU processors have only 1 KB of memory associated with each of the processor cores that can be read fast enough to saturate the floating-point datapath. This means that at each layer of the DNN, you need to save the state to external DRAM, load the next layer of the network and then reload the data onto the processor. As a result, the already bandwidth- and latency-constrained off-chip memory interface suffers the additional burden of constantly reloading weights as well as saving and retrieving activations. This significantly slows training while increasing power consumption.
THREE APPROACHES FOR MEMORY-SAVING TECHNIQUES
Although large mini-batches improve computational efficiency by providing parallelism, research shows that large mini-batches lead to networks with a poorer ability to generalise and that take longer to train. Besides, machine learning model graphs already expose enormous parallelism. True graph machines such as Graphcore’s IPU don’t need large mini-batches for efficient execution, and they can execute convolutions without the memory bloat of lowering to GEMMs. So IPUs have a very much smaller memory footprint than GPUs, small enough to fit on the processing chip even for large networks. The efficiency and performance gains from doing this are huge.
Decades of work on compilers for sequential programming languages mean there are several techniques to reduce memory further. First, operations such as activation functions can be performed ‘in-place’, allowing the input data to be overwritten directly by the output. In this way the memory can be reused. Second, memory can be reused by analysing the data dependencies between operations in a network and allocating the same memory to operations that do not use it concurrently.
This second approach is particularly effective when the entire neural network can be analysed at compile-time to create a fixed allocation of memory, since the runtime overheads of memory management reduce to almost zero. The combination of these techniques has been shown to reduce memory in neural networks by a factor of two to three. Applied to a parallel program, these optimisations are analogous to the dataflow analysis of a sequential program graph that allows registers and stack memory to be reused, both of which are more efficient than dynamic memory allocation routines.
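A minimal sketch of the compile-time idea follows, under several assumptions. Each tensor is described by a hypothetical (name, size, first use, last use) tuple derived from the network graph. Tensors whose lifetimes do not overlap may share a buffer. A simple greedy first-fit policy stands in for whatever allocator a real compiler would use. The example is a four-layer inference-style chain in which each activation is needed only by the next layer; in training, activations must live until the backward pass, so the savings would differ.

```python
def plan_buffers(tensors):
    """Assign tensors to shared buffers from their lifetimes.

    tensors: list of (name, size, first_use, last_use), steps inclusive.
    Returns (assignment, total_size), where assignment maps each tensor
    name to a buffer index. Greedy first-fit; not guaranteed optimal.
    """
    buffers = []      # each entry: {"size": int, "free_after": int}
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for index, buf in enumerate(buffers):
            # Reuse a buffer only if its previous tenant is dead before
            # this tensor is first written, and it is large enough.
            if buf["free_after"] < first and buf["size"] >= size:
                buf["free_after"] = last
                assignment[name] = index
                break
        else:
            buffers.append({"size": size, "free_after": last})
            assignment[name] = len(buffers) - 1
    return assignment, sum(buf["size"] for buf in buffers)


# Layer i writes activation a_i at step i; layer i+1 reads it at step i+1.
chain = [
    ("a0", 100, 0, 1),
    ("a1", 100, 1, 2),
    ("a2", 100, 2, 3),
    ("a3", 100, 3, 4),
]

assignment, planned = plan_buffers(chain)
naive = sum(size for _, size, _, _ in chain)
print(f"naive: {naive} units, planned: {planned} units")
print(assignment)
```

Because the plan is computed once from the graph, the runtime only needs fixed offsets into preallocated memory, which is where the near-zero management overhead comes from.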