Memory is one of the biggest challenges in deep neural networks (DNNs) today. Researchers are struggling with the limited memory bandwidth of the DRAM devices that have to be used by today’s systems to store the huge amounts of weights and activations in DNNs. DRAM capacity appears to be a limitation too. But these challenges are not quite as they seem.

Computer architectures have developed with processor chips specialised for serial processing and DRAMs optimised for high density memory. The interface between these two devices is a major bottleneck that introduces latency and bandwidth limitations and adds a considerable overhead in power consumption.

Although we do not yet have a complete understanding of human brains and how they work, it is generally understood that there is no large, separate memory store. The long- and short-term memory function in human brains is thought to be embedded in the neuron/synapse structure. Even simple organisms such as the C.Elgan worm, with a neural structure made up of just over 300 neurons, has some basic memory functions of this sort.

Building memory into conventional processors is one way of getting around the memory bottleneck problem by opening huge memory bandwidth at much lower power consumption. However, memory on-chip is area expensive and it wouldn’t be possible to add on the large amounts of memory currently attached to the CPU and GPU processors currently used to train and deploy DNNs.

Why do we need such large attached memory storage with CPU and GPU-powered deep learning systems when our brains appear to work well without it?


Memory in neural networks is required to store input data, weight parameters and activations as an input propagates through the network. In training, activations from a forward pass must be retained until they can be used to calculate the error gradients in the backwards pass. As an example, the 50-layer ResNet network has ~26 million weight parameters and computes ~16 million activations in the forward pass. If you use a 32-bit floating-point value to store each weight and activation this would give a total storage requirement of 168 MB. We could halve or even quarter this storage requirement by using a lower precision value to store these weights and activations

A greater memory challenge arises from GPUs’ reliance on data being laid out as dense vectors so they can fill very wide single instruction multiple data (SIMD) compute engines, which they use to achieve high compute density. CPUs use similar wide vector units to deliver high-performance arithmetic. In GPUs the vector paths are typically 1024 bits wide, so GPUs using 32-bit floating-point data typically parallelise the training data up into a mini-batch of 32 samples, to create 1024-bit-wide data vectors. This mini-batch approach to synthesizing vector parallelism multiplies the number of activations by a factor of 32, growing the local storage requirement to over 2 GB.

GPUs and other machines designed for matrix algebra also suffer another memory multiplier on either the weights and activations of a neural network. GPUs cannot efficiently execute directly the small convolutions used in deep neural networks. So a transformation called ‘lowering’ is used to convert those convolutions into matrix-matrix multiplications (GEMMs) which GPUs can execute efficiently. Lowering cures execution inefficiency, but at the cost of multiplying either the activation storage or the weight storage by the number of elements in the convolution mask, typically a factor of 9 (3×3 convolution masks). Finally, additional memory is also required to store the input data, temporary values and the program’s instructions. Measuring the memory use of ResNet-50 training with a mini-batch of 32 on a typical high performance GPU shows that it needs over 7.5 GB of local DRAM.

You might think that by using lower-precision compute you could reduce this large memory requirement, but that is not the case for a SIMD machine like a GPU. If you switch to half-precision data values for weights and activations, with a mini-batch of 32, you would only fill half of the SIMD vector width, wasting half of the available compute. To compensate, when you switch from full precision to half precision on a GPU, you also need to double the mini-batch size to induce enough data parallelism to use all the available compute. So switching to lower-precision weights and activations on a GPU still requires over 7.5 GB of local DRAM storage.

You cannot keep such  large amounts of storage  data on the GPU processor. In fact, many high performance GPU processors have only 1 KB of memory associated with each of the processor cores that can be read fast enough to saturate the floating-point datapath. This means that at each layer of the DNN, you need to save the state to external DRAM, load up the next layer of the network and then reload the data to the system. As a result, the already bandwidth and latency constrained off-chip memory interface suffers the additional burden of constantly reloading weights as well as saving and retrieving activations. This significantly slows down the training time while increasing power consumption.


Although large mini-batches improve computational efficiency by providing parallelism, research shows that large mini-batches lead to networks with a poorer ability to generalise and that take longer to train. Besides, machine learning model graphs already expose enormous parallelism. True graph machines such as Graphcore’s IPU don’t need large mini-batches for efficient execution, and they can execute convolutions without the memory bloat of lowering to GEMMs. So IPUs have a very much smaller memory footprint than GPUs, small enough to fit on the processing chip even for large networks. The efficiency and performance gains from doing this are huge.

Decades of work on compilers for sequential programming languages means there are several techniques to reduce memory further. First, operations such as activation functions can be performed ‘in-place’ allowing the input data to be overwritten directly by the output. In this way the memory state can be reused. Secondly, memory can be reused by analysing the data dependencies between operations in a network and allocating the same memory to operations that do not use it concurrently.

This second approach is particularly effective when the entire neural network can be analysed at compile-time to create a fixed allocation of memory, since the runtime overheads of memory management reduce to almost zero. The combination of these techniques has been shown to reduce memory in neural networks by a factor of two to three. These optimisation techniques on a parallel program are analogous to the dataflow analysis in a sequential program graph to allow the reuse of registers and stack memory, with their relatively higher efficiency compared to dynamic memory allocation routines.


  1. I determine to gather besides any person interested in gaining more involved in social concerns pertaining our specialty, contact me direct using my site that you think the similar.

  2. 喜多福,乾拌麵,拌麵,乾麵,呷什麵,關廟麵,手工麵,手作麵,沙茶咖哩,川味麻辣 牛肉風味,油蔥香菇,香蒜麻醬,紅油烏醋,麻辣麻醬,川味麻辣,團購美食,sidofu,日曬麵條,古法手工精製,日光自然曬乾,乾拌麵網路票選,拌麵網路票選 ,乾拌麵台灣票選 ,拌麵台灣票選,乾拌麵排名,拌麵排名,2018乾拌麵之王

  3. There is a bug, that you can withdraw promo money instant.

    This promo was given to the top players of 2018, but it works

    1) Register with promocode newyeartop and get 100$ on balance

    2) Do minimal deposit

    3) Withdraw all deposited money with 100$ ( they have instant withdraw)


    if you dont want to deposit, you can just play for that money

  4. Absolutely NEW update of captcha regignizing software “XRumer 16.0 + XEvil 4.0”:
    captcha recognition of Google (ReCaptcha-2 and ReCaptcha-3), Facebook, BitFinex, Bing, Hotmail, SolveMedia, Yandex,
    and more than 8400 another categories of captchas,
    with highest precision (80..100%) and highest speed (100 img per second).
    You can use XEvil 4.0 with any most popular SEO/SMM software: iMacros, XRumer, GSA SER, ZennoPoster, Srapebox, Senuke, and more than 100 of other programms.

    Interested? You can find a lot of demo videos about XEvil in YouTube.


    Good luck 😉

Leave a Reply

Your email address will not be published.