Memory is a limiting resource for many deep learning tasks. Besides the neural network weights, one main memory consumer is the computation graph built up by automatic differentiation (AD) for backpropagation. We observe that PyTorch’s current AD implementation sometimes neglects information about parameter differentiability when storing the computation graph. This information, however, is useful for reducing memory whenever gradients are requested only for a subset of parameters, as is the case in many modern fine-tuning tasks. Specifically, inputs to layers that act linearly in their parameters and inputs (fully-connected, convolution, or batch normalization layers in evaluation mode) can be discarded whenever the parameters are marked as non-differentiable. We provide a drop-in, differentiability-agnostic implementation of such layers and demonstrate its ability to reduce memory without affecting run time on popular convolution- and attention-based architectures.
Saving Memory in CNNs (with PyTorch)
CNNs are mainly made of convolution layers, along with activations such as ReLU and normalization/pooling layers such as batch normalization and max pooling.
Motivation
In PyTorch, these layers are implemented by `torch.nn.Conv2d`, `torch.nn.ReLU`, `torch.nn.BatchNorm2d`, and `torch.nn.MaxPool2d`, and they give rise to very fast code that calls native C++ and CUDA kernels at the lowest level. This holds for the forward pass as well as the backward pass. However, memory efficiency is sometimes traded off for time efficiency by storing tensors that the backward pass might not even need. As a result, consumer-level GPUs often do not have enough VRAM to run these tasks with a decent batch size. Here, we try to make that possible by implementing our own memory-saving layers without giving up time efficiency. These can be found in the `memsave` module.
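To illustrate the idea, here is a simplified sketch of such a layer, not the code in the `memsave` module: the names `_ConditionalConv2dFn` and `MemSaveConv2d` are made up for this example, and dilation, groups, and non-zero padding modes are ignored. The layer decides at forward time which tensors to keep, based on which gradients its backward pass will actually be asked for.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.nn.grad import conv2d_input, conv2d_weight


class _ConditionalConv2dFn(torch.autograd.Function):
    """2d convolution that only stores what its backward pass will need."""

    @staticmethod
    def forward(ctx, x, weight, bias, stride, padding, save_x, save_w):
        # x is only needed for the weight gradient, the weight only for
        # the input gradient -- save each of them conditionally.
        ctx.save_for_backward(
            x if save_x else None,
            weight if save_w else None,
        )
        ctx.x_shape, ctx.w_shape = x.shape, weight.shape
        ctx.stride, ctx.padding = stride, padding
        return F.conv2d(x, weight, bias, stride, padding)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_w = grad_b = None
        if ctx.needs_input_grad[0]:  # gradient w.r.t. the layer input
            grad_x = conv2d_input(
                ctx.x_shape, weight, grad_out, ctx.stride, ctx.padding
            )
        if ctx.needs_input_grad[1]:  # gradient w.r.t. the weight
            grad_w = conv2d_weight(
                x, ctx.w_shape, grad_out, ctx.stride, ctx.padding
            )
        if ctx.needs_input_grad[2]:  # gradient w.r.t. the bias
            grad_b = grad_out.sum(dim=(0, 2, 3))
        # No gradients for stride, padding, and the two boolean flags.
        return grad_x, grad_w, grad_b, None, None, None, None


class MemSaveConv2d(nn.Conv2d):
    """Drop-in nn.Conv2d replacement that respects parameter differentiability."""

    def forward(self, x):
        return _ConditionalConv2dFn.apply(
            x, self.weight, self.bias, self.stride, self.padding,
            self.weight.requires_grad,  # keep x only if the weight is trainable
            x.requires_grad,            # keep the weight only if x needs a gradient
        )
```

Constructed with the same arguments as the `nn.Conv2d` it replaces, such a layer produces the same outputs and gradients, but it only stores its input when the weight gradient will actually be requested.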
An important use case is fine-tuning, where you only want to slightly alter a few layers of a pretrained network: gradients are requested for a small subset of the parameters, while everything else stays frozen.
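To make this concrete, consider a toy two-layer setup (illustrative only, not taken from the poster): only the first convolution is trainable, yet gradients must still flow through the frozen second convolution, and on the PyTorch behavior described above that layer's input is kept alive for backward even though only its weight is needed there. PyTorch's `torch.autograd.graph.saved_tensors_hooks` lets us peek at what gets stored:

```python
import torch

# Toy setup: only conv1 is trained, conv2 is frozen, yet gradients
# must flow through conv2 to reach conv1's parameters.
conv1 = torch.nn.Conv2d(3, 16, 3, padding=1)   # trainable
conv2 = torch.nn.Conv2d(16, 16, 3, padding=1)  # frozen
conv2.weight.requires_grad_(False)
conv2.bias.requires_grad_(False)

saved_shapes = []

def pack(t):
    # Called for every tensor the graph stores for the backward pass.
    saved_shapes.append(tuple(t.shape))
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
    x = torch.randn(8, 3, 64, 64)
    y = conv2(conv1(x)).sum()

print(saved_shapes)
# Expect the (8, 16, 64, 64) input of the frozen conv2 to show up here,
# although only conv2's weight is required to backpropagate to conv1.
```

A differentiability-aware layer can drop exactly this tensor, which is where the memory savings come from.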
Results
Here are the summarized results on four models (plots from the poster):