Optimizing PyTorch Docker images: how to cut size by 60%


When training neural networks on a cluster, using Docker containers is one of the easiest ways to package up all the dependencies. However, PyTorch and CUDA are both large libraries, which can make your Docker images quite hefty. For example, we might have the following Dockerfile:

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install --no-install-recommends -y \
        build-essential \
        python3.10 \
        python3-pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

RUN pip3 install torch torchvision torchaudio \
        --index-url https://download.pytorch.org/whl/cu118

It installs CUDA, Python, and PyTorch, providing a general-purpose setup for neural network training. However, the resulting image is a staggering 7.6 GB! Fortunately, with a few optimizations, you can significantly reduce the size of this image.
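
To reproduce the numbers, build the image and check its size with docker images; the pytorch-train tag below is just an example name, and the exact size you see may vary slightly between Docker versions:

# build the image from the Dockerfile in the current directory
docker build -t pytorch-train .
# print the resulting image size
docker images pytorch-train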

You don't need CUDA

Perhaps surprisingly, you don't need the CUDA runtime image at all! The PyTorch wheels bundle the CUDA libraries they need, so you can save space by switching to a lighter base image. A one-line change saves about 2 GB:

# FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
FROM nvidia/cuda:11.8.0-base-ubuntu22.04

The resulting image is 5.6 GB, about 26% smaller than the original.
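
It's worth a quick sanity check that PyTorch still sees the GPU from the slimmer image. The command below assumes the host has the NVIDIA Container Toolkit installed and reuses the example pytorch-train tag; it should print True:

# run a throwaway container with GPU access and ask PyTorch whether CUDA works
docker run --rm --gpus all pytorch-train \
    python3 -c "import torch; print(torch.cuda.is_available())"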

No caching, please

Pip caches all downloaded packages. This is useful on local machines, where several virtual environments might use the same packages. Having them cached means that they don't have to be downloaded again and again. This saves quite a lot of time when setting up a new environment.

However, remember that PyTorch comes with CUDA bundled? That makes the torch package quite large, around 2.5 GB. When pip caches the downloaded wheels, it adds an extra 2.5 GB of unused space to the image!
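
You can see the culprit by measuring pip's cache directory inside the unoptimized image. This assumes the container runs as root (the default here) and reuses the example pytorch-train tag:

# the cache lives under the root user's home directory by default
docker run --rm pytorch-train du -sh /root/.cache/pip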

Simply turning off caching solves the problem:

RUN pip3 --no-cache-dir install torch torchvision torchaudio \
        --index-url https://download.pytorch.org/whl/cu118

With this adjustment, the image size drops to just 2.9 GB, achieving a 62% reduction from the original size.
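
If you want to confirm where the savings come from, docker history lists the size of each layer, so you can compare the pip install layer before and after the change (pytorch-train is still just an example tag):

# the layer created by the RUN pip3 ... step should now be roughly 2.5 GB smaller
docker history pytorch-train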

Bonus: save size when building CUDA kernels

The first optimization works well if you only need the runtime libraries, but what if you need to build custom CUDA kernels before training? In that case, you'd have to use the devel flavor of the CUDA image, which is significantly larger at about 5 GB (compared to roughly 800 MB for the base flavor).

Because we often modify custom kernels, multi-stage builds aren't practical; rebuilding the image every time would be too cumbersome.

Looking into the devel image, you can see that it installs a wide array of development packages:

apt-get install -y --no-install-recommends cuda-cudart-dev-11-8 \
    cuda-command-line-tools-11-8 cuda-minimal-build-11-8 cuda-libraries-dev-11-8 \
    cuda-nvml-dev-11-8 libnpp-dev-11-8 libcusparse-dev-11-8 libcublas-dev-11-8 \
    libnccl-dev cuda-nsight-compute-11-8

This setup includes a debugger, the Nsight profiler, and headers for specialized libraries such as NVML. However, for model training you definitely don't need a debugger or a profiler. Moreover, most custom CUDA kernels just perform arithmetic and do not need specialized libraries such as NCCL or cuBLAS.

The only crucial package is cuda-minimal-build, which pulls in nvcc and the minimal toolchain for compiling CUDA code. Adding it to your Docker build increases the image size by just about 100 MB, a small price for the ability to build custom kernels.
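
Putting it all together, a minimal sketch of a kernel-building image might look like the following. It assumes the CUDA apt repository already configured in the base image provides cuda-minimal-build-11-8, the package version from the devel listing above:

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# build-essential and cuda-minimal-build give us a host compiler plus nvcc
RUN apt-get update && apt-get install --no-install-recommends -y \
        build-essential \
        python3.10 \
        python3-pip \
        cuda-minimal-build-11-8 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# install PyTorch without keeping the downloaded wheels in pip's cache
RUN pip3 --no-cache-dir install torch torchvision torchaudio \
        --index-url https://download.pytorch.org/whl/cu118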

Conclusion

By carefully selecting the right base image and disabling unnecessary caching, you can significantly reduce the size of your PyTorch Docker images. These optimizations can lead to space savings of up to 60%, making your Docker setup more efficient and easier to manage.