Run docker-based workload on HPC with GPU

In this case study, we will walk thru how to convert a docker image into singularity format and import it into the cluster, how to look up appropriate hardware, and finally enqueue a job.

Due to security concerns, OAsis HPC supports Singularity rather than Docker. But you may convert docker images easily using command line statements.

Convert a docker image to Singularity

If you are using containers other than Docker and Singularity, please consult this page for details.

In order to communicate with our GPU, your container should have CUDA. Version 11.6 is recommended. Besides packaging CUDA from scratch, you may also extend the built-in image nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif in /pfss/containers. It has lots of GPU libraries and utilities pre-built.

singularity pull julia.1.8.2.sif docker://julia:alpine3.16

# move it to the containers folder, then we can run it in the web portal
mkdir -p ~/containers
mv julia.1.8.2.sif ~/containers

2. Prepare sbatch arguments for gpu usage

3.1 First, find the partition page, and then open the nodes tab.

3.2 In this popup, we can find all the gpus and their availability in this cluster. For example, there are 4 gpus with 1 gpu core and 10gb memory, 1 gpu with 3 gpu cores and 40gb memory, 1 gpu with entire a100 capability.

4. Execute cmd in container by using srun cmd or using sbatch script file. In this example, we use 1 a100 gpu to run nvaccelinfo. The result should show us the gpu info.

4.1 With srun cmd

srun -p gpu --gpus a100:1 singularity exec --nv /pfss/containers/nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif /bin/sh -c nvidia-smi

4.2 With sbatch script

#!/usr/bin/env bash

#SBATCH -p gpu
#SBATCH --gpus a100:1

singularity exec --nv /pfss/containers/nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif /bin/sh -c nvaccelinfo