
Run Docker-based workloads on HPC with GPUs

In this case study, we will walk through how to convert a Docker image into Singularity format and import it into the cluster, how to look up appropriate hardware, and finally how to enqueue a job.

Due to security concerns, OAsis HPC supports Singularity rather than Docker, but you can easily convert Docker images on the command line.

Convert a Docker image to Singularity

If you use container formats other than Docker and Singularity, please consult this page for details.

To communicate with our GPUs, your container should include CUDA; version 11.6 is recommended. Besides packaging CUDA from scratch, you may also extend the built-in image nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif in /pfss/containers, which comes with many GPU libraries and utilities pre-built.
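
If you go the extension route, a Singularity definition file can bootstrap from that local image and add your own dependencies on top. The sketch below is only an illustration: the file name my-tool.def and the packages installed are assumptions, and singularity build may require --fakeroot rights or a machine where you are allowed to build images.

Bootstrap: localimage
From: /pfss/containers/nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif

%post
    # add your own dependencies on top of the NVIDIA HPC SDK base image
    apt-get update && apt-get install -y --no-install-recommends python3-pip
    pip3 install numpy

%runscript
    exec "$@"

Build it with singularity build --fakeroot my-tool.sif my-tool.def, then use the resulting my-tool.sif like any other container below.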

We recommend putting your containers in the containers folder under one of the scratch folders so you can browse them inside the portal. The following example converts the julia Docker image to Singularity and places it in the containers folder under your home directory.

# pull julia:alpine3.16 from Docker Hub and convert it to Singularity (SIF) format
singularity pull julia.1.8.2.sif docker://julia:alpine3.16

# move it to the containers folder, then we can run it in the web portal
mkdir -p ~/containers
mv julia.1.8.2.sif ~/containers
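
Before wiring the image into a job, it is worth a quick sanity check that the converted image actually runs; printing the Julia version from inside the container is enough (the path below assumes the file name used above).

# run a simple command inside the converted image
singularity exec ~/containers/julia.1.8.2.sif julia --version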

Explore available GPUs

There may be more than one GPU available for your account.

Head to the partitions page to look up the best partition for your job.

(Screenshot: Partitions page in the OAsis HPC Center portal)

You may also click on the node count to check out the nodes under a partition. In this popup, we can find all the GPUs and their availability in the cluster. In the following example, there is only one node currently available in the gpu partition, and that node has 6 idle GPUs ready for use.

a100 refers to the NVIDIA A100 80GB GPU with its entire capability. 1g.10gb and 3g.40gb are MIG (Multi-Instance GPU) partitions of that card: 1g.10gb provides 1 GPU compute slice with 10 GB of memory, and 3g.40gb provides 3 compute slices with 40 GB of memory.

(Screenshot: node popup showing GPU types and availability in the gpu partition)
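
The same information is also available from the command line. A minimal check with sinfo might look like the following; the exact GRES strings you see depend on how the cluster administrators configured the nodes.

# list nodes in the gpu partition with their GPU resources and current state
sinfo -p gpu -N -o "%N %G %t"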

Run the job

Execute a command in the container using srun or an sbatch script. In the examples below, we request one a100 GPU and run a diagnostic command (nvidia-smi or nvaccelinfo); the output should show the GPU information.

With srun

srun -p gpu --gpus a100:1 singularity exec --nv /pfss/containers/nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif /bin/sh -c nvidia-smi
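
The --gpus type:count syntax also accepts the MIG profiles listed on the partitions page, assuming they are exposed as GPU types on this cluster; a smaller slice is often enough for light work and leaves the rest of the card free. For example:

# request a single 3g.40gb MIG slice instead of a full A100
srun -p gpu --gpus 3g.40gb:1 singularity exec --nv /pfss/containers/nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif nvidia-smi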

With an sbatch script

#!/usr/bin/env bash

# request one full A100 GPU in the gpu partition
#SBATCH -p gpu
#SBATCH --gpus a100:1

# print GPU information from inside the container
singularity exec --nv /pfss/containers/nvhpc.22.9-devel-cuda_multi-ubuntu20.04.sif /bin/sh -c nvaccelinfo
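
Save the script (here assumed to be named gpu-job.sh), submit it with sbatch, and monitor it with squeue; by default sbatch writes the output to a slurm-<jobid>.out file in the submission directory.

sbatch gpu-job.sh
squeue -u $USER
# once the job finishes, the output is in slurm-<jobid>.out
cat slurm-<jobid>.out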