Quick jobs

Quick job is one of our web portal's features. It is an excellent way to unify and speed up your team's workflow. For example, you may define what computing resources are required, what software to use, and where the output goes. You may also expose options to your teammate to fine-tune an individual run.

Quick jobs are typical .sbatch scripts. The portal will open a job launcher window when one clicks a .sbatch file. Then you customize the launcher behavior by optional metadata.

Below is a multi-GPU deep reinforcement learning task with a custom description and several exposed options.

#!/usr/bin/env bash

#SBATCH -J sac
#SBATCH -o sac.out
#SBATCH -p gpu
#SBATCH -n 8
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --gpus-per-task a100:1
#SBATCH --mem-per-cpu=16000

<<setup
desc: Train a Soft Actor Critic (sac) model on OpenAI Gym environments.
inputs:
  - code: env_id
    display: Environment Id
    type: dropdown
    default: BipedalWalker-v3
    options:
      - BipedalWalker-v3
      - LunarLanderContinuous-v2
      - AntBulletEnv-v0
      - InvertedPendulumBulletEnv-v0
      - CartPoleContinuousBulletEnv-v0
      - PongNoFrameskip-v4
    required: true
  - code: num_threads
    display: Number of threads
    type: text
    default: 8
    required: true
  - code: max_episodes
    display: Max episodes
    type: text
    default: 1000
    required: true
  - code: reward_scale
    display: Reward scale
    type: text
    default: 2
    required: true
  - code: alpha
    display: Alpha, learning rate of the actor network
    type: text
    default: 0.0003
    required: true
  - code: beta
    display: Beta, learning rate of the critic network
    type: text
    default: 0.0003
    required: true
  - code: tau
    display: Tau, the rate of updating the target value (the softness)
    type: text
    default: 0.005
    required: true
  - code: batch_size
    display: Batch size
    type: text
    default: 256
    required: true
  - code: layer1_size
    display: Layer1 size
    type: text
    default: 256
    required: true
  - code: layer2_size
    display: Layer2 size
    type: text
    default: 256
    required: true
extra_desc: |+
  output model will be stored in ./sac_model
  loss for each episode is plotted to ./sac_loss.png
setup

module load GCC/11.3.0 OpenMPI/4.1.4

mpiexec singularity exec --nv \
  --env env_id=%env_id% \
  --env num_threads=%num_threads% \
  --env max_episodes=%max_episodes% \
  --env reward_scale=%reward_scale% \
  --env alpha=%alpha% \
  --env beta=%beta% \
  --env tau=%tau% \
  --env batch_size=%batch_size% \
  --env layer1_size=%layer1_size% \
  --env layer2_size=%layer2_size% \
  /pfss/scratch02/appcara/gym.sif python sac.py

There is a section colored orange that defines how it interfaces with users. It is in YAML format and consists of three sub-sections: desc, inputs, and optional extra_desc. The launcher renders the below form to capture the user's input. The system expects placeholders in format %input_code% in the file content and will replace them with the user input.

The built-in quick jobs

The three built-in quick jobs discussed in the previous section are also made of the above syntax. You may find them in the parallel file system.

Jupyter Lab in /pfss/toolkit/start_jupyter.sbatch
VNC in /pfss/toolkit/start_vnc.sbatch
Run container in /pfss/toolkit/run_container.sbatch

Brief introduction to the cluster

Access the cluster

Builtin software

Finding help

Submit jobs

Fine tune your workload

Access your files

Manage your team

Custom software

Quick jobs

Jobs, quota, and setup alerts

Manage accounts and quotas

Billing, cost allocation and reports

Integrate your own workflow with job automation APIs

Run docker-based workload on HPC with GPU

Render 3D graphics with Blender

AI painting with stable diffusion

Run and train chatbots with OpenChatKit

PyTorch with GPU in Jupyter Lab using container-based kernel

Run NVIDIA-Merlin MovieLens Example in Jupyter Lab

Multinode PyTorch Model Training using MPI and Singularity

Running the Vicuna-33B/13B/7B Chatbot with FastChat

Run nemo-megatron-gpt-5B model with NVIDIA NeMo

Accelerating molecular dynamics simulations with MPI and GPU

Accelerate a simple C++ program with MPI and CUDA

Accelerate FASTQ to BAM conversion using GPU and Parabricks

Generate sound effect/music with Meta's AudioCraft

Introduce Nvidia Modulus Symbolic (Modulus Sym)

Nvidia Modulus Symbolic(Modulus Sym) Workflow and Example

Retrieval Augmentation Generation - Langchain integration with local LLM

Using 10x Genomics Cell Ranger

Creating JupyterLab in Kubernetes workspace

Insufficient disk space for Anaconda3

Quick jobs

The built-in quick jobs