
Brief introduction to the cluster

The cluster consists of many components that together provide a good experience for a wide range of tasks and users. Below are some highlights:

  1. A variety of compute nodes
  2. Parallel file system storage
  3. Fast InfiniBand and Ethernet networks
  4. Login node farm (SSH servers)
  5. Web portal
  6. CLI client
  7. Software via modules and containers

This user guide does not cover detailed hardware specifications but instead focuses on the user experience. If you are interested in those technical details, please get in touch with us.

Compute nodes

We want to provide a heterogeneous cluster with a wide variety of hardware and software so that users can experience different combinations. Compute nodes may differ greatly in model, architecture, and performance. We carefully build and fine-tune the software on the cluster to ensure it fully leverages the available computing power, and we provide tools on our web portal to help you choose what suits you.

Besides the OneAsia resources, you are also welcome to bring in your own hardware. Our billing system charges jobs by individual node, which means a single large job can allocate computing power owned by multiple providers (see the job sketch after the list below). To align terminology, we group the hardware into three pools:

  1. OneAsia
    • Hardware owned by OneAsia Network Limited
  2. Bring-in shared
    • Hardware brought in by external parties who are willing to share it with others
    • You can control access with quotas, priority, preemption, and fair share
  3. Bring-in dedicated
    • Hardware brought in by external parties who prefer not to share it
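
For illustration, a batch job requests nodes through SLURM in the usual way; which pool the nodes come from follows your account and partition settings. The values in angle brackets below are placeholders, not actual names on the cluster.

    #!/bin/bash
    #SBATCH --job-name=demo            # name shown in the queue
    #SBATCH --nodes=4                  # number of compute nodes to allocate
    #SBATCH --ntasks-per-node=8        # tasks per node
    #SBATCH --time=01:00:00            # wall-clock limit
    #SBATCH --partition=<partition>    # placeholder: pick a partition from the web portal
    #SBATCH --account=<account>        # placeholder: the account to be billed

    srun hostname                      # print the hostname of every allocated task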

Storage

The cluster has a parallel file system that provides fast and reliable access to your data. We charge monthly based on the maximum allowed quota. You can quickly check your quota through our web portal or CLI client, and you may request a larger quota at any time by submitting a ticket to us.

You should see at least three file sets (a usage sketch follows the list):

  1. User home directory
    • To store your persistent data. It is mounted at /pfss/home/$USER and has a default quota of 10GB
    • You may access the path with an environment variable: $HOME
  2. User scratch directory
    • To be used for I/O during your job. It is mounted at /pfss/scratch01/$USER and has a default quota of 100GB
    • We may purge files that have been inactive for 30 days
    • You may access the path with an environment variable: $SCRATCH
  3. Group scratch directory
    • To share files with your group mates. It is mounted at /pfss/scratch02/$GROUP and has a default quota of 1TB.
    • You may access the path with an environment variable: $SCRATCH_<GROUP NAME>
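
As a rough sketch, the environment variables above can be used directly in an interactive shell or a job script. The group name and file names below are only illustrations; your own group and data will differ.

    # Inspect the three file sets (paths resolve per user and group)
    echo "$HOME"              # /pfss/home/$USER       - persistent data
    echo "$SCRATCH"           # /pfss/scratch01/$USER  - job I/O, subject to purging
    echo "$SCRATCH_myteam"    # /pfss/scratch02/myteam - group space ("myteam" is illustrative)

    # A typical pattern: stage input into scratch, run there, keep results in home
    cp "$HOME/input.dat" "$SCRATCH/"
    cd "$SCRATCH"
    # ... run your workload here ...
    cp results.out "$HOME/"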

There are many ways you can access your files (an SFTP example follows the list):

  1. From the web portal file browser
  2. SSH / SFTP
  3. Mount to your local computer using our CLI client
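
For example, any standard SCP or SFTP client works; the hostname below is a placeholder for the login address shown on the web portal.

    # Copy a local file to your home directory (hostname is a placeholder)
    scp input.dat username@login.example-cluster.com:/pfss/home/username/

    # Or browse interactively over SFTP
    sftp username@login.example-cluster.com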

When running jobs, whether you use our modules or containers, all file sets you have access to will be available.

Networking

Traffic between compute nodes, and between compute nodes and the parallel file system, goes through our InfiniBand network. Both our modules and containers are compiled with the latest MPI toolchain to utilize the bandwidth fully.
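
As a hedged sketch, an MPI job launched through SLURM will communicate over the InfiniBand fabric; the module name and program below are placeholders for whatever you select from the software browser.

    #!/bin/bash
    #SBATCH --nodes=2                # spread the job across two compute nodes
    #SBATCH --ntasks-per-node=8      # MPI ranks per node
    #SBATCH --time=00:30:00          # wall-clock limit

    module load openmpi              # placeholder: the MPI toolchain you intend to use
    srun ./my_mpi_app                # srun starts the MPI ranks across both nodes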

Login node farm

Whether you access the cluster through our web portal or your own SSH client, you will be connecting to our farm of SSH login servers. Our load balancer connects you to the server with the fewest connections. Access is granted only by private key; no password authentication is allowed. If you prefer not to keep a private key, you can connect through the web portal or the CLI client instead.
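
A typical connection looks like the following; the login address is a placeholder for the one shown on the web portal.

    # Connect with your private key (password authentication is disabled)
    ssh -i ~/.ssh/id_ed25519 username@login.example-cluster.com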

You will have 8 shared CPU cores and 8GB of memory, free of charge, for preparing your software, job scripts, and data. You will have access to all your file sets, all modules, containers, SLURM commands, and our CLI client.

Please leverage compute nodes for heavy workloads. If you need more resources on the login node, please submit a ticket and let us help.

Web portal

Our web portal provides many features to make the journey easier. Our goal is to enable users from different backgrounds to consume HPC resources quickly and efficiently. We also use the web portal internally for research and management. We will cover the details in later chapters; below are some highlights:

  1. Web Terminal
  2. File browser
  3. Software browser
  4. Quick jobs launcher
  5. Job efficiency viewer and alert
  6. Quota control
  7. Team management
  8. Ticket system
  9. Cost allocation

CLI client

To further accelerate the workflow, we created our command line client. We will cover the details later, but below are some example use cases:

  1. Connect to the login nodes farm without the private key
  2. Mount a file set to your local computer
  3. Allocate ports from compute nodes for GUI workloads
  4. Check quota and usage
  5. Check cluster health

Software

The cluster currently provides free software in two ways: Lmod modules and containers. Our team works hard to provide state-of-the-art software fine-tuned for the cluster's compute nodes. You may log in to our web portal to browse the available software.
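
With Lmod, browsing and loading software from the command line looks like the following; the package name is only an example of what might be listed.

    module avail              # list the modules available on the cluster
    module spider python      # search for a package and its versions (example name)
    module load python        # load it into your environment (example name)
    module list               # show what is currently loaded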

Besides software, we also provide pre-trained models and popular data sets. We will cover the details later.