Jobs, quota, and alert setup

You may want to check the jobs your teammates submitted to ensure they are using your resources reasonably. This article covers how to check jobs, how the quota system works, and how to set up alerts so the system monitors for you.

Check running, queuing, and completed jobs

Job owners can inspect their own jobs through the dropdown at the top-right corner, while an account owner can review every member's jobs on the account page. From the Accounts page, click the account name you want to check. In the overview sub-page, you will see links to inspect jobs; click either running or completed jobs to open the jobs window.

In the running tab, you see the jobs currently running under the selected account, along with their requester, partition, duration, and real-time per-job CPU and memory utilization. You may cancel any job by clicking the cancel button.

In the queuing tab, you see the jobs currently waiting in queues, the partition they are in, and the requested CPU or memory. You may cancel a job or change its priority.

In the completed tab, you see jobs that have completed or failed. Click a job ID to view its detailed charge and utilization status.

If you want to sort jobs by CPU or memory utilization, click the gear button at the top-right corner to toggle the table's columns.

Set up alerts about utilization

To spot under-utilized jobs, you can inspect jobs on the portal in real time, but the system also provides a way to monitor them automatically. Switch to the settings tab on the account page, and you will see the job efficiency monitor section.

job-eff-monitor.png

By default, the system notifies the owner if a job is using less than 50% of either its requested CPU or memory. The first 10 minutes are not counted because the application is assumed to be warming up. You may adjust these settings to suit your needs.
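
To illustrate how such a check could work, here is a minimal Python sketch. It is not the actual OAsis implementation: the JobSample structure and the sampling logic are assumptions, while the 50% threshold and 10-minute grace period are the defaults described above.

# Hypothetical sketch of a job efficiency check; not the real OAsis internals.
from dataclasses import dataclass

GRACE_MINUTES = 10   # warm-up period that is not counted
THRESHOLD = 0.5      # notify when below 50% of requested CPU or memory

@dataclass
class JobSample:
    minutes_since_start: float
    cpu_utilization: float   # fraction of requested CPU in use, 0.0-1.0
    mem_utilization: float   # fraction of requested memory in use, 0.0-1.0

def should_notify(samples: list[JobSample]) -> bool:
    """Return True if average utilization after the warm-up period
    falls below the threshold for either CPU or memory."""
    counted = [s for s in samples if s.minutes_since_start >= GRACE_MINUTES]
    if not counted:
        return False  # still warming up
    avg_cpu = sum(s.cpu_utilization for s in counted) / len(counted)
    avg_mem = sum(s.mem_utilization for s in counted) / len(counted)
    return avg_cpu < THRESHOLD or avg_mem < THRESHOLD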

How quota works

OAsis uses its own quota system, which differs from a typical SLURM setup. Instead of a single combined total, it provides six separate meters:

  • CPU Oneasia
  • GPU Oneasia
  • CPU Shared
  • GPU Shared
  • CPU Dedicated
  • GPU Dedicated

As their names suggest, they refer to CPU and GPU usage across three node pools. The unit of CPU usage is the number of hours spent on one AMD EPYC 7713 core; the unit of GPU usage is the number of hours spent on one NVIDIA A100 GPU card.
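
For example, assuming usage accumulates as core-hours and card-hours as the unit definitions above suggest, a job running for 2 hours on 8 cores consumes 16 CPU units, and a job running for 3 hours on 2 A100 cards consumes 6 GPU units. The job sizes below are hypothetical:

# Illustrative arithmetic only, based on the unit definitions above.
cpu_cores, cpu_hours = 8, 2.0
gpu_cards, gpu_hours = 2, 3.0

cpu_units = cpu_cores * cpu_hours   # 16.0 units (EPYC 7713 core-hours)
gpu_units = gpu_cards * gpu_hours   # 6.0 units (A100 card-hours)
print(cpu_units, gpu_units)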

Quota is applied at the account (group) level and considers not only your own account's quota but also every upper-level account. For example, an institute may have 1,000 units of "GPU Oneasia" evenly distributed to 4 departments, and the departments can in turn assign them to their project groups. The system only accepts a job when all levels (institute, department, project group) have enough quota.
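
As a rough sketch of that multi-level check, consider the following Python snippet; the account names and remaining balances are made up for illustration and do not come from a real system:

# Hypothetical sketch of the multi-level quota check described above.
# Remaining "GPU Oneasia" units at each level of the hierarchy.
remaining = {
    "institute": 40.0,
    "department": 12.0,
    "project_group": 5.0,
}

def job_accepted(requested_units: float) -> bool:
    """A job is accepted only when every level in the chain
    (institute -> department -> project group) has enough quota left."""
    return all(requested_units <= left for left in remaining.values())

print(job_accepted(4.0))   # True: fits at all three levels
print(job_accepted(8.0))   # False: the project group has only 5.0 units left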

The system supports a custom reset period per account. You may choose from weekly, monthly, quarterly, and yearly.

Check current usage and my quota

You may check your current usage and quota through the web portal on the Accounts page.

view-quota.png

You may also check them through the CLI client as follows:

$ hc quotas
# Account   | CPU/Mem Oneasia   | CPU/Mem Shared    | GPU Shared        | CPU/Mem Dedicated | GPU Oneasia       | GPU Dedicated
# appcara   | 0.2 / 800         | 0.0               | 0.0               | 0.0               | 0.0 / 100         | 0.0

# or, if you prefer JSON format
$ hc quotas -o json
# [
#  {
#    "account_id": "appcara",
#    "quota": {
#      "oneasia_csu": 800.0,
#      "oneasia_gsu": 100.0
#    },
#    "usage": {
#      "dedicated_csu": 0.0,
#      "dedicated_gsu": 0.0,
#      "oneasia_gsu": 0.047666665,
#      "shared_csu": 0.0,
#      "shared_gsu": 0.0,
#      "oneasia_csu": 0.16666667
#    }
#  }
# ]
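
If you script around the CLI, the JSON output is straightforward to post-process. Below is a minimal Python sketch that assumes the output shape shown above; the subprocess call and the derived "remaining" figures are our own additions rather than features of the hc client:

# Minimal sketch: compute remaining quota from `hc quotas -o json`.
# Assumes the output shape shown above; meters missing from "quota"
# are simply skipped here.
import json
import subprocess

output = subprocess.run(
    ["hc", "quotas", "-o", "json"],
    capture_output=True, text=True, check=True,
).stdout

for account in json.loads(output):
    quota = account.get("quota", {})
    usage = account.get("usage", {})
    print(account["account_id"])
    for meter, limit in quota.items():
        used = usage.get(meter, 0.0)
        print(f"  {meter}: used {used:.2f} of {limit:.2f}, remaining {limit - used:.2f}")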

Set quota and auto alerts

If your upper-level account has granted you permission to modify quotas, you can do so on the account settings page.

quota-settings.png

You may change the "Behavior when quota exceeded" setting from "Notify Only" to "Auto kill jobs" if you want a hard quota limit.