Introduction to Supercomputing Architecture, Linux and job scheduling (SLURM)
Intro to Supercomputing Architecture:
Slides
Introduction to SLURM: Theory and Usage
What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is a widely used open-source workload manager designed to efficiently allocate computing resources on High-Performance Computing (HPC) clusters. It manages how computational jobs are scheduled, executed, and monitored across the cluster.
How SLURM Works
SLURM operates based on the following key concepts:
Nodes: Individual computers within a cluster, each with multiple CPUs or GPUs.
Partitions: Logical groups of nodes configured by administrators, typically organized by node capability or job duration.
Jobs: Tasks or programs submitted by users to be executed on the cluster.
Scheduler: The core component of SLURM, responsible for managing resources and scheduling jobs based on priority, availability, and job requirements.
When a user submits a job, SLURM places it into a queue. The scheduler prioritizes and allocates resources to jobs based on user requests, resource availability, and cluster policies. Once resources become available, the scheduler assigns the necessary nodes and executes the job automatically.
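The queue-then-allocate loop described above can be illustrated with a toy sketch in Python. This is a deliberate simplification (real SLURM policy also involves fairshare, preemption, and per-partition limits); the function name and job tuples are invented for illustration.

```python
import heapq

def schedule(jobs, total_nodes):
    """Toy priority scheduler: run the highest-priority jobs that fit.

    jobs: list of (priority, name, nodes_needed); higher priority runs first.
    Returns (running, pending) lists of job names.
    """
    # Negate priority so Python's min-heap pops the highest priority first.
    queue = [(-priority, name, nodes) for priority, name, nodes in jobs]
    heapq.heapify(queue)

    free = total_nodes
    running, pending = [], []
    while queue:
        _, name, nodes = heapq.heappop(queue)
        if nodes <= free:
            free -= nodes          # allocate nodes and start the job
            running.append(name)
        else:
            pending.append(name)   # not enough nodes yet; job stays queued

    return running, pending
```

Note that a smaller, lower-priority job can start while a larger job waits for nodes, which is a crude version of the backfill scheduling SLURM performs.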
Submitting Jobs to SLURM
To submit jobs to SLURM, users typically write a simple batch script and then submit it using the sbatch command.
A SLURM batch script sets job options through #SBATCH directives:
--job-name: Specifies the name of your job.
--output and --error: Files to save the standard output and error messages.
--time: Requested maximum runtime of the job (format HH:MM:SS).
--partition: Partition (group of nodes) your job should run on.
--nodes: Number of nodes required.
--ntasks: Number of parallel tasks (typically equal to the number of processes you want to run).
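Put together, a minimal batch script using these options might look like the following sketch. The job name, output filenames, partition name, and final command are placeholders; substitute values appropriate for your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err
#SBATCH --time=00:10:00
#SBATCH --partition=short    # replace with a partition on your cluster
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Commands below run on the allocated node(s)
echo "Job running on $(hostname)"
```

SLURM reads the #SBATCH lines (which Bash treats as comments) before executing the script body on the allocated node.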
Useful SLURM Commands
sbatch myscript.sh: Submits a job script.
squeue: Lists jobs currently in the queue.
scancel <job_id>: Cancels a job based on its Job ID.
sinfo: Displays information about partitions and node availability.
Checking Job Status
You can monitor your job status with squeue -u $USER. This will show all your current jobs and their status (pending, running, etc.).
Canceling Jobs
If you need to cancel a job, use scancel <job_id>, replacing <job_id> with the actual ID of the job you want to cancel.
Key SLURM Commands
Command   Purpose
sinfo     View available resources
squeue    See running/pending jobs
sbatch    Submit a batch job
srun      Launch a job step or interactive job
scancel   Cancel a running job
sacct     View accounting/history (if enabled)
Try it out on TAMU FASTER
Guide from https://hprc.tamu.edu/kb/User-Guides/FASTER/ACCESS-CI/#getting-an-access-account
Authorized ACCESS users can log in using the Web Portal.
Compose a job using Drona Composer
Click on Drona Composer
Setup your SLURM job using the GUI.
job name: ship_fractal
location: leave as is
Environments: Generic
Upload files: select file, then upload the ship.py file from your local machine.
Sample job code:
On your local machine, make a file called ship.py. This is a sample script that we will run on the HPC.
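The workshop's original ship.py is not reproduced here. As a hypothetical stand-in, consistent with the ship_fractal job name, a small script along these lines would work (everything in it, including the rendering region and character choices, is an assumption):

```python
# ship.py -- hypothetical stand-in for the workshop script.
# Renders a small ASCII view of the "burning ship" fractal.

def burning_ship(cx, cy, max_iter=100):
    """Return the escape iteration for point (cx, cy), or max_iter if bounded."""
    x = y = 0.0
    for i in range(max_iter):
        # Burning ship map: z -> (|Re z| + i|Im z|)^2 + c
        x, y = x * x - y * y + cx, 2.0 * abs(x * y) + cy
        if x * x + y * y > 4.0:
            return i
    return max_iter

def render(width=70, height=24, x_min=-2.2, x_max=1.2, y_min=-2.0, y_max=1.0):
    """Build an ASCII grid: '#' for points that stay bounded, '.' otherwise."""
    rows = []
    for j in range(height):
        cy = y_min + (y_max - y_min) * j / (height - 1)
        row = ""
        for i in range(width):
            cx = x_min + (x_max - x_min) * i / (width - 1)
            row += "#" if burning_ship(cx, cy) == 100 else "."
        rows.append(row)
    return "\n".join(rows)

if __name__ == "__main__":
    print(render())
```

The printed grid ends up in the job's standard-output file, so you can check the result from the dashboard without transferring any data back.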
Number of Tasks: 1
No Accelerator
Total memory: 40GB
Expected Run Time: 10 Minutes
Project Account: Default one
Click Preview, then add the command that runs the script below the line that says ADD YOUR COMMANDS BELOW.
Your template.txt should now include the command you added.
Click submit.
Go back to the main dashboard, then Jobs, Active Jobs, to view the job and its file output.
If your job completes, go to: dashboard, files, scratch, and a path like this to find the job output:
/scratch/user/u.sc126842/drona_composer/runs/ship