Rebecca Hartman-Baker, PhD, User Engagement Group Lead
Charles Lively III, PhD, Science Engagement Engineer
Helen He, PhD, User Engagement Group
June 28, 2024
Introduction to SLURM: Theory and Usage
What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is a widely used open-source workload manager that allocates computing resources on High-Performance Computing (HPC) clusters. It manages how computational jobs are scheduled, executed, and monitored across the cluster.
How SLURM Works
SLURM operates based on the following key concepts:
Nodes: Individual computers within a cluster, each typically with multiple CPU cores and, on some nodes, GPUs.
Partitions: Logical groups of nodes configured by administrators, typically organized by node capability or job duration.
Jobs: Tasks or programs submitted by users to be executed on the cluster.
Scheduler: The core component of SLURM, responsible for managing resources and scheduling jobs based on priority, availability, and job requirements.
When a user submits a job, SLURM places it into a queue. The scheduler prioritizes and allocates resources to jobs based on user requests, resource availability, and cluster policies. Once resources become available, the scheduler assigns the necessary nodes and executes the job automatically.
Submitting Jobs to SLURM
To submit jobs to SLURM, users typically write a simple batch script and then submit it using the sbatch command.
Here's a basic example of a SLURM batch script:
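A minimal sketch of such a script, built from the directives explained below. The job name, file names, time limit, and the partition name cpu are placeholder assumptions, not values from any particular cluster, and the echo line stands in for a real workload.

```shell
#!/bin/bash
# All values below are illustrative placeholders; adjust them for your cluster.
#SBATCH --job-name=example_job
#SBATCH --output=example_job.out
#SBATCH --error=example_job.err
#SBATCH --time=00:10:00
# "cpu" is a hypothetical partition name; run sinfo to list the real ones.
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1

# SLURM sets SLURM_JOB_ID inside a job; fall back to "unknown" elsewhere.
MESSAGE="Hello from SLURM job ${SLURM_JOB_ID:-unknown}"
echo "$MESSAGE"
```

To bash, the #SBATCH lines are ordinary comments, so the script also runs outside the scheduler; sbatch reads them as options when the script is submitted.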
--job-name: Specifies the name of your job.
--output and --error: Files to save the standard output and error messages.
--time: Maximum runtime requested for the job, in HH:MM:SS format (for example, 01:30:00 for 90 minutes).
--partition: Partition (group of nodes) your job should run on.
--nodes: Number of nodes required.
--ntasks: Number of parallel tasks (typically equal to the number of processes you want to run).
Useful SLURM Commands
sbatch myscript.sh: Submits a job script.
squeue: Lists jobs currently in the queue.
scancel <job_id>: Cancels a job based on its Job ID.
sinfo: Displays information about partitions and node availability.
Checking Job Status
You can monitor your job status with:
squeue -u $USER
This will show all your current jobs and their status (pending, running, etc.).
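squeue reports each job's status as a short state code such as PD or R. The sketch below maps the most common codes to readable words and, only if squeue is installed, prints one line per job. The helper name describe_state is hypothetical and the code list is deliberately incomplete.

```shell
#!/bin/bash
# Translate a squeue state code (the ST column) into a short description.
# describe_state is a hypothetical helper, not part of SLURM.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        F)  echo "failed" ;;
        CA) echo "cancelled" ;;
        TO) echo "timeout" ;;
        *)  echo "other ($1)" ;;
    esac
}

# Only query the queue when squeue is actually installed.
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header; %i is the job ID and %t the compact state code.
    squeue -u "$USER" -h -o "%i %t" | while read -r job_id state; do
        echo "Job $job_id is $(describe_state "$state")"
    done
fi
```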
Canceling Jobs
If you need to cancel a job, use:
scancel <job_id>
Replace <job_id> with the actual ID of the job you want to cancel.
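The submit, monitor, and cancel steps can be chained in one short script. The sketch below assumes sbatch prints its default confirmation message ("Submitted batch job" followed by the ID) and reuses the example script name myscript.sh. The function name job_id_from_sbatch_output is hypothetical.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's default confirmation message,
# e.g. "Submitted batch job 12345". Hypothetical helper, not part of SLURM.
job_id_from_sbatch_output() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Only talk to the scheduler when sbatch is installed and the script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f myscript.sh ]; then
    submit_output=$(sbatch myscript.sh)
    job_id=$(job_id_from_sbatch_output "$submit_output")
    if [ -n "$job_id" ]; then
        echo "Submitted job $job_id"
        squeue -j "$job_id"      # show only this job's status
        # scancel "$job_id"      # uncomment to cancel the job
    fi
fi
```

Capturing the job ID at submission time avoids searching the squeue output for it later.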
Key SLURM Commands
Command    Purpose
sinfo      View available resources
squeue     See running/pending jobs
sbatch     Submit a batch job
srun       Launch a job step or interactive job
scancel    Cancel a pending or running job
sacct      View accounting/history (if enabled)
Try it out on TAMU FASTER
Requires an ACCESS account and a TAMU account obtained through ACCESS.