Introduction to Supercomputing Architecture, Linux and job scheduling (SLURM)

Intro to Supercomputing Architecture: Slides:

*Just section II*

Introduction to SLURM: Theory and Usage

What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is a widely-used open-source workload manager designed to efficiently allocate computing resources on High-Performance Computing (HPC) clusters. It manages how computational jobs are scheduled, executed, and monitored across the cluster.

How SLURM Works

SLURM operates based on the following key concepts:

Nodes: Individual computers within a cluster, each with multiple CPUs or GPUs.
Partitions: Logical groups of nodes configured by administrators, typically organized by node capability or job duration.
Jobs: Tasks or programs submitted by users to be executed on the cluster.
Scheduler: The core component of SLURM, responsible for managing resources and scheduling jobs based on priority, availability, and job requirements.

When a user submits a job, SLURM places it into a queue. The scheduler prioritizes and allocates resources to jobs based on user requests, resource availability, and cluster policies. Once resources become available, the scheduler assigns the necessary nodes and executes the job automatically.
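The queueing behaviour described above can be illustrated with a toy model. This is only a sketch, not SLURM's actual scheduling algorithm: the `schedule` function, the job names, and the priority numbers are all made up for illustration. It pops jobs in priority order (ties broken by submission order) and starts each one that fits in the nodes still free.

```python
import heapq

def schedule(jobs, free_nodes):
    """Toy scheduler: start queued jobs in priority order while nodes remain.

    jobs: list of (name, priority, nodes_needed) in submission order.
    Returns (started, pending) as lists of job names.
    """
    queue = []
    for order, (name, priority, nodes_needed) in enumerate(jobs):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(queue, (-priority, order, name, nodes_needed))

    started, pending = [], []
    while queue:
        _, _, name, nodes_needed = heapq.heappop(queue)
        if nodes_needed <= free_nodes:
            free_nodes -= nodes_needed
            started.append(name)
        else:
            pending.append(name)  # stays queued until nodes free up
    return started, pending

# Hypothetical jobs: (name, priority, nodes needed)
jobs = [("align", 10, 2), ("render", 50, 3), ("train", 50, 4), ("post", 5, 1)]
started, pending = schedule(jobs, free_nodes=4)
print("started:", started)
print("pending:", pending)
```

A real scheduler also weighs factors such as requested runtime, fair-share usage, and partition limits, which this sketch ignores.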

Submitting Jobs to SLURM

To submit jobs to SLURM, users typically write a simple batch script and then submit it using the sbatch command.

Here's a basic example of a SLURM batch script:

#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --time=01:00:00
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Load modules if necessary
module load python

# Run your program or command
python myscript.py

--job-name: Specifies the name of your job.
--output and --error: Files to save the standard output and error messages.
--time: Requested maximum runtime of the job (format HH:MM:SS).
--partition: Partition (group of nodes) your job should run on.
--nodes: Number of nodes required.
--ntasks: Number of parallel tasks (typically equal to the number of processes you want to run).
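To see how these options fit together, the short sketch below reads the `#SBATCH` lines of the example script into a dictionary and converts the time limit to seconds. The helper names `parse_sbatch_directives` and `walltime_seconds` are hypothetical, and the sketch assumes only `--key=value` options and the HH:MM:SS time format shown above; SLURM itself accepts other forms.

```python
def parse_sbatch_directives(script_text):
    """Collect --key=value options from #SBATCH lines into a dict."""
    options = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        # A single #SBATCH line may carry several options.
        for token in line.split()[1:]:
            if token.startswith("--"):
                key, _, value = token[2:].partition("=")
                options[key] = value
    return options

def walltime_seconds(hhmmss):
    """Convert an HH:MM:SS time limit to seconds (assumes exactly three fields)."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

script = """#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --time=01:00:00
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
"""

opts = parse_sbatch_directives(script)
print(opts["job-name"], opts["partition"], walltime_seconds(opts["time"]))
```

A check like this can be handy before submitting, for example to confirm that the requested time limit is what you intended.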

Useful SLURM Commands

sbatch myscript.sh: Submits a job script.
squeue: Lists jobs currently in the queue.
scancel <job_id>: Cancels a job based on its Job ID.
sinfo: Displays information about partitions and node availability.

Checking Job Status

You can monitor your job status with:

squeue -u <username>

This will show all your current jobs and their status (pending, running, etc.).
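If you want to check job status from a script, one possible approach is to ask `squeue` for a compact, pipe-separated listing and parse it. The sketch below is an assumption-laden example: the helper names are hypothetical, the sample text is illustrative rather than real cluster output, and `list_jobs` only works on a machine where `squeue` is installed.

```python
import subprocess

# squeue format codes: %i = job ID, %T = job state, %j = job name.
SQUEUE_FORMAT = "%i|%T|%j"

def squeue_command(username):
    """Build the squeue call: -h drops the header, -o sets the output format."""
    return ["squeue", "-h", "-u", username, "-o", SQUEUE_FORMAT]

def parse_squeue(text):
    """Turn 'id|state|name' lines into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, state, name = line.strip().split("|", 2)
        jobs.append({"id": job_id, "state": state, "name": name})
    return jobs

def list_jobs(username):
    """Run squeue (needs a SLURM login node) and return the parsed jobs."""
    result = subprocess.run(
        squeue_command(username), capture_output=True, text=True, check=True
    )
    return parse_squeue(result.stdout)

# Illustrative input in the format requested above, not real output.
sample = "1001|RUNNING|ship\n1002|PENDING|my_first_job\n"
for job in parse_squeue(sample):
    print(job["id"], job["state"], job["name"])
```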

Canceling Jobs

If you need to cancel a job, use:

scancel <job_id>

Replace <job_id> with the actual ID of the job you want to cancel.

Key SLURM Commands

| Command | Purpose |
| --- | --- |
| sinfo | View available resources |
| squeue | See running/pending jobs |
| sbatch | Submit a batch job |
| srun | Launch a job step or interactive job |
| scancel | Cancel a running job |
| sacct | View accounting/history (if enabled) |

Try it out on TAMU FASTER

You need an ACCESS account and a TAMU account obtained through ACCESS.

Guide from https://hprc.tamu.edu/kb/User-Guides/FASTER/ACCESS-CI/#getting-an-access-account

Authorized ACCESS users can log in using the Web Portal:

Welcome To The CILogon OpenID Connect Authorization Service

Compose a job using Drona Composer

Click on Drona Composer

Set up your SLURM job using the GUI.

job name: ship_fractal

location: leave as is

Environments: Generic

Upload files: click select file and upload ship.py from your local machine (see the sample job code below).

Sample job code:

Make a file on your local machine called ship.py. This is the sample script that we will run on the HPC cluster.

```python
# ship.py

import numpy as np
import matplotlib.pyplot as plt

# Set image resolution
width, height = 1000, 1000
max_iter = 256

# Define viewing window in complex plane
xmin, xmax = -2.0, 1.5
ymin, ymax = -2.0, 0.5

# Generate complex grid
x = np.linspace(xmin, xmax, width)
y = np.linspace(ymin, ymax, height)
X, Y = np.meshgrid(x, y)
C = X + 1j * Y

# Initialize fractal iteration array
Z = np.zeros_like(C)
img = np.zeros(C.shape, dtype=int)

# Compute Burning Ship fractal
for i in range(max_iter):
    Z = (np.abs(Z.real) + 1j * np.abs(Z.imag))**2 + C
    mask = (img == 0) & (np.abs(Z) > 2)
    img[mask] = i

# Plot and save the result
plt.figure(figsize=(10, 10))
plt.imshow(img, cmap='hot', extent=(xmin, xmax, ymin, ymax))
plt.axis('off')
plt.tight_layout()
plt.savefig("burning_ship.png", dpi=300, bbox_inches='tight')
```
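The script above updates every pixel at once with NumPy. To make the iteration easier to follow, here is the same update written for a single point `c`. The function `escape_iteration` is a hypothetical helper, not part of ship.py, and unlike the script it returns `max_iter` (rather than 0) for points that never escape.

```python
def escape_iteration(c, max_iter=256):
    """Return the first iteration at which |z| exceeds 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        # Burning Ship step: take absolute values of both parts, square, add c.
        z = complex(abs(z.real), abs(z.imag)) ** 2 + c
        if abs(z) > 2:
            return i
    return max_iter

print(escape_iteration(0j))      # z stays at 0, so it never escapes
print(escape_iteration(1 + 1j))  # escapes on the second step
print(escape_iteration(2 + 2j))  # already outside radius 2 after the first step
```

Points that escape quickly map to low values in the image, and points that never escape form the dark body of the fractal.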

Number of Tasks: 1

No Accelerator

Total memory: 40GB

Expected Run Time: 10 Minutes

Project Account: Default one
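When choosing a memory request, a rough estimate of the script's main arrays can help. The sketch below is back-of-the-envelope arithmetic only: it assumes NumPy's usual element sizes on 64-bit Linux (8 bytes for float64 and default integers, 16 bytes for complex128) and ignores temporary arrays, the Python interpreter, and matplotlib.

```python
width, height = 1000, 1000  # image resolution used in ship.py
pixels = width * height

# Assumed bytes per element for each full-size array in ship.py.
bytes_per_element = {
    "X (float64)": 8,
    "Y (float64)": 8,
    "C (complex128)": 16,
    "Z (complex128)": 16,
    "img (int64)": 8,
}

total_bytes = sum(size * pixels for size in bytes_per_element.values())
print(f"approx. {total_bytes / 1e6:.0f} MB for the main arrays")
```

Under these assumptions the main arrays take tens of megabytes, so the memory request above leaves a wide margin for this job.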

Click Preview, then add the following code below the line that says ADD YOUR COMMANDS BELOW:


module load GCC/13.3.0 GCC/9.3.0  CUDA/11.0.2  OpenMPI/4.0.3  GCC/9.3.0  OpenMPI/4.0.3 iccifort/2020.1.217  impi/2019.7.217
module load  SciPy-bundle/2020.03-Python-3.8.2 matplotlib/3.2.1-Python-3.8.2
python ship.py

Your template.txt should look like this:

#!/bin/bash
#SBATCH --job-name=ship
#SBATCH --time=1:0:00 --mem=2G
#SBATCH --ntasks=1 --nodes=1 --cpus-per-task=1
#SBATCH --output=out.%j --error=error.%j
#SBATCH   --account=145332967756

module purge
module load WebProxy 
cd /scratch/user/u.sc126842/drona_composer/runs/ship
# ADD YOUR COMMANDS BELOW


module load GCC/13.3.0 GCC/9.3.0  CUDA/11.0.2  OpenMPI/4.0.3  GCC/9.3.0  OpenMPI/4.0.3 iccifort/2020.1.217  impi/2019.7.217
module load  SciPy-bundle/2020.03-Python-3.8.2 matplotlib/3.2.1-Python-3.8.2
python ship.py
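In the template, the `%j` in `--output=out.%j` and `--error=error.%j` is a placeholder that SLURM replaces with the job ID, so each run writes to its own files. The sketch below mimics that substitution to show what file names to expect. The helper `expand_filename_pattern` and the job ID are hypothetical, and only `%j` (job ID) and `%x` (job name) are handled; SLURM supports more placeholders.

```python
def expand_filename_pattern(pattern, job_id, job_name):
    """Expand the %j (job ID) and %x (job name) placeholders in a file name pattern."""
    return pattern.replace("%j", str(job_id)).replace("%x", job_name)

# 1234567 is a made-up job ID for illustration.
print(expand_filename_pattern("out.%j", 1234567, "ship"))
print(expand_filename_pattern("error.%j", 1234567, "ship"))
```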

Click submit.

Go back to the main dashboard and open Jobs, then Active Jobs, to view the job and its output files.

Once your job completes, go to Dashboard, Files, Scratch and follow a path like this to find the job output:

/scratch/user/u.sc126842/drona_composer/runs/ship

