Home
Created by Victor
This is a quick guide to get you started with the IUSS cluster. It is not exhaustive.
- shell habits that save time
- how to create environments for your projects
- Slurm patterns that prevent avoidable scheduler mistakes
- an sbatch builder for getting to a usable script quickly
You can download it as PDF.
Main Sections
Shell Bash
The shell is the operating surface of our HPC system.
By the end of this tiny tutorial you should be able to:
- know where you are
- find the file
- inspect the log
- confirm the environment
- stop retyping all commands
Request access to the HPC and follow the local onboarding instructions.
Now that you are logged in, you might ask yourself: what is this thing?
The cluster uses Bash.
Don’t trust me? Check:
echo $SHELL
You should see
/bin/bash
You are dealing with Unix, which has:
- kernel -> does the real work
- shell -> translates your commands
- programs -> the tools you run
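A quick way to see this layering in practice (a small illustration, nothing you need to memorise): ask the shell what a command actually is before the kernel ever gets involved.
type cd        # a shell builtin: handled by the shell itself
type ls        # often an alias or a program found via $PATH
which ls       # the program file the shell would hand to the kernel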
What Actually Happens
If you run
rm myfile
You might think:
You deleted a file
Reality:
- the shell finds rm
- the kernel executes it
- the file disappears forever
- no emotions, no undo
A few tips here and there
Type part of a command and press Tab. If it:
- completes -> good
- does not complete -> be more specific
You can always see the history of the commands you ran. Try it:
history
- up-arrow and down-arrow scroll through commands
- stop retyping everything
Always know your state, i.e. where you are, where your commands come from, and what you did. Here are the commands that can help:
pwd # where you are
echo $PATH # where commands come from
history # what you did
!! # rerun the last command
!204 # rerun command number 204 from your history
Moving around
is easy
pwd
ls -lah
cd /path/to/project
cd / # root, top of everything
cd - # go back to the previous directory
cd .. # go up one directory
df -h # check disk space
Level Up Using Aliases
because sometimes commands are too long, or you use some commands far more often than others.
Look at these shortcuts:
- ll -> useful listing
- gs -> git shortcut
- py -> python shortcut
- rm -i -> asks before deleting
During a session you can create these aliases like this:
alias ll='ls -lah'
alias gs='git status'
alias py='python3'
alias rm='rm -i'
Now typing ll runs ls -lah, and so on. Any command can be made into an alias.
Aliases created this way only last for the current shell session. When you close the terminal, they are gone.
Save them in your shell config file to make them persistent.
Open the .bashrc file
nano ~/.bashrc
Add:
alias rm='rm -i'
alias ll='ls -lah'
alias gs='git status'
alias py='python3'
Apply changes:
source ~/.bashrc
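To double-check that an alias is active in your current shell (a quick sanity check):
alias ll       # shows what ll expands to
type rm        # confirms rm now points to the rm -i alias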
You might want to
Stop deleting things like a villain
The default rm:
- deletes instantly
- no warning
- no recovery
A safer version: add an alias
alias rm='rm -i'
Now you get:
remove file? y/n
Document your work, but if you misplace a file you can still
Find your stuff before you panic
Use a wildcard * to find all files with a particular extension. You can also target specific folders. For:
- recent files -> -mtime -1
- errors -> grep
- fast log read -> tail
- jump to end -> less +G
examples
find . -name "*.out"
find . -type f -mtime -1
grep -R "Traceback" logs/
grep -R "error\|fail\|killed" logs/
tail -n 100 slurm-12345.out
less +G logs/run.err
Read the lost and found files
by using these common commands:
- cat -> small files
- less -> large logs
- head -> beginning
- tail -> end
examples:
cat config.yaml
less slurm-12345.out
head -n 20 run.sh
tail -n 40 slurm-12345.out
Permissions and Environment Problems
Check out the Environments section for more information on that topic.
As for permissions, they are expressed as three sets:
[user][group][others]
We use digits like 700, 750, etc. Each digit is a sum of:
- 4 = read (r)
- 2 = write (w)
- 1 = execute (x)
for example
| Mode | Meaning |
|---|---|
| 700 | Owner: rwx, Group: ---, Others: --- |
| 750 | Owner: rwx, Group: r-x, Others: --- |
| 755 | Owner: rwx, Group: r-x, Others: r-x |
| 644 | Owner: rw-, Group: r--, Others: r-- |
which you can interpret as follows.
700
- Only you (owner) can read/write/execute (rwx)
- Nobody else can even see inside
- Good for private scripts or sensitive data
750
- You: full access
- Group: read + execute
- Others: no access
755
- Everyone can read and execute; only the owner can write
- Typical for public scripts and system binaries
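To set these modes explicitly (a small illustration; the file names are placeholders for your own):
chmod 700 secret_script.sh   # private: only you
chmod 750 run.sh             # you have full access, your group can read and execute
ls -l run.sh                 # verify the new mode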
If your script fails, you might be missing the execute permission.
Check the script:
ls -l run.sh
If you see:
-rw-r--r-- run.sh
No x → cannot execute
Fix:
chmod +x run.sh
Directory permissions matter too, but I think you will discover this on your own.
Debug Strategy if you get an error
ls -l run.sh # permissions
which python # actual interpreter
module list # loaded modules
module purge # clean environment
echo $PATH # execution path
Storage
Keep your configs in your home directory, create a separate folder for each project, and use scratch for temporary data.
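A possible layout (the paths are illustrative; scratch locations are site-specific, so check your cluster documentation):
mkdir -p ~/projects/my_project        # code and configs live in home
mkdir -p /scratch/$USER/my_project    # large temporary data lives in scratch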
Check usage and limits
pwd # where am I?
du -sh . # size of current directory
du -sh * # size per subdirectory (very useful)
df -h # filesystem capacity
To move and transfer data, I find it better to compress first, as it
- reduces size
- avoids slow transfers caused by many small files (inodes)
tar -czf results.tar.gz results/
You can then extract the files:
tar -xzf results.tar.gz
I prefer rsync to transfer my data.
rsync -avh project/ backup/project/
It is:
- incremental
- preserves permissions/timestamps
- resumable
Add progress to the transfer
rsync -avh --progress project/ backup/project/
You can even sync remotely if you want to transfer data from your local laptop to the HPC:
rsync -avh project/ user@host:/path/project/
You can also use scp:
scp file.txt user@host:/path/
If you still want more
You may consult this tutorial by Flavio Copes on freeCodeCamp:
Or download his free book from:
Environments
This page explains how to avoid poisoning your jobs with the wrong Python interpreter, the wrong package set, or incorrect runtime assumptions. The goal is to save you a significant amount of debugging time.
Environments allow you to configure a Python setup for a specific purpose without creating conflicts between packages or package versions required by other applications. Although Python is the most common example, the same principle applies to other languages and toolchains.
You might wonder why this matters. On real systems, especially HPC clusters, you often need the same package in different versions because of dependency constraints. Without isolated environments, these requirements collide and break workflows.
On a cluster, the phrase “it worked yesterday” usually means one of the following happened:
- you loaded a different module
- you activated a different environment
- your batch job did not inherit the environment from your interactive shell
- you forgot which interpreter the job was actually using
Environments solve these problems by making your runtime explicit, reproducible, and isolated.
Always confirm what you are running:
which python
python --version
module list
echo $PATH
The main tools are:
- Conda environments
- Python virtual environments
- containers
Conda
To create an environment:
conda create -n myenv python=3.11
You will see something like
Retrieving notices: done
Channels:
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json):
You will be prompted to confirm the package installation; accept or decline as you wish, then let it finish.
Activate it:
conda activate myenv
Check what packages you have installed
conda list
Install even more packages:
conda install numpy pandas
pip install xarray
conda install -c conda-forge numpy scipy xarray
If you no longer need a package, remove it:
conda remove numpy
Deactivate
conda deactivate
You may end up in a situation where you have created several environments.
List them and find the one you want:
conda env list
Then activate it as before.
If you want to go even more geeky, you can create an environment from an environment.yml file.
First things first: what is an environment.yml file?
Simply put, this is a text file that describes:
- the environment name
- the channels to use
- the exact packages and versions
- optional pip packages
- optional variables
It is the most reproducible approach you can think of: the file saves the information, so even if you have removed your environment you can recreate the same one today, tomorrow, and next year.
It also allows your collaborators and supervisors to create the same environment. You get the idea.
Let's make the file:
nano environment.yml
Then you can type in the information:
name: myenv
channels:
- conda-forge
- defaults
dependencies:
- python=3.10
- numpy=1.26
- pandas=2.1
- xarray
- matplotlib
- pip
- pip:
- rasterio
- geopandas
Press CTRL+O and Enter to save, then CTRL+X to exit.
Explanation:
name: → the environment name
channels: → where Conda should search for packages
dependencies: → Conda packages
pip: → pip-only packages
Once you have an environment.yml file:
ls
environment.yml
Create the env:
conda env create -f environment.yml
If the environment already exists and you want to update it
conda env update -f environment.yml --prune
You can also export an environment.yml from an existing environment. This is a common thing, especially if you were testing your workflow on your local PC and now want to scale it up.
conda activate myenv
conda env export > environment.yml
This captures:
- exact versions
- build numbers
- channels
- pip packages
The only downside is that the exported file often contains too many hashes, local paths, and dependencies. Read the file and use your expert knowledge to trim it.
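One way to get a leaner file (a suggestion, assuming a reasonably recent Conda) is to export only the packages you explicitly asked for:
conda env export --from-history > environment.yml   # skips build strings and implicit dependencies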
🐥 The Sweet Duck says: "You can quench your thirst here."
Remove it completely when you are done:
conda remove -n myenv --all
Python Virtual Environments (venv)
I hear you say: you told us we can use plain Python, so why are you not telling us more? And you are right. You can use a plain Python environment.
venv is the built-in Python tool for creating isolated environments. It is lightweight, fast, and ideal when:
- you want minimal overhead
- you only need Python packages (no external libraries)
- you want a reproducible environment tied to a specific Python version
Each environment contains:
- its own Python interpreter
- its own site-packages directory
- its own pip installation
Create an environment in a directory
python3 -m venv .venv
This creates:
.venv/bin/ # Unix/MacOS executables
.venv/Scripts/ # Windows executables
.venv/lib/ # Python libraries
Don't forget that you can create it in your current directory or give a full path such as /home/user/project_env.
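For example (the path is just a placeholder; use whatever location suits your project):
python3 -m venv /home/user/project_env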
Activate
source .venv/bin/activate
Deactivate
deactivate
Install packages
pip install numpy pandas
or a specific version
pip install requests==2.31.0
You can also upgrade a package:
pip install --upgrade requests
and remove it
pip uninstall requests
Of course, you will want to list installed packages before anything:
pip list
Display details of a specific package:
pip show numpy
before deciding whether you want to install, update, and so on. You get the idea :)
As we did above with Conda environments, you can export and reproduce environments.
Save the list of installed packages to requirements.txt:
pip freeze > requirements.txt
Then recreate the environment from requirements.txt:
pip install -r requirements.txt
If you are done with the environment, purge it by simply deleting the directory:
rm -rf .venv
🐥 The Sweet Duck says: "Remember to quench your thirst here."
SLURM
A good place to start is to inspect the cluster you are actually using.
scontrol show node
Example summary
Partitions: cpu, gpu, long
Nodes: cpu001-cpu008, gpu001-gpu002
CPUs/node: inspect with `scontrol show node`
MaxArray: inspect your local limits
If you want to know node states:
sinfo -o '%N %t %C %m'
Tip:
mix = partially allocated. alloc = zero idle CPUs. Submit to a mix or idle node; SLURM handles placement.
You can also see how many CPUs are occupied and how many are free:
sinfo -N -o "%N %t %c %C %m"
If you want to narrow down to a specific node:
scontrol show node <nodename>
scontrol show node cpu003
The Most Important Distinction: Account ≠ Partition
There is a high chance that you are in different accounts with different wall-time limits:
| Account | Max wall time | How to use |
|---|---|---|
| default | site-specific | Applied automatically if you do not specify |
| project123 | site-specific | Must explicitly add --account=project123 |
Always confirm the right account for your project with your administrators or sacctmgr.
Check your own account associations:
sacctmgr show association user=$USER format=account,partition,qos,maxwall
If your job sits in PENDING with reason AssocMaxWallDurationPerJobLimit, your requested --time exceeds your account limit. The fix is often to use the correct account:
#SBATCH --partition=cpu
#SBATCH --account=project123
#SBATCH --time=2-00:00:00
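To see why a job is still pending (a quick check; the format string is just one reasonable choice):
squeue -u $USER -o "%.10i %.9P %.20j %.8T %.25R"   # last column shows the nodelist (running) or the reason (pending)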
Submit, Inspect, Cancel
sbatch run_job.sh # submit
squeue -u $USER # what is alive right now
sacct -j 12345 -X # what actually happened (includes finished/failed)
scancel 12345 # cancel one job
scancel 12345 12346 # cancel multiple at once
squeue only shows running/pending jobs. Once a job ends it disappears. Use sacct to see completed, failed, timed-out jobs — always check this before concluding a job “just vanished”:
sacct -u $USER --starttime=now-24hours \
--format=JobID,JobName,State,ExitCode,Elapsed -X
Common states in sacct:
| State | Meaning | What to do |
|---|---|---|
| COMPLETED | Exited 0 | Check results |
| FAILED | Non-zero exit | Read the .err log |
| TIMEOUT | Hit wall time | Increase --time or split workload |
| CANCELLED | Manually killed | Normal |
| NODE_FAIL | Hardware fault | Resubmit |
| PENDING reason AssocMaxWallDurationPerJobLimit | --time > account max | Add the correct --account |
Required header for every job
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=logs/%x_%j.log # %x = job name, %j = job ID
#SBATCH --error=logs/%x_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=06:00:00
#SBATCH --partition=cpu
#SBATCH --account=project123
set -euo pipefail # fail fast — don't silently continue on errors
mkdir -p logs # ensure logs directory exists
cd /path/to/project || exit 1
###############################################
# Activate your environment (choose ONE)
###############################################
### Option A: Activate a Conda environment
# module load anaconda3
# source ~/miniconda3/etc/profile.d/conda.sh
# conda activate project-env # replace with your environment name
### Option B: Activate a Python venv
# source .venv/bin/activate # replace with your venv path
###############################################
# Set the path to your R or Python script
###############################################
# For Python:
# PY_SCRIPT="/path/to/project/scripts/run_model.py"
# For R:
# R_SCRIPT="/path/to/project/scripts/run_analysis.R"
If you want to run Python with Conda, uncomment:
source ~/miniconda3/etc/profile.d/conda.sh
conda activate project-env
PY_SCRIPT="/path/to/project/scripts/run_model.py"
at the bottom of the script
python "$PY_SCRIPT"
If you want to run Python with venv
source .venv/bin/activate
PY_SCRIPT="/path/to/project/scripts/run_model.py"
then
python "$PY_SCRIPT"
The same goes for R; what changes is the final line:
Rscript "$R_SCRIPT"
🍃 Tip: You can use the Sbatch Builder to help you put together a usable header quickly.
CPUs: What You Ask For vs What Gets Used
#SBATCH --cpus-per-task=8 # reserves 8 cores on the node for this task
#SBATCH --ntasks=1 # 1 task (process)
- --cpus-per-task controls how many cores SLURM reserves; it directly affects node placement.
- R and Python are single-threaded by default. You are paying for cores you are not using.
- But the number you request controls how many tasks land on each node. This is critical for job arrays.
Rule of thumb: tune --cpus-per-task to match both your software behavior and your site policy.
| cpus-per-task | Max tasks per node | Spread behaviour |
|---|---|---|
| 1 | site-specific | Often packs many tasks onto one node |
| 4 | site-specific | Moderate spread |
| 8 | site-specific | Wider spread |
| full node | site-specific | Exclusive or near-exclusive placement |
If you have any shared I/O between tasks, request 8+ CPUs per task to force SLURM to spread tasks across nodes and avoid resource contention — even if your process only uses 1 core.
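If your code actually is multi-threaded, also tell it how many cores it was given (a common pattern; adjust the variable names to your libraries):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK        # OpenMP and many numeric libraries
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK   # BLAS-backed NumPy builds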
Job arrays are the right way to parallelise
Use arrays when you have N independent jobs that differ only by an index (different starting points, folds, seeds, etc.).
#SBATCH --array=1-8 # runs tasks 1, 2, 3 ... 8
#SBATCH --array=1-8%4 # runs at most 4 simultaneously
#SBATCH --array=1-3 # 3 tasks (e.g. for CRS2-LM global)
Inside the script, the current task index is $SLURM_ARRAY_TASK_ID:
export START_IDX=$SLURM_ARRAY_TASK_ID
export TASK_RUNS_DIR="runs/algo_s${SLURM_ARRAY_TASK_ID}"
mkdir -p "$TASK_RUNS_DIR"
Rscript --vanilla scripts/my_calibration.R
Log naming for arrays — use both job array ID (%A) and task ID (%a):
#SBATCH --output=logs/myjob_%A_%a.log
#SBATCH --error=logs/myjob_%A_%a.err
This gives you logs/myjob_12345_1.log, logs/myjob_12345_2.log, etc. — one file per task.
Arrays are cleaner than N copied scripts and avoid the wall-time problem: instead of one job running N starts sequentially (hitting the wall-time limit after 2), you run N jobs in parallel — each start on its own node, each using only the time it actually needs.
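A common pattern (a sketch; params.txt is a hypothetical file with one parameter set per line) is to map the task index onto a line of a parameter file:
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)   # pick line N for task N
echo "Task $SLURM_ARRAY_TASK_ID runs with: $PARAMS"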
Excluding Problematic Nodes
If specific nodes are overloaded or have known issues, exclude them explicitly:
#SBATCH --exclude=cpu003,cpu008
Check which nodes are fully allocated before submitting:
sinfo -o '%N %t %C'
# Look for 'alloc' state = zero idle CPUs
Dependencies — Chaining Jobs
job1=$(sbatch --parsable preprocess.sh)
sbatch --dependency=afterok:${job1} analyze.sh
- afterok: only run if the previous job succeeded (exit 0)
- afterany: run regardless of outcome
- afternotok: only run if the previous job failed
Useful for: run calibration → then automatically run validation when calibration finishes.
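A minimal sketch of such a chain (script names are placeholders for your own):
calib=$(sbatch --parsable calibrate.sh)
valid=$(sbatch --parsable --dependency=afterok:${calib} validate.sh)
sbatch --dependency=afterany:${valid} cleanup.sh   # runs once validation ends, whatever its outcome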
Checking What Actually Ran
# Summary of last 24h
sacct -u $USER --starttime=now-24hours \
--format=JobID,JobName,Partition,Account,State,ExitCode,Elapsed -X
# Detailed resource usage for a specific job
sacct -j 12345 --format=JobID,CPUTime,MaxRSS,Elapsed,State
# All tasks of an array
sacct -j 12345 --format=JobID,State,ExitCode,Elapsed
Diagnosing Failures
| Symptom | Cause | Fix |
|---|---|---|
| Exit code 127 | Rscript / python not in PATH | Add module load R / module load miniconda3 |
| AssocMaxWallDurationPerJobLimit | --time exceeds account default | Add the correct --account |
| Job runs longer than expected then TIMEOUT | Wall time set too low | Increase --time within your account limits |
| All array tasks on one node, contention | --cpus-per-task=1 causes SLURM packing | Use --cpus-per-task=8 to force spreading |
| Apptainer file conflicts | Tasks sharing runs/ directory | Give each task runs/${ALGO}_s${SLURM_ARRAY_TASK_ID}/ |
| Objective = Inf | APSIM wrapper crash, params not passed | Check .err log; verify APSIM_EXE path exists |
| PENDING forever | Node unavailable or resource mismatch | Check squeue reason column; try excluding problematic nodes |
Minimal Working Template
#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --output=logs/%x_%A_%a.log
#SBATCH --error=logs/%x_%A_%a.err
#SBATCH --array=1-8
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=2-00:00:00
#SBATCH --partition=cpu
#SBATCH --account=project123
#SBATCH --exclude=cpu003,cpu008
set -euo pipefail
module load R
module load singularity 2>/dev/null || true
cd /path/to/project || exit 1
mkdir -p logs output runs/task_${SLURM_ARRAY_TASK_ID}
export APSIM_EXE="/path/to/project/apsim_wrapper.sh"
export TASK_RUNS_DIR="runs/task_${SLURM_ARRAY_TASK_ID}"
export START_IDX=$SLURM_ARRAY_TASK_ID
echo "Task $START_IDX Node=$SLURMD_NODENAME Start=$(date)"
Rscript --vanilla scripts/my_script.R
echo "Done $START_IDX $(date)"
Quick Reference
# Submit
sbatch run_job.sh
# Monitor
squeue -u $USER -o '%.10i %.14j %.8T %.10M %N'
sacct -u $USER --starttime=now-24hours --format=JobID,JobName,State,ExitCode,Elapsed -X
# Cancel
scancel 12345 # one job
scancel 12345 12346 12347 # several at once
# Check node availability
sinfo -o '%N %t %C' # CPUS: Allocated/Idle/Other/Total
# Check your account limits
sacctmgr show association user=$USER format=account,partition,qos,maxwall
Sbatch Builder
This tool generates a starter sbatch script in the browser.
It starts from the same minimal array-job template shown in the Slurm guide, so the defaults are already shaped like a working example rather than a blank form.
It does not submit jobs. It does not validate partition names. It does not know your cluster better than you do.
Treat it as a fast draft generator that still needs site-specific values from your own cluster.