
Created by Victor

This is a quick guide to getting you started with the IUSS cluster. It's not exhaustive. It covers:

  • shell habits that save time
  • how to create environments for your projects
  • Slurm patterns that prevent avoidable scheduler mistakes
  • an sbatch builder for getting to a usable script quickly

You can download it as a PDF.

Main Sections

Shell (Bash)

The shell is the operating surface of our HPC system.

By the end of this tiny tutorial you should be able to:

  • know where you are
  • find the file
  • inspect the log
  • confirm the environment
  • stop retyping all commands

Request access to the HPC and follow the local onboarding instructions.

Now that you are logged in, you might ask yourself: what is this thing?

The cluster uses Bash.

Don’t trust me? Check:

echo $SHELL

You should see

/bin/bash

You are dealing with Unix, which has:

  • kernel -> does the real work
  • shell -> translates your commands
  • programs -> the tools you run

What Actually Happens

If you run

rm myfile

You might think:

You deleted a file

Reality:

  • shell finds rm
  • kernel executes it
  • file disappears forever
  • no emotions, no undo
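
You can watch this lookup happen yourself. These two standard bash commands show how a name is resolved before anything runs:

type rm          # reports whether rm is an alias, a builtin, or a file on disk
command -v rm    # prints what would be executed, e.g. /usr/bin/rm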

A few tips here and there

Type part of a command and press Tab. If it:

  • completes -> good
  • does not complete -> be more specific

You can always see the history of the commands you ran. Try it:

history

  • up-arrow and down-arrow scroll through commands
  • stop retyping everything

Always know your state, i.e. where you are, what is on your PATH, and what you ran. Here are the commands that can help:

pwd           # where you are
echo $PATH    # where commands come from
history       # what you did
!!            # rerun the last command
!204          # rerun command number 204 from your history

Moving around

It is easy:

pwd
ls -lah
cd /path/to/project
cd /                  # root, top of everything
cd -                  # go back to the previous directory
cd ..                 # go up one directory
df -h                 # check disk space

Level Up Using Aliases

Sometimes commands are too long, or you use some commands much more often than others.

Look at these shortcuts:

  • ll -> useful listing
  • gs -> git shortcut
  • py -> python shortcut
  • rm -i -> asks before deleting

During a session you can create aliases for them like this:

alias ll='ls -lah'
alias gs='git status'
alias py='python3'
alias rm='rm -i'

Now typing ll runs ls -lah, and so on. Any command can be made into an alias.
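
If you ever forget what an alias expands to, ask the shell:

type ll
# prints something like: ll is aliased to `ls -lah'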

Aliases created this way only last for the current shell session. When you close the terminal, they are gone.

Save them in your shell config file to make them persistent.

Open the .bashrc file:

nano ~/.bashrc

Add:

alias rm='rm -i'
alias ll='ls -lah'
alias gs='git status'
alias py='python3'

Apply changes:

source ~/.bashrc

You might want to stop deleting things like a villain.

The default rm:

  • deletes instantly
  • no warning
  • no recovery

A safer version: add an alias.

alias rm='rm -i'

Now you get:

remove file? y/n

Document your work, but if you misplace a file you can still:

Find your stuff before you panic

Use a wildcard * to find all files with a particular extension. You can also target particular folders. For:

  • recent files -> -mtime -1
  • errors -> grep
  • fast log read -> tail
  • jump to end -> less +G

Examples:

find . -name "*.out"
find . -type f -mtime -1
grep -R "Traceback" logs/
grep -R "error\|fail\|killed" logs/
tail -n 100 slurm-12345.out
less +G logs/run.err

Read the lost and found files

using these common commands:

  • cat -> small files
  • less -> large logs
  • head -> beginning
  • tail -> end

Examples:

cat config.yaml
less slurm-12345.out
head -n 20 run.sh
tail -n 40 slurm-12345.out

Permissions and Environment Problems

Check out the Environments section for more information on environment problems.

As for permissions, they are expressed as three sets:

[user][group][others]

Modes are written with digits like 700, 750, etc. Each digit is a sum of:

  • 4 = read (r)
  • 2 = write (w)
  • 1 = execute (x)

for example

Mode   Meaning
700    Owner: rwx, Group: ---, Others: ---
750    Owner: rwx, Group: r-x, Others: ---
755    Owner: rwx, Group: r-x, Others: r-x
644    Owner: rw-, Group: r--, Others: r--

which you can interpret as follows.

700

  • Only you (the owner) can read/write/execute (rwx)
  • Nobody else can even see inside
  • Good for private scripts or sensitive data

750

  • You: full access
  • Group: read + execute
  • Others: no access

755

  • Everyone can execute
  • Typical for public scripts and system binaries

If your script fails, you might be missing the execute permission.

Check the script:

ls -l run.sh

If you see:

-rw-r--r-- run.sh

No x → cannot execute

Fix:

chmod +x run.sh
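
If you prefer the numeric modes from the table above, the same fix can be combined with tighter permissions (750 is just an example choice):

chmod 750 run.sh    # owner: rwx, group: r-x, others: no access
ls -l run.sh        # should now start with -rwxr-x---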

Directory permissions matter too, but I think you will discover this on your own.

Debug Strategy if you get an error

ls -l run.sh        # permissions
which python        # actual interpreter
module list         # loaded modules
module purge        # clean environment
echo $PATH          # execution path

Storage

Keep your configs in your home directory. Create a folder for each project and use scratch for temporary data.

Check usage and limits

pwd              # where am I?
du -sh .         # size of current directory
du -sh *         # size per subdirectory (very useful)
df -h            # filesystem capacity
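
To spot the biggest directories quickly, pipe the per-directory sizes through sort (assuming GNU sort, whose -h flag understands the human-readable suffixes):

du -sh * | sort -h    # smallest first; the last lines are your space hogs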

To move and transfer data, I find it better to compress first, as it:

  • reduces size
  • avoids slow transfers due to many inodes/files

tar -czf results.tar.gz results/

You can then extract the files:

tar -xzf results.tar.gz
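
If you are unsure what an archive contains, you can list it without unpacking anything:

tar -tzf results.tar.gz | head    # show the first entries only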

I prefer rsync to transfer my data.

rsync -avh project/ backup/project/

It is:

  • incremental
  • preserves permissions/timestamps
  • resumable

Add progress to the transfer:

rsync -avh --progress project/ backup/project/

and even sync remotely if you want to transfer data from your local laptop to the HPC:

rsync -avh project/ user@host:/path/project/

You can also use scp:

scp file.txt user@host:/path/
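
scp can also copy whole directories with -r:

scp -r results/ user@host:/path/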

If you still want more

You may consult the tutorial by Flavio Copes on freeCodeCamp, or download his free book.

Environments

This page explains how to avoid poisoning your jobs with the wrong Python interpreter, the wrong package set, or incorrect runtime assumptions. The goal is to save you a significant amount of debugging time.

Environments allow you to configure a Python setup for a specific purpose without creating conflicts between packages or package versions required by other applications. Although Python is the most common example, the same principle applies to other languages and toolchains.

You might wonder why this matters. On real systems, especially HPC clusters, you often need the same package in different versions because of dependency constraints. Without isolated environments, these requirements collide and break workflows.

On a cluster, the phrase “it worked yesterday” usually means one of the following happened:

  • you loaded a different module

  • you activated a different environment

  • your batch job did not inherit the environment from your interactive shell

  • you forgot which interpreter the job was actually using

Environments solve these problems by making your runtime explicit, reproducible, and isolated.

Always confirm what you are running:

which python
python --version
module list
echo $PATH

The main tools are:

  • Conda environments
  • Python virtual environments
  • containers

Conda

To create an environment:

conda create -n myenv python=3.11

You will see something like

Retrieving notices: done
Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): 

You will be prompted to accept the package installations; accept or deny as you wish. Let that finish.

Activate it:

conda activate myenv

Check what packages you have installed:

conda list

Install even more beasty packages:

conda install numpy pandas
pip install xarray
conda install -c conda-forge numpy scipy xarray

If you are bored with a package, remove it:

conda remove numpy

Deactivate

conda deactivate

You may end up in a situation where you have created several environments.

List them and find your desired environment:

conda env list

Then activate it as usual.

If you want to go even more geeky, you can create an environment from an environment.yml file.

First things first: what is an environment.yml file?

Simply put this is a text file that describes:

  • the environment name

  • the channels to use

  • the exact packages and versions

  • optional pip packages

  • optional variables

It is the most reproducible way you can think of: the file saves the information, so even if you have removed your environment you can recreate the same environment today, tomorrow, and next year.

It also allows your collaborators, supervisors, etc. to create the same environment. You get the idea.

Let's make the file:

nano environment.yml

Then you can type the information:

name: myenv
channels:
  - conda-forge
  - defaults

dependencies:
  - python=3.10
  - numpy=1.26
  - pandas=2.1
  - xarray
  - matplotlib
  - pip
  - pip:
      - rasterio
      - geopandas

Save with CTRL+O, Enter, then exit with CTRL+X.

Explanation:

name: → the environment name

channels: → where Conda should search for packages

dependencies: → Conda packages

pip: → pip-only packages

Once you have an environment.yml file:

ls 
environment.yml

create the env

conda env create -f environment.yml

If the environment already exists and you want to update it:

conda env update -f environment.yml --prune

You can also export an existing environment and recreate it elsewhere. It's a common thing, especially if you were testing your workflow on a local PC and want to scale it up:

conda activate myenv
conda env export > environment.yml

This captures:

  • exact versions
  • build numbers
  • channels
  • pip packages

The only downside is that the exported file often contains too many hashes, local paths, and dependencies. Read the file and use your expert knowledge.
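
If you want a leaner file, conda can export only the packages you explicitly asked for (note: this variant may omit pip-installed packages):

conda env export --from-history > environment.yml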

🐥 The Sweet Duck says: “You can quench your thirst here.”

Remove it completely when you are done:

conda remove -n myenv --all

Python Virtual Environments (venv)

I hear you say: you said we can use Python, so why the hell am I not telling you more? And you are right. You can use a plain Python environment.

venv is the built-in Python tool for creating isolated environments. It is lightweight, fast, and ideal when:

  • you want minimal overhead
  • you only need Python packages (no external libraries)
  • you want a reproducible environment tied to a specific Python version

Each environment contains:

  • its own Python interpreter
  • its own site-packages directory
  • its own pip installation

Create an environment in a directory

python3 -m venv .venv

This creates:

.venv/bin/      # Unix/MacOS executables
.venv/Scripts/  # Windows executables
.venv/lib/      # Python libraries

Don't forget that you can create the environment in your current directory or give a path, e.g. python3 -m venv /home/user/project_env.

Activate

source .venv/bin/activate
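
After activating, confirm that the interpreter now points inside the environment:

which python       # should end in .venv/bin/python
python --version   # the version the venv was created with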

Deactivate

deactivate

Install packages

pip install numpy pandas

or a specific version

pip install requests==2.31.0

You can also upgrade a package:

pip install --upgrade requests

and remove it

pip uninstall requests

Of course, you want to list the installed packages before anything:

pip list

and display the details of a package

pip show numpy

before deciding whether to install, update, and so on. You get the idea :)

As we did above with Conda environments, you can export and reproduce environments.

Save the list of installed packages to requirements.txt:

pip freeze > requirements.txt
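
The resulting file is just pinned package names, one per line. For example (versions shown are only illustrative):

numpy==1.26.4
pandas==2.1.4
requests==2.31.0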

Then recreate the environment from requirements.txt:

pip install -r requirements.txt

I know you can get bored of these environments, so purge one by simply deleting the directory:

rm -rf .venv

🐥 The Sweet Duck says: “Remember to quench your thirst here.”

SLURM

A good place to start is to inspect the cluster you are actually using.

scontrol show node 

Example summary

Partitions: cpu, gpu, long
Nodes:      cpu001-cpu008, gpu001-gpu002
CPUs/node:  inspect with `scontrol show node`
MaxArray:   inspect your local limits

If you want to know node states:

sinfo -o '%N %t %C %m'

Tip: mix = partially allocated, alloc = zero idle CPUs. Target partitions with mix or idle nodes; SLURM handles placement.

You can also see how many CPUs are occupied and how many are free:

sinfo -N -o "%N %t %c %C %m"

If you want to narrow down to a specific node:

scontrol show node <nodename>
scontrol show node cpu003

The Most Important Distinction: Account ≠ Partition

There is a high chance that you are in different accounts with different wall-time limits:

Account      Max wall time    How to use
default      site-specific    Applied automatically if you do not specify
project123   site-specific    Must explicitly add --account=project123

Always confirm the right account for your project with your administrators or sacctmgr.

Check your own account associations:

sacctmgr show association user=$USER format=account,partition,qos,maxwall

If your job sits in PENDING with reason AssocMaxWallDurationPerJobLimit, your requested --time exceeds your account limit. The fix is often to use the correct account:

#SBATCH --partition=cpu
#SBATCH --account=project123
#SBATCH --time=2-00:00:00

Submit, Inspect, Cancel

sbatch run_job.sh          # submit
squeue -u $USER            # what is alive right now
sacct -j 12345 -X          # what actually happened (includes finished/failed)
scancel 12345              # cancel one job
scancel 12345 12346        # cancel multiple at once

squeue only shows running/pending jobs. Once a job ends it disappears. Use sacct to see completed, failed, timed-out jobs — always check this before concluding a job “just vanished”:

sacct -u $USER --starttime=now-24hours \
  --format=JobID,JobName,State,ExitCode,Elapsed -X

Common states in sacct:

  • COMPLETED: exited 0 -> check results
  • FAILED: non-zero exit -> read the .err log
  • TIMEOUT: hit wall time -> increase --time or split the workload
  • CANCELLED: manually killed -> normal
  • NODE_FAIL: hardware fault -> resubmit
  • PENDING with reason AssocMaxWallDurationPerJobLimit: --time > account max -> add the correct --account

Required header for every job

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=logs/%x_%j.log      # %x = job name, %j = job ID
#SBATCH --error=logs/%x_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=06:00:00
#SBATCH --partition=cpu
#SBATCH --account=project123

set -euo pipefail                    # fail fast — don't silently continue on errors

mkdir -p logs                        # ensure logs directory exists
cd /path/to/project || exit 1

###############################################
# Activate your environment (choose ONE)
###############################################

### Option A: Activate a Conda environment
# module load anaconda3            
# source ~/miniconda3/etc/profile.d/conda.sh
# conda activate project-env       # replace with your environment name

### Option B: Activate a Python venv
# source .venv/bin/activate        # replace with your venv path

###############################################
# Set the path to your R or Python script
###############################################

# For Python:
# PY_SCRIPT="/path/to/project/scripts/run_model.py"

# For R:
# R_SCRIPT="/path/to/project/scripts/run_analysis.R"

If you want to run Python with Conda, uncomment:

source ~/miniconda3/etc/profile.d/conda.sh
conda activate project-env
PY_SCRIPT="/path/to/project/scripts/run_model.py"

and, at the bottom of the script, run:

python "$PY_SCRIPT"

If you want to run Python with a venv, uncomment:

source .venv/bin/activate
PY_SCRIPT="/path/to/project/scripts/run_model.py"

then, at the bottom:

python "$PY_SCRIPT"

The same goes for R; what changes is the final command:

Rscript "$R_SCRIPT"

🍃 Tip: You can use the Sbatch Builder to help you come up with a usable header quickly.

CPUs: What You Ask For vs What Gets Used

#SBATCH --cpus-per-task=8    # reserves 8 cores on the node for this task
#SBATCH --ntasks=1           # 1 task (process)

  • --cpus-per-task controls how many cores SLURM reserves — it directly affects node placement.
  • R and Python are single-threaded by default. You are paying for cores you are not using.
  • But the number you request controls how many tasks land on each node. This is critical for job arrays.

Rule of thumb: tune --cpus-per-task to match both your software behavior and your site policy.

cpus-per-task   Max tasks per node   Spread behaviour
1               site-specific        Often packs many tasks onto one node
4               site-specific        Moderate spread
8               site-specific        Wider spread
full node       site-specific        Exclusive or near-exclusive placement

If you have any shared I/O between tasks, request 8+ CPUs per task to force SLURM to spread tasks across nodes and avoid resource contention — even if your process only uses 1 core.

Job arrays are the right way to parallelise

Use arrays when you have N independent jobs that differ only by an index (different starting points, folds, seeds, etc.).

#SBATCH --array=1-8                  # runs tasks 1, 2, 3 ... 8
#SBATCH --array=1-8%4                # runs at most 4 simultaneously
#SBATCH --array=1-3                  # 3 tasks (e.g. for CRS2-LM global)

Inside the script, the current task index is $SLURM_ARRAY_TASK_ID:

export START_IDX=$SLURM_ARRAY_TASK_ID
export TASK_RUNS_DIR="runs/algo_s${SLURM_ARRAY_TASK_ID}"
mkdir -p "$TASK_RUNS_DIR"
Rscript --vanilla scripts/my_calibration.R
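
A common pattern is to map the task index to one line of a parameter file. A minimal sketch, assuming a hypothetical params.txt with one configuration per line:

PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)   # pick line N for task N
Rscript --vanilla scripts/my_calibration.R $PARAMS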

Log naming for arrays — use both job array ID (%A) and task ID (%a):

#SBATCH --output=logs/myjob_%A_%a.log
#SBATCH --error=logs/myjob_%A_%a.err

This gives you logs/myjob_12345_1.log, logs/myjob_12345_2.log, etc. — one file per task.

Arrays are cleaner than N copied scripts and avoid the wall-time problem: instead of one job running N starts sequentially (hitting the wall-time limit after 2), you run N jobs in parallel — each start on its own node, each using only the time it actually needs.

Excluding Problematic Nodes

If specific nodes are overloaded or have known issues, exclude them explicitly:

#SBATCH --exclude=cpu003,cpu008

Check which nodes are fully allocated before submitting:

sinfo -o '%N %t %C'
# Look for 'alloc' state = zero idle CPUs

Dependencies — Chaining Jobs

job1=$(sbatch --parsable preprocess.sh)
sbatch --dependency=afterok:${job1} analyze.sh

  • afterok: only run if previous job succeeded (exit 0)
  • afterany: run regardless of outcome
  • afternotok: only run if previous job failed

Useful for: run calibration → then automatically run validation when calibration finishes.
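
For example, a three-stage pipeline could be wired up like this (a sketch; preprocess.sh, calibrate.sh, and validate.sh are placeholder script names):

pre=$(sbatch --parsable preprocess.sh)
cal=$(sbatch --parsable --dependency=afterok:${pre} calibrate.sh)
sbatch --dependency=afterok:${cal} validate.sh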

Checking What Actually Ran

# Summary of last 24h
sacct -u $USER --starttime=now-24hours \
  --format=JobID,JobName,Partition,Account,State,ExitCode,Elapsed -X

# Detailed resource usage for a specific job
sacct -j 12345 --format=JobID,CPUTime,MaxRSS,Elapsed,State

# All tasks of an array
sacct -j 12345 --format=JobID,State,ExitCode,Elapsed

Diagnosing Failures

  • Exit code 127: Rscript / python not in PATH -> add module load R / module load miniconda3
  • AssocMaxWallDurationPerJobLimit: --time exceeds the account default -> add the correct --account
  • Job runs longer than expected, then TIMEOUT: wall time set too low -> increase --time within your account limits
  • All array tasks on one node, contention: --cpus-per-task=1 causes SLURM packing -> use --cpus-per-task=8 to force spreading
  • Apptainer file conflicts: tasks sharing the runs/ directory -> give each task runs/${ALGO}_s${SLURM_ARRAY_TASK_ID}/
  • Objective = Inf: APSIM wrapper crash, params not passed -> check the .err log; verify the APSIM_EXE path exists
  • PENDING forever: node unavailable or resource mismatch -> check the squeue reason column; try excluding problematic nodes

Minimal Working Template

#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --output=logs/%x_%A_%a.log
#SBATCH --error=logs/%x_%A_%a.err
#SBATCH --array=1-8
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=2-00:00:00
#SBATCH --partition=cpu
#SBATCH --account=project123
#SBATCH --exclude=cpu003,cpu008

set -euo pipefail
module load R
module load singularity 2>/dev/null || true

cd /path/to/project || exit 1
mkdir -p logs output runs/task_${SLURM_ARRAY_TASK_ID}

export APSIM_EXE="/path/to/project/apsim_wrapper.sh"
export TASK_RUNS_DIR="runs/task_${SLURM_ARRAY_TASK_ID}"
export START_IDX=$SLURM_ARRAY_TASK_ID

echo "Task $START_IDX  Node=$SLURMD_NODENAME  Start=$(date)"
Rscript --vanilla scripts/my_script.R
echo "Done $START_IDX  $(date)"

Quick Reference

# Submit
sbatch run_job.sh

# Monitor
squeue -u $USER -o '%.10i %.14j %.8T %.10M %N'
sacct -u $USER --starttime=now-24hours --format=JobID,JobName,State,ExitCode,Elapsed -X

# Cancel
scancel 12345                    # one job
scancel 12345 12346 12347        # several at once

# Check node availability
sinfo -o '%N %t %C'              # CPUS: Allocated/Idle/Other/Total

# Check your account limits
sacctmgr show association user=$USER format=account,partition,qos,maxwall

Sbatch Builder

This tool generates a starter sbatch script in the browser.

It starts from the same minimal array-job template shown in the Slurm guide, so the defaults are already shaped like a working example rather than a blank form.

It does not submit jobs. It does not validate partition names. It does not know your cluster better than you do.

Treat it as a fast draft generator that still needs site-specific values from your own cluster.