Home
Created by Victor
This is a quick guide to get you started with the IUSS cluster. It is not exhaustive.
- shell habits that save time
- how to create environments for your projects
- Slurm patterns that prevent avoidable scheduler mistakes
- an sbatch builder for getting to a usable script quickly
You can download it as PDF.
Main Sections
Shell Bash
The shell is the operating surface of our HPC system.
By the end of this tiny tutorial you should be able to:
- know where you are
- find the file
- inspect the log
- confirm the environment
- stop retyping all commands
Request access to the HPC and follow the local onboarding instructions.
Now that you are logged in, you might ask yourself: what is this thing?
The cluster uses Bash.
Don’t trust me? Check:
echo $SHELL
You should see
/bin/bash
You are dealing with Unix, which has:
- kernel -> does the real work
- shell -> translates your commands
- programs -> the tools you run
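A quick way to see this layering in practice (a small illustration, nothing you need to memorise): ask the shell what a command actually is before the kernel ever gets involved.
type cd        # a shell builtin: handled by the shell itself
type ls        # often an alias or a program found via $PATH
which ls       # the program file the shell would hand to the kernel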
What Actually Happens
If you run
rm myfile
You might think:
You deleted a file
Reality:
- the shell finds rm
- the kernel executes it
- the file disappears forever
- no emotions, no undo
A few tips here and there
Type part of a command and press Tab. If it:
- completes -> good
- does not complete -> be more specific
You can always see the history of the commands you ran. Try it:
history
- up-arrow and down-arrow scroll through commands
- stop retyping everything
Always know your state, i.e. where you are, where your commands come from, and what you did. Here are the commands that can help:
pwd # where you are
echo $PATH # where commands come from
history # what you did
!! # rerun the last command
!204 # rerun command number 204 from your history
Moving around
is easy
pwd
ls -lah
cd /path/to/project
cd / # root, top of everything
cd - # go back to the previous directory
cd .. # go up one directory
df -h # check disk space
Level Up Using Aliases
because sometimes commands are too long, or you use some commands far more often than others.
Look at these shortcuts:
- ll -> useful listing
- gs -> git shortcut
- py -> python shortcut
- rm -i -> asks before deleting
During a session you can create these aliases like this:
alias ll='ls -lah'
alias gs='git status'
alias py='python3'
alias rm='rm -i'
Now typing ll runs ls -lah, and so on. Any command can be made into an alias.
Aliases created this way only last for the current shell session. When you close the terminal, they are gone.
Save them in your shell config file to make them persistent.
Open the .bashrc file
nano ~/.bashrc
Add:
alias rm='rm -i'
alias ll='ls -lah'
alias gs='git status'
alias py='python3'
Apply changes:
source ~/.bashrc
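To double-check that an alias is active in your current shell (a quick sanity check):
alias ll       # shows what ll expands to
type rm        # confirms rm now points to the rm -i alias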
You might want to
Stop deleting things like a villain
The default rm:
- deletes instantly
- no warning
- no recovery
A safer version: add an alias
alias rm='rm -i'
Now you get:
remove file? y/n
Document your work, but if you misplace a file you can still
Find your stuff before you panic
Use a wildcard * to find all files with a particular extension. You can also target specific folders. For:
- recent files -> -mtime -1
- errors -> grep
- fast log read -> tail
- jump to end -> less +G
examples
find . -name "*.out"
find . -type f -mtime -1
grep -R "Traceback" logs/
grep -R "error\|fail\|killed" logs/
tail -n 100 slurm-12345.out
less +G logs/run.err
Read the lost and found files
by using these common commands:
- cat -> small files
- less -> large logs
- head -> beginning
- tail -> end
examples:
cat config.yaml
less slurm-12345.out
head -n 20 run.sh
tail -n 40 slurm-12345.out
Permissions and Environment Problems
Check out the Environments section for more information on that topic.
As for permissions, they are expressed as three sets:
[user][group][others]
We use digits like 700, 750, etc. Each digit is a sum of:
- 4 = read (r)
- 2 = write (w)
- 1 = execute (x)
for example
| Mode | Meaning |
|---|---|
| 700 | Owner: rwx, Group: ---, Others: --- |
| 750 | Owner: rwx, Group: r-x, Others: --- |
| 755 | Owner: rwx, Group: r-x, Others: r-x |
| 644 | Owner: rw-, Group: r--, Others: r-- |
which you can interpret as follows.
700
- Only you (owner) can read/write/execute (rwx)
- Nobody else can even see inside
- Good for private scripts or sensitive data
750
- You: full access
- Group: read + execute
- Others: no access
755
- Everyone can read and execute; only the owner can write
- Typical for public scripts and system binaries
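To set these modes explicitly (a small illustration; the file names are placeholders for your own):
chmod 700 secret_script.sh   # private: only you
chmod 750 run.sh             # you have full access, your group can read and execute
ls -l run.sh                 # verify the new mode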
If your script fails, you might be missing the execute permission.
Check the script:
ls -l run.sh
If you see:
-rw-r--r-- run.sh
No x → cannot execute
Fix:
chmod +x run.sh
Directory permissions matter too, but I think you will discover this on your own.
Debug Strategy if you get an error
ls -l run.sh # permissions
which python # actual interpreter
module list # loaded modules
module purge # clean environment
echo $PATH # execution path
Storage
Keep your configs in your home directory, create a separate folder for each project, and use scratch for temporary data.
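A possible layout (the paths are illustrative; scratch locations are site-specific, so check your cluster documentation):
mkdir -p ~/projects/my_project        # code and configs live in home
mkdir -p /scratch/$USER/my_project    # large temporary data lives in scratch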
Check usage and limits
pwd # where am I?
du -sh . # size of current directory
du -sh * # size per subdirectory (very useful)
df -h # filesystem capacity
To move and transfer data, I find it better to compress first, as it
- reduces size
- avoids slow transfers caused by many small files (inodes)
tar -czf results.tar.gz results/
You can then extract the files:
tar -xzf results.tar.gz
I prefer rsync to transfer my data.
rsync -avh project/ backup/project/
It is:
- incremental
- preserves permissions/timestamps
- resumable
Add progress to the transfer
rsync -avh --progress project/ backup/project/
You can even sync remotely if you want to transfer data from your local laptop to the HPC:
rsync -avh project/ user@host:/path/project/
You can also use scp:
scp file.txt user@host:/path/
If you still want more
You may consult this tutorial by Flavio Copes on freeCodeCamp:
Or download his free book from:
Environments
This page explains how to avoid poisoning your jobs with the wrong Python interpreter, the wrong package set, or incorrect runtime assumptions. The goal is to save you a significant amount of debugging time.
Environments allow you to configure a Python setup for a specific purpose without creating conflicts between packages or package versions required by other applications. Although Python is the most common example, the same principle applies to other languages and toolchains.
You might wonder why this matters. On real systems, especially HPC clusters, you often need the same package in different versions because of dependency constraints. Without isolated environments, these requirements collide and break workflows.
On a cluster, the phrase “it worked yesterday” usually means one of the following happened:
- you loaded a different module
- you activated a different environment
- your batch job did not inherit the environment from your interactive shell
- you forgot which interpreter the job was actually using
Environments solve these problems by making your runtime explicit, reproducible, and isolated.
Always confirm what you are running:
which python
python --version
module list
echo $PATH
The main tools are:
- Conda environments
- Python virtual environments
- containers
Conda
To create an environment:
conda create -n myenv python=3.11
You will see something like
Retrieving notices: done
Channels:
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json):
You will be prompted to confirm the package installation; accept or decline as you wish, then let it finish.
Activate it:
conda activate myenv
Check what packages you have installed
conda list
Install even more packages:
conda install numpy pandas
pip install xarray
conda install -c conda-forge numpy scipy xarray
If you no longer need a package, remove it:
conda remove numpy
Deactivate
conda deactivate
You may end up in a situation where you have created several environments.
List them and find the one you want:
conda env list
Then activate it as before.
If you want to go even more geeky, you can create an environment from an environment.yml file.
First things first: what is an environment.yml file?
Simply put, this is a text file that describes:
- the environment name
- the channels to use
- the exact packages and versions
- optional pip packages
- optional variables
It is the most reproducible approach you can think of: the file saves the information, so even if you have removed your environment you can recreate the same one today, tomorrow, and next year.
It also allows your collaborators and supervisors to create the same environment. You get the idea.
Let's make the file:
nano environment.yml
Then you can type in the information:
name: myenv
channels:
- conda-forge
- defaults
dependencies:
- python=3.10
- numpy=1.26
- pandas=2.1
- xarray
- matplotlib
- pip
- pip:
- rasterio
- geopandas
Press CTRL+O and Enter to save, then CTRL+X to exit.
Explanation:
name: → the environment name
channels: → where Conda should search for packages
dependencies: → Conda packages
pip: → pip-only packages
Once you have an environment.yml file:
ls
environment.yml
Create the env:
conda env create -f environment.yml
If the environment already exists and you want to update it
conda env update -f environment.yml --prune
You can also export an environment.yml from an existing environment. This is a common thing, especially if you were testing your workflow on your local PC and now want to scale it up.
conda activate myenv
conda env export > environment.yml
This captures:
- exact versions
- build numbers
- channels
- pip packages
The only downside is that the exported file often contains too many hashes, local paths, and dependencies. Read the file and use your expert knowledge to trim it.
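One way to get a leaner file (a suggestion, assuming a reasonably recent Conda) is to export only the packages you explicitly asked for:
conda env export --from-history > environment.yml   # skips build strings and implicit dependencies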
🐥 The Sweet Duck says: "You can quench your thirst here."
Remove it completely when you are done:
conda remove -n myenv --all
Python Virtual Environments (venv)
I hear you say: you told us we can use plain Python, so why are you not telling us more? And you are right. You can use a plain Python environment.
venv is the built-in Python tool for creating isolated environments. It is lightweight, fast, and ideal when:
- you want minimal overhead
- you only need Python packages (no external libraries)
- you want a reproducible environment tied to a specific Python version
Each environment contains:
- its own Python interpreter
- its own site-packages directory
- its own pip installation
Create an environment in a directory
python3 -m venv .venv
This creates:
.venv/bin/ # Unix/MacOS executables
.venv/Scripts/ # Windows executables
.venv/lib/ # Python libraries
Don't forget that you can create it in your current directory or give a full path such as /home/user/project_env.
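For example (the path is just a placeholder; use whatever location suits your project):
python3 -m venv /home/user/project_env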
Activate
source .venv/bin/activate
Deactivate
deactivate
Install packages
pip install numpy pandas
or a specific version
pip install requests==2.31.0
You can also upgrade a package:
pip install --upgrade requests
and remove it
pip uninstall requests
Of course, you will want to list installed packages before anything:
pip list
Display details of a specific package:
pip show numpy
before deciding whether you want to install, update, and so on. You get the idea :)
As we did above with Conda environments, you can export and reproduce environments.
Save the list of installed packages to requirements.txt:
pip freeze > requirements.txt
Then recreate the environment from requirements.txt:
pip install -r requirements.txt
If you are done with the environment, purge it by simply deleting the directory:
rm -rf .venv
🐥 The Sweet Duck says: "Remember to quench your thirst here."
SLURM
A good place to start is to inspect the cluster you are actually using.
scontrol show node
Example summary
Partitions: cpu, gpu, long
Nodes: cpu001-cpu008, gpu001-gpu002
CPUs/node: inspect with `scontrol show node`
MaxArray: inspect your local limits
If you want to know node states:
sinfo -o '%N %t %C %m'
Tip:
mix = partially allocated. alloc = zero idle CPUs. Submit to a mix or idle node; SLURM handles placement.
You can also see how many CPUs are occupied and how many are free:
sinfo -N -o "%N %t %c %C %m"
If you want to narrow down to a specific node:
scontrol show node <nodename>
scontrol show node cpu003
The Most Important Distinction: Account ≠ Partition
There is a high chance that you are in different accounts with different wall-time limits:
| Account | Max wall time | How to use |
|---|---|---|
| default | site-specific | Applied automatically if you do not specify |
| project123 | site-specific | Must explicitly add --account=project123 |
Always confirm the right account for your project with your administrators or sacctmgr.
Check your own account associations:
sacctmgr show association user=$USER format=account,partition,qos,maxwall
If your job sits in PENDING with reason AssocMaxWallDurationPerJobLimit, your requested --time exceeds your account limit. The fix is often to use the correct account:
#SBATCH --partition=cpu
#SBATCH --account=project123
#SBATCH --time=2-00:00:00
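To see why a job is still pending (a quick check; the format string is just one reasonable choice):
squeue -u $USER -o "%.10i %.9P %.20j %.8T %.25R"   # last column shows the nodelist (running) or the reason (pending)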
Submit, Inspect, Cancel
sbatch run_job.sh # submit
squeue -u $USER # what is alive right now
sacct -j 12345 -X # what actually happened (includes finished/failed)
scancel 12345 # cancel one job
scancel 12345 12346 # cancel multiple at once
squeue only shows running/pending jobs. Once a job ends it disappears. Use sacct to see completed, failed, timed-out jobs — always check this before concluding a job “just vanished”:
sacct -u $USER --starttime=now-24hours \
--format=JobID,JobName,State,ExitCode,Elapsed -X
Common states in sacct:
| State | Meaning | What to do |
|---|---|---|
| COMPLETED | Exited 0 | Check results |
| FAILED | Non-zero exit | Read the .err log |
| TIMEOUT | Hit wall time | Increase --time or split workload |
| CANCELLED | Manually killed | Normal |
| NODE_FAIL | Hardware fault | Resubmit |
| PENDING reason AssocMaxWallDurationPerJobLimit | --time > account max | Add the correct --account |
Required header for every job
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=logs/%x_%j.log # %x = job name, %j = job ID
#SBATCH --error=logs/%x_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=06:00:00
#SBATCH --partition=cpu
#SBATCH --account=project123
set -euo pipefail # fail fast — don't silently continue on errors
mkdir -p logs # ensure logs directory exists
cd /path/to/project || exit 1
###############################################
# Activate your environment (choose ONE)
###############################################
### Option A: Activate a Conda environment
# module load anaconda3
# source ~/miniconda3/etc/profile.d/conda.sh
# conda activate project-env # replace with your environment name
### Option B: Activate a Python venv
# source .venv/bin/activate # replace with your venv path
###############################################
# Set the path to your R or Python script
###############################################
# For Python:
# PY_SCRIPT="/path/to/project/scripts/run_model.py"
# For R:
# R_SCRIPT="/path/to/project/scripts/run_analysis.R"
If you want to run Python with Conda, uncomment:
source ~/miniconda3/etc/profile.d/conda.sh
conda activate project-env
PY_SCRIPT="/path/to/project/scripts/run_model.py"
at the bottom of the script
python "$PY_SCRIPT"
If you want to run Python with venv
source .venv/bin/activate
PY_SCRIPT="/path/to/project/scripts/run_model.py"
then
python "$PY_SCRIPT"
The same goes for R; what changes is the final line:
Rscript "$R_SCRIPT"
🍃 Tip: You can use the Sbatch Builder to help you put together a usable header quickly.
CPUs: What You Ask For vs What Gets Used
#SBATCH --cpus-per-task=8 # reserves 8 cores on the node for this task
#SBATCH --ntasks=1 # 1 task (process)
- --cpus-per-task controls how many cores SLURM reserves; it directly affects node placement.
- R and Python are single-threaded by default. You are paying for cores you are not using.
- But the number you request controls how many tasks land on each node. This is critical for job arrays.
Rule of thumb: tune --cpus-per-task to match both your software behavior and your site policy.
| cpus-per-task | Max tasks per node | Spread behaviour |
|---|---|---|
| 1 | site-specific | Often packs many tasks onto one node |
| 4 | site-specific | Moderate spread |
| 8 | site-specific | Wider spread |
| full node | site-specific | Exclusive or near-exclusive placement |
If you have any shared I/O between tasks, request 8+ CPUs per task to force SLURM to spread tasks across nodes and avoid resource contention — even if your process only uses 1 core.
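If your code actually is multi-threaded, also tell it how many cores it was given (a common pattern; adjust the variable names to your libraries):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK        # OpenMP and many numeric libraries
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK   # BLAS-backed NumPy builds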
Job arrays are the right way to parallelise
Use arrays when you have N independent jobs that differ only by an index (different starting points, folds, seeds, etc.).
#SBATCH --array=1-8 # runs tasks 1, 2, 3 ... 8
#SBATCH --array=1-8%4 # runs at most 4 simultaneously
#SBATCH --array=1-3 # 3 tasks (e.g. for CRS2-LM global)
Inside the script, the current task index is $SLURM_ARRAY_TASK_ID:
export START_IDX=$SLURM_ARRAY_TASK_ID
export TASK_RUNS_DIR="runs/algo_s${SLURM_ARRAY_TASK_ID}"
mkdir -p "$TASK_RUNS_DIR"
Rscript --vanilla scripts/my_calibration.R
Log naming for arrays — use both job array ID (%A) and task ID (%a):
#SBATCH --output=logs/myjob_%A_%a.log
#SBATCH --error=logs/myjob_%A_%a.err
This gives you logs/myjob_12345_1.log, logs/myjob_12345_2.log, etc. — one file per task.
Arrays are cleaner than N copied scripts and avoid the wall-time problem: instead of one job running N starts sequentially (hitting the wall-time limit after 2), you run N jobs in parallel — each start on its own node, each using only the time it actually needs.
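A common pattern (a sketch; params.txt is a hypothetical file with one parameter set per line) is to map the task index onto a line of a parameter file:
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)   # pick line N for task N
echo "Task $SLURM_ARRAY_TASK_ID runs with: $PARAMS"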
Excluding Problematic Nodes
If specific nodes are overloaded or have known issues, exclude them explicitly:
#SBATCH --exclude=cpu003,cpu008
Check which nodes are fully allocated before submitting:
sinfo -o '%N %t %C'
# Look for 'alloc' state = zero idle CPUs
Dependencies — Chaining Jobs
job1=$(sbatch --parsable preprocess.sh)
sbatch --dependency=afterok:${job1} analyze.sh
- afterok: only run if the previous job succeeded (exit 0)
- afterany: run regardless of outcome
- afternotok: only run if the previous job failed
Useful for: run calibration → then automatically run validation when calibration finishes.
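A minimal sketch of such a chain (script names are placeholders for your own):
calib=$(sbatch --parsable calibrate.sh)
valid=$(sbatch --parsable --dependency=afterok:${calib} validate.sh)
sbatch --dependency=afterany:${valid} cleanup.sh   # runs once validation ends, whatever its outcome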
Checking What Actually Ran
# Summary of last 24h
sacct -u $USER --starttime=now-24hours \
--format=JobID,JobName,Partition,Account,State,ExitCode,Elapsed -X
# Detailed resource usage for a specific job
sacct -j 12345 --format=JobID,CPUTime,MaxRSS,Elapsed,State
# All tasks of an array
sacct -j 12345 --format=JobID,State,ExitCode,Elapsed
Diagnosing Failures
| Symptom | Cause | Fix |
|---|---|---|
| Exit code 127 | Rscript / python not in PATH | Add module load R / module load miniconda3 |
| AssocMaxWallDurationPerJobLimit | --time exceeds account default | Add the correct --account |
| Job runs longer than expected then TIMEOUT | Wall time set too low | Increase --time within your account limits |
| All array tasks on one node, contention | --cpus-per-task=1 causes SLURM packing | Use --cpus-per-task=8 to force spreading |
| Apptainer file conflicts | Tasks sharing runs/ directory | Give each task runs/${ALGO}_s${SLURM_ARRAY_TASK_ID}/ |
| Objective = Inf | APSIM wrapper crash, params not passed | Check .err log; verify APSIM_EXE path exists |
| PENDING forever | Node unavailable or resource mismatch | Check squeue reason column; try excluding problematic nodes |
Minimal Working Template
#!/bin/bash
#SBATCH --job-name=my_array
#SBATCH --output=logs/%x_%A_%a.log
#SBATCH --error=logs/%x_%A_%a.err
#SBATCH --array=1-8
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=2-00:00:00
#SBATCH --partition=cpu
#SBATCH --account=project123
#SBATCH --exclude=cpu003,cpu008
set -euo pipefail
module load R
module load singularity 2>/dev/null || true
cd /path/to/project || exit 1
mkdir -p logs output runs/task_${SLURM_ARRAY_TASK_ID}
export APSIM_EXE="/path/to/project/apsim_wrapper.sh"
export TASK_RUNS_DIR="runs/task_${SLURM_ARRAY_TASK_ID}"
export START_IDX=$SLURM_ARRAY_TASK_ID
echo "Task $START_IDX Node=$SLURMD_NODENAME Start=$(date)"
Rscript --vanilla scripts/my_script.R
echo "Done $START_IDX $(date)"
Quick Reference
# Submit
sbatch run_job.sh
# Monitor
squeue -u $USER -o '%.10i %.14j %.8T %.10M %N'
sacct -u $USER --starttime=now-24hours --format=JobID,JobName,State,ExitCode,Elapsed -X
# Cancel
scancel 12345 # one job
scancel 12345 12346 12347 # several at once
# Check node availability
sinfo -o '%N %t %C' # CPUS: Allocated/Idle/Other/Total
# Check your account limits
sacctmgr show association user=$USER format=account,partition,qos,maxwall
Sbatch Builder
This tool generates a starter sbatch script in the browser.
It starts from the same minimal array-job template shown in the Slurm guide, so the defaults are already shaped like a working example rather than a blank form.
It does not submit jobs. It does not validate partition names. It does not know your cluster better than you do.
Treat it as a fast draft generator that still needs site-specific values from your own cluster.