PARADOX-III is a High Performance Computing (HPC) cluster at the Scientific Computing Laboratory of the Institute of Physics Belgrade, consisting of 41 compute nodes with 328 CPU cores in total, shared NFS storage, and a Gigabit Ethernet interconnect.
Each compute node has the following specifications:
Spec. | Value |
---|---|
CPU | Intel Xeon E5345 @2.33GHz |
number of CPU cores | 8 |
RAM | 16 GB |
Local disk size | 100 GB |
The nodes are connected via Gigabit Ethernet and have access to 15 TB of shared RAID 6 storage for /home directories, mounted via NFS.
The primary way to access the PARADOX-III cluster is via secure shell (SSH). Users on Linux and macOS already have a client accessible from the system terminal. On Windows there are several options available; the most popular is PuTTY, and we also recommend SmarTTY, which has integrated X11 forwarding and a file upload/download interface.
Depending on the network you are accessing from, there are two ways to connect to the cluster. If you are connecting from within the Institute of Physics Belgrade local network, the cluster head node is directly accessible at p3.ipb.ac.rs.
Example: connecting to PARADOX-III from the Institute local network:
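ssh username@p3.ipb.ac.rs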
However, if you are connecting from an outside network (over the Internet), you must first connect to the gateway machine at gw.ipb.ac.rs and select option #2 in the menu to reach the PARADOX-III cluster.
Example: connecting to PARADOX-III over the Internet:
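ssh username@gw.ipb.ac.rs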
The gateway will offer a menu with the following choices:
Welcome to PARADOX Cluster. This is the gateway machine for remote
PARADOX access.
Would you like to:
1. Connect to PARADOX-IV cluster
2. Connect to PARADOX-III cluster
3. Connect to PARADOX Hadoop
4. Continue working in gateway (this machine)
Please select one of the above (1-4):
After entering 2 and pressing Enter you will be logged into p3.ipb.ac.rs, the PARADOX-III head node. From there you will have access to the SLURM batch manager, which is the main system you will interact with to submit jobs and control your computations.
If you would like to use GUI applications, you can forward their interface to your machine by adding the -X flag to your ssh command, i.e. ssh username@gw.ipb.ac.rs -X if you are connecting from outside IPB, or ssh username@p3.ipb.ac.rs -X from the local IPB network.
The easiest way to copy a file to or from the PARADOX-III cluster is to use the scp command on Unix systems; on Windows, the previously mentioned SmarTTY has an interface for file transfer, and you can also try WinSCP.
The scp command functions like the copy (cp) command in that it takes two arguments: source and destination. Either the source or the destination, or both, can refer to a local file or a file on a remote machine. Local file paths are specified in the standard manner, while the syntax for remote paths is username@p3:path/on/the/remote/host. The path on the remote host can be absolute (i.e. if it starts with /) or relative, in which case it is relative to the user's home directory.
Example: transferring a file from your PC to the PARADOX-III user's home folder:
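scp my_data.dat username@p3.ipb.ac.rs:

Here my_data.dat is a placeholder file name; the empty path after the colon means the file is copied into your home directory on the cluster.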
Note: If you want to transfer files from a machine outside the IPB local network, you should execute the scp command from the p3 head node, like the following:
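scp username@remote.machine.example:path/to/file .

Here remote.machine.example is a placeholder for the address of your own machine.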
If the remote machine is not running an SSH server, you will need to transfer the files first to gw.ipb.ac.rs and then from there to p3.
After connecting to the gateway (gw.ipb.ac.rs) and choosing option #4 to stay on the gateway, you can change your password by issuing the passwd command.
To log out from p3.ipb.ac.rs or gw.ipb.ac.rs, press Ctrl+D in your terminal or type exit.
The two central parts of each HPC software stack are the batch scheduler and the resource manager. All the work that users need done is divided into jobs. The batch scheduler orders these jobs according to available resources so that a job can start executing as soon as sufficient resources are available. The resource manager then assigns the required resources for the specified amount of time and starts the execution of the job. If the job is interactive, the user is logged into the assigned resource and given direct control of the execution. Otherwise, the job runs in batch mode and can be controlled using the SLURM commands described in the Job management section.
The main unit of work that users submit to the cluster to execute is a job. It defines the computation that needs to be done and the resources it requires.
The following table gives a quick overview of the SLURM commands users should be familiar with. In the terminal you can always get more details on any of these commands by typing man <command_name>.
Command | Description |
---|---|
sbatch | submit a job |
salloc | allocate resources |
srun | run interactive job |
squeue | display all jobs in queue |
sinfo | show the state of currently defined partitions |
A batch job is the most commonly used type of job on an HPC system. It is defined in a text file called the job script. The script contains the information about which interpreter is used to execute it, a number of #SBATCH directives that specify the resources needed by the job, and the invocation of the application that performs the computation.
Depending on the parallelization capability of your computation, you can run serial, shared memory (OpenMP), or distributed memory (MPI) jobs. The following examples will cover each use case.
To prepare a serial job launch for an application called my_app, create a plain text file, e.g. serial_job.sh, with the following contents:
#!/bin/bash
#SBATCH --job-name=serial-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100MB
#SBATCH --time=01:00:00
#SBATCH --output=run-%j.out
#SBATCH --error=run-%j.err
./my_app
To launch this job, you can use the sbatch command:
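sbatch serial_job.sh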
The lines that begin with #SBATCH are interpreted as arguments to the sbatch command (the same options are also accepted by salloc) and define the allocation parameters that the job requires. The following table gives a short overview of these parameters; more details can be found on the sbatch man page.
Parameter | Description |
---|---|
--job-name | Gives a readable name to the job. |
--nodes | Number of compute nodes to allocate. |
--ntasks-per-node | Number of parallel processes allowed to run per allocated node. |
--cpus-per-task | Number of CPU cores to allocate per task. |
--mem-per-cpu | Amount of RAM to dedicate to each allocated CPU core. PARADOX-III compute nodes have 2 GB of RAM per CPU core, so if you allocate more than that, fewer than 8 such processes will fit on a node. |
--time | Wall time limit on the running time of the job. |
--output | Name pattern for the file that will hold standard output of the run. |
--error | Name pattern for the file that will hold standard error of the run. |
Shared memory parallel applications run on one machine using multiple CPU cores. At the low level this can be achieved using the operating system's threads, e.g. the pthreads library. All the execution threads see the same area of system RAM and can communicate by directly accessing and modifying the same memory.
A popular way to achieve this parallelism with a lower learning curve and less modification required for existing code is to use OpenMP. It relies on preprocessor directives to help the compiler semi-automatically generate parallel code from loops and other parallelizable parts of the code.
In the rest of this section we will use the following example (omp_example.c):
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    int nthreads, tid;

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */

    return 0;
}
On PARADOX-III we have several compiler toolchains which support OpenMP for C/C++ and Fortran. You can view the current list of toolchains using the module avail command; more on this can be found in section 5, on the programming environment. Different compilers require different flags to enable OpenMP; an overview is given in the following list:
- GCC (GNU compilers): -fopenmp
- Intel compilers: -openmp
- PGI compilers: -mp
For the purposes of this example we will use the foss toolchain, which contains the GCC compiler, so the corresponding steps to compile our example are:
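module load foss
gcc -fopenmp omp_example.c -o omp_example

The exact module name may include a version; check the output of module avail on the cluster.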
This results in an omp_example executable in our current directory; all that is left to prepare is the job script (saved here as ompex-job.sh) that defines our job submission:
#!/bin/bash
#SBATCH --job-name=openmp-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=100MB
#SBATCH --time=00:01:00
#SBATCH --output=run-%j.out
#SBATCH --error=run-%j.err
export OMP_NUM_THREADS=8
./omp_example
This job script is not very different from the general serial example, except that we have specified 8 CPUs per task and exported the environment variable OMP_NUM_THREADS, which OpenMP reads to set the number of threads for parallel execution sections.
After executing sbatch ompex-job.sh, our current directory should contain files beginning with run- and ending in .out and .err. These hold the standard output and standard error respectively, and the output should contain something similar to the following:
Hello World from thread = 0
Number of threads = 8
Hello World from thread = 7
Hello World from thread = 6
Hello World from thread = 5
Hello World from thread = 4
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 1
For larger scale parallelism, where multiple compute nodes can be used, the common approach is the message passing paradigm, i.e. MPI. With this approach the application is launched as multiple processes that communicate over the network, whether they run on the same node or on different nodes. The name comes from this paradigm in which processes exchange data, i.e. messages, to work together on a problem. For more details on MPI and parallel programming in general, we recommend the following resources: LLNL Introduction to Parallel Computation and Introduction to Parallel Computing, Grama et al.
On PARADOX-III several MPI library and compiler combinations are supported. They can be loaded as part of a toolchain, which ensures that all the loaded libraries match the chosen compiler. In this example, we will use the following code in the hello-mpi.c file:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int num_procs, my_id;
    int len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    /* Find out the process ID and how many processes were started. */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Get_processor_name(name, &len);

    printf("Hello, world. I'm process %d of %d on %s\n", my_id,
           num_procs, name);

    MPI_Finalize();

    return 0;
}
To compile the example we will use the foss toolchain, which contains the GCC compiler and a matching Open MPI library:
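module load foss
mpicc hello-mpi.c -o hello-mpi

Here mpicc is the compiler wrapper provided by the Open MPI library in the foss toolchain; as before, the exact module name may include a version.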
To define the job for submission of this hello-mpi application, we can use the hellompi.sh
job script with the following contents:
#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100MB
#SBATCH --time=00:01:00
#SBATCH --output=run-%j.out
#SBATCH --error=run-%j.err
module load foss
chmod +x ./hello-mpi
srun ./hello-mpi
The #SBATCH directives that control the number of nodes and tasks per node, --nodes and --ntasks-per-node, are used by SLURM to automatically pass the number of processes and their topology to the underlying mpirun call. This is wrapped in the srun call on the last line, so that the user does not have to manually specify the dimensions of the MPI launch or repeat these values. Of course, mpirun can be called directly: instead of using srun, make the last line mpirun -np 16 -npernode 8 ./hello-mpi.
To submit the job, we again call the sbatch command:
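sbatch hellompi.sh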
This will print out the job id and after the job finishes, run-{JOBID}.out and run-{JOBID}.err files will be present in the current directory. If all went well, the .err file will be empty, and the .out file will hold the standard output data, similar to the following:
Hello, world. I'm process 4 of 16 on p3w00
Hello, world. I'm process 7 of 16 on p3w00
Hello, world. I'm process 3 of 16 on p3w00
Hello, world. I'm process 0 of 16 on p3w00
Hello, world. I'm process 2 of 16 on p3w00
Hello, world. I'm process 6 of 16 on p3w00
Hello, world. I'm process 1 of 16 on p3w00
Hello, world. I'm process 5 of 16 on p3w00
Hello, world. I'm process 12 of 16 on p3w01
Hello, world. I'm process 8 of 16 on p3w01
Hello, world. I'm process 10 of 16 on p3w01
Hello, world. I'm process 11 of 16 on p3w01
Hello, world. I'm process 13 of 16 on p3w01
Hello, world. I'm process 14 of 16 on p3w01
Hello, world. I'm process 15 of 16 on p3w01
Hello, world. I'm process 9 of 16 on p3w01
An interactive job provides users with direct access to compute nodes. The launch of an interactive job includes two steps:
1. resource allocation using the salloc command, which takes the same arguments as specified in the #SBATCH lines of a job script;
2. launching a shell on that allocation using srun.

The example procedure to allocate one whole node looks like this:
user@p3:~$ salloc --job-name='interactive job' --cpus-per-task=8 --mem-per-cpu=1G --time=24:00:00
salloc: Granted job allocation 24
user@p3:~$ srun --pty bash
user@p3w12:~$ hostname
p3w12
user@p3w12:~$ exit
exit
user@p3:~$ exit
exit
salloc: Relinquishing job allocation 24
user@p3:~$
The entire bash prompt is included in the log of this session to distinguish which command is being executed on which machine. As you can see, the allocation is created using salloc on p3 (the head node), and after srun launches the bash shell, the shell is executing on the worker node (in this case p3w12). The hostname command is called to demonstrate that the session is indeed on the worker node.
The first exit command terminates the bash session on the worker node, and the second exit terminates the shell started by the allocation, thus ending the interactive job.
Job arrays can be used when a number of similar runs needs to be submitted. The maximum number of jobs in an array is subject to partition and account limits. When an array of jobs is submitted, all of its jobs share the same array job ID, while each individual job also gets its own job ID and array task index.
Only batch jobs can be submitted as arrays, and they must have the same values for the options specified in the #SBATCH lines, i.e. job size, time limit, etc. The following example job script will print out the job ID and the job array index from every job in the array.
#!/bin/bash
#SBATCH --job-name=array-test
#SBATCH --time=00:00:30
echo "Job id: ${SLURM_JOB_ID}, array id: ${SLURM_ARRAY_TASK_ID}, task count: ${SLURM_ARRAY_TASK_COUNT}"
To launch the job, we specify the --array or -a parameter to the sbatch command. The value of the parameter specifies the range of job indices to create. In the example below, the specified range launches a job array with 11 jobs, with indices from 0 to 10.
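Assuming the script above is saved as array_job.sh (the file name here is only an example):

sbatch --array=0-10 array_job.sh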
The job submission above will yield the output similar to the following:
Job id: 37, array id: 0, task count: 11
Job id: 36, array id: 10, task count: 11
Job id: 38, array id: 1, task count: 11
Job id: 39, array id: 2, task count: 11
Job id: 40, array id: 3, task count: 11
Job id: 41, array id: 4, task count: 11
Job id: 42, array id: 5, task count: 11
Job id: 43, array id: 6, task count: 11
Job id: 44, array id: 7, task count: 11
Job id: 45, array id: 8, task count: 11
Job id: 46, array id: 9, task count: 11
More on job arrays in SLURM can be found on the official SLURM documentation page.
A high-level overview of running and pending jobs can be displayed using the squeue command. The syntax is the following:
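squeue [OPTIONS]

For example, squeue --user=$USER shows only your own jobs.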
Common command line options are:
Option | Description |
---|---|
--user=<USER_IDS> | Show data for a single user or a comma-separated list of users. |
--job=<JOB_IDS> | Show data for a specific job ID or a comma-separated list of job IDs. |
--partition=<PARTITIONS> | Show data for all jobs running on a single partition or a comma-separated list of partitions. |
--state=<STATES> | Show data on jobs that are in one of the job states given as a comma-separated list. |
Job states can have the following values:

- PD: Pending
- R: Running
- S: Suspended
- CA: Cancelled
- CG: Completing
- CD: Completed
- F: Failed

For more information on this command and available parameters, please consult the man page.
This command shows status information on partitions (i.e. queues, in PBS parlance). Its output looks similar to the following:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
standard* up infinite 5 down* p3w[03-04,11,13-14]
standard* up infinite 36 idle p3w[00-02,05-10,12,15-40]
The scancel command is used to stop a submitted job. It takes the job ID as its only parameter.
Also, this command can send a signal (such as SIGTERM or SIGKILL) to the job with the -s parameter:
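scancel -s SIGTERM 12345

Here 12345 is a placeholder job ID; without the -s option, scancel 12345 simply cancels the job.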
The scontrol command can be used to modify attributes of a submitted job, such as partition, job size, limits, etc.
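For example, to change the time limit of a job (again using a placeholder job ID):

scontrol update JobId=12345 TimeLimit=02:00:00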
This command can also be used to hold or release a job:
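scontrol hold 12345
scontrol release 12345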
More on the scontrol command can be found in the man page. Note that some operations with scontrol require super user access and cannot be executed by regular users.
This command displays more information on jobs and job steps. Use the --format option for a customized view. The output looks similar to the following:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
131 test standard devops 2 RUNNING 0:0
132 mpi-e standard devops 32 FAILED 1:0
Multiple programming environments are supported on PARADOX-III. This means users can choose between available compiler stacks, whether they will use GNU, Intel or PGI compilers, or between versions of various libraries. This flexibility is provided by the module command, which enables users to easily control their working environment.
The PARADOX cluster uses Lmod modules to set up user environment variables, making it easy to switch between and use various compiler toolchains, libraries, scientific applications, etc.
A toolchain is a collection of a compiler, an MPI library, and a number of numerical libraries, all optimized and built with the selected compiler. This simplifies the choice between various versions of libraries.
Available modules can be listed with the following command:
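module avail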
When a toolchain or compiler module is loaded, the modules for all libraries and tools built with that toolchain/compiler become visible in the available modules list.
The list of currently loaded modules can be brought up by typing:
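module list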
Each module can be loaded by executing:
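module load <module_name>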
Specific modules can be unloaded by calling:
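module unload <module_name>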
All modules can be unloaded by executing:
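module purge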
A short description of a module, known as “whatis”, can be displayed with:
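module whatis <module_name>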
To list all possible modules:
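module spider

(module spider is the Lmod command that lists every module, including those hidden by the toolchain hierarchy.)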
More detailed info about a particular module:
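module spider <module_name>

(module show <module_name> can also be used to display exactly which environment changes a module makes.)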