PARADOX-III is a High Performance Computing (HPC) cluster at the Scientific Computing Laboratory of the Institute of Physics Belgrade, consisting of 41 compute nodes with 328 CPU cores in total, shared NFS storage, and a Gigabit Ethernet interconnect.
Each compute node has the following specifications:
Spec. | Value |
---|---|
CPU | Intel Xeon E5345 @2.33GHz |
number of CPU cores | 8 |
RAM | 16 GB |
Local disk size | 100 GB |
The nodes are connected via Gigabit Ethernet and have access to 15 TB of shared RAID 6 storage for /home directories, mounted via NFS.
The primary way to access the PARADOX-III cluster is via secure shell (SSH). Users on Linux and macOS already have a client accessible from the system terminal. On Windows there are several options available; the most popular is PuTTY, and we also recommend SmarTTY, which has integrated X11 forwarding and a file upload/download interface.
Depending on the network you are accessing from, there are two ways to connect to the cluster. If you are connecting from within the Institute of Physics Belgrade local network, the cluster head node is directly accessible at p3.ipb.ac.rs.
Example: connecting to PARADOX-III from the Institute local network:
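ssh username@p3.ipb.ac.rs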
However, if you are connecting from an outside network (over the Internet), you must first connect to the gateway machine at gw.ipb.ac.rs and select option #2 in the menu to reach the PARADOX-III cluster.
Example: connecting to PARADOX-III over the Internet:
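ssh username@gw.ipb.ac.rs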
The gateway will offer a menu with the following choices:
Welcome to PARADOX Cluster. This is the gateway machine for remote
PARADOX access.
Would you like to:
1. Connect to PARADOX-IV cluster
2. Connect to PARADOX-III cluster
3. Connect to PARADOX Hadoop
4. Continue working in gateway (this machine)
Please select one of the above (1-4):
After entering 2 and pressing Enter you will be logged into p3.ipb.ac.rs, the PARADOX-III head node. From there you will have access to the SLURM batch manager, which is the main system you will interact with to submit jobs and control your computations.
If you would like to use GUI applications, you can forward their interface to your machine by adding the -X flag to your ssh command, i.e. ssh username@gw.ipb.ac.rs -X if you are connecting from outside IPB, or ssh username@p3.ipb.ac.rs -X from the local IPB network.
The easiest way to copy a file to or from the PARADOX-III cluster is to use the scp command on Unix systems; on Windows, the previously mentioned SmarTTY has an interface for file transfer, and you can also try WinSCP.
The scp command functions like the copy (cp) command in that it takes two arguments: source and destination. Either the source or the destination, or both, can refer to a local file or a file on a remote machine. Local file paths are specified in the standard manner, while the syntax for remote paths is username@p3:path/on/the/remote/host. The path on the remote host can be absolute (i.e. if it starts with /) or relative, in which case it is relative to the user's home directory.
Example: transferring a file from your PC to the PARADOX-III user's home folder:
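scp my_data.dat username@p3.ipb.ac.rs:

Here my_data.dat is a placeholder file name; the empty path after the colon means the file is copied into your home directory on the cluster.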
Note: If you want to transfer files from a machine outside the IPB local network, you should execute the scp command from the p3 head node, like the following:
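scp username@remote.machine.example:path/to/file .

Here remote.machine.example is a placeholder for the address of your own machine.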
If the remote machine is not running an SSH server, you will need to transfer the files first to gw.ipb.ac.rs and then from there to p3.
After connecting to the gateway (gw.ipb.ac.rs) and choosing option #4 to stay on the gateway, you can change your password by issuing the passwd command.
To log out from p3.ipb.ac.rs or gw.ipb.ac.rs, press Ctrl+D in your terminal or type exit.
The two central parts of each HPC software stack are the batch scheduler and the resource manager. All the work that users need done is divided into jobs. The batch scheduler orders these jobs according to available resources so that a job can start executing as soon as sufficient resources are available. The resource manager then assigns the required resources for the specified amount of time and starts the execution of the job. If the job is interactive, the user is logged into the assigned resource and given direct control of the execution. Otherwise, the job runs in batch mode and can be controlled using the SLURM commands described in the Job management section.
The main unit of work that users submit to the cluster to execute is a job. It defines the computation that needs to be done and the resources it requires.
The following table gives a quick overview of the SLURM commands users should be familiar with. In the terminal you can always get more details on any of these commands by typing man <command_name>.
Command | Description |
---|---|
sbatch | submit a job |
salloc | allocate resources |
srun | run interactive job |
squeue | display all jobs in queue |
sinfo | show the state of currently defined partitions |
A batch job is the most commonly used type of job on an HPC system. It is defined in a text file called the job script. The script contains the information about which interpreter is used to execute it, a number of #SBATCH directives that specify the resources needed by the job, and the invocation of the application that performs the computation.
Depending on the parallelization capability of your computation, you can run serial, shared memory (OpenMP), or distributed memory (MPI) jobs. The following examples will cover each use case.
To prepare a serial job launch for an application called my_app, create a plain text file, e.g. serial_job.sh, with the following contents:
#!/bin/bash
#SBATCH --job-name=serial-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100MB
#SBATCH --time=01:00:00
#SBATCH --output=run-%j.out
#SBATCH --error=run-%j.err
./my_app
To launch this job, you can use the sbatch command:
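sbatch serial_job.sh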
The lines that begin with #SBATCH are interpreted as arguments to the sbatch command (the same options are also accepted by salloc) and define the allocation parameters that the job requires. The following table gives a short overview of these parameters; more details can be found on the sbatch man page.
Parameter | Description |
---|---|
--job-name | Gives a readable name to the job. |
--nodes | Number of compute nodes to allocate. |
--ntasks-per-node | Number of parallel processes allowed to run per allocated node. |
--cpus-per-task | Number of CPU cores to allocate per task. |
--mem-per-cpu | Amount of RAM to dedicate to each allocated CPU core. PARADOX-III compute nodes have 2 GB of RAM per CPU core, so if you allocate more than that, fewer than 8 such processes will fit on a node. |
--time | Wall time limit on the running time of the job. |
--output | Name pattern for the file that will hold standard output of the run. |
--error | Name pattern for the file that will hold standard error of the run. |
Shared memory parallel applications run on one machine using multiple CPU cores. At the low level this can be achieved using the operating system's threads, e.g. the pthreads library. All the execution threads see the same area of system RAM and can communicate by directly accessing and modifying the same memory.
A popular way to achieve this parallelism with a lower learning curve and less modification required for existing code is to use OpenMP. It relies on preprocessor directives to help the compiler semi-automatically generate parallel code from loops and other parallelizable parts of the code.
In the rest of this section we will use the following example (omp_example.c):
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    int nthreads, tid;

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */

    return 0;
}
On PARADOX-III we have several compiler toolchains which support OpenMP for C/C++ and Fortran. You can view the current list of toolchains using the module avail command; more on this can be found in section 5, on the programming environment. Different compilers require different flags to enable OpenMP; an overview is given in the following list:
- GCC (GNU compilers): -fopenmp
- Intel compilers: -openmp
- PGI compilers: -mp
For the purposes of this example we will use the foss toolchain, which contains the GCC compiler, so the corresponding steps to compile our example are:
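module load foss
gcc -fopenmp omp_example.c -o omp_example

The exact module name may include a version; check the output of module avail on the cluster.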
This results in an omp_example executable in our current directory; all that is left to prepare is the job script (saved here as ompex-job.sh) that defines our job submission:
#!/bin/bash
#SBATCH --job-name=openmp-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=100MB
#SBATCH --time=00:01:00
#SBATCH --output=run-%j.out
#SBATCH --error=run-%j.err
export OMP_NUM_THREADS=8
./omp_example
This job script is not very different from the general serial example, except that we have specified 8 CPUs per task and exported the environment variable OMP_NUM_THREADS, which OpenMP reads to set the number of threads for parallel execution sections.
After executing sbatch ompex-job.sh, our current directory should contain files beginning with run- and ending in .out and .err. These hold the standard output and standard error respectively, and the output should contain something similar to the following:
Hello World from thread = 0
Number of threads = 8
Hello World from thread = 7
Hello World from thread = 6
Hello World from thread = 5
Hello World from thread = 4
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 1
For larger scale parallelism, where multiple compute nodes can be used, the common approach is the message passing paradigm, i.e. MPI. With this approach the application is launched as multiple processes that communicate over the network, whether they run on the same node or on different nodes. The name comes from this paradigm in which processes exchange data, i.e. messages, to work together on a problem. For more details on MPI and parallel programming in general, we recommend the following resources: LLNL Introduction to Parallel Computation and Introduction to Parallel Computing, Grama et al.
On PARADOX-III several MPI library and compiler combinations are supported. They can be loaded as part of a toolchain, which ensures that all the loaded libraries match the chosen compiler. In this example, we will use the following code in the hello-mpi.c file:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int num_procs, my_id;
    int len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    /* Find out the process ID and how many processes were started. */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Get_processor_name(name, &len);

    printf("Hello, world. I'm process %d of %d on %s\n", my_id,
           num_procs, name);

    MPI_Finalize();

    return 0;
}
To compile the example we will use the foss toolchain, which contains the GCC compiler and a matching Open MPI library:
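module load foss
mpicc hello-mpi.c -o hello-mpi

Here mpicc is the compiler wrapper provided by the Open MPI library in the foss toolchain; as before, the exact module name may include a version.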
To define the job for submission of this hello-mpi application, we can use the hellompi.sh
job script with the following contents:
#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100MB
#SBATCH --time=00:01:00
#SBATCH --output=run-%j.out
#SBATCH --error=run-%j.err
module load foss
chmod +x ./hello-mpi
srun ./hello-mpi
The #SBATCH directives that control the number of nodes and tasks per node, --nodes and --ntasks-per-node, are used by SLURM to automatically pass the number of processes and their topology to the underlying mpirun call. This is wrapped in the srun call on the last line, so that the user does not have to manually specify the dimensions of the MPI launch or repeat these values. Of course, mpirun can be called directly: instead of using srun, make the last line mpirun -np 16 -npernode 8 ./hello-mpi.
To submit the job, we again call the sbatch command:
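sbatch hellompi.sh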
This will print out the job id and after the job finishes, run-{JOBID}.out and run-{JOBID}.err files will be present in the current directory. If all went well, the .err file will be empty, and the .out file will hold the standard output data, similar to the following:
Hello, world. I'm process 4 of 16 on p3w00
Hello, world. I'm process 7 of 16 on p3w00
Hello, world. I'm process 3 of 16 on p3w00
Hello, world. I'm process 0 of 16 on p3w00
Hello, world. I'm process 2 of 16 on p3w00
Hello, world. I'm process 6 of 16 on p3w00
Hello, world. I'm process 1 of 16 on p3w00
Hello, world. I'm process 5 of 16 on p3w00
Hello, world. I'm process 12 of 16 on p3w01
Hello, world. I'm process 8 of 16 on p3w01
Hello, world. I'm process 10 of 16 on p3w01
Hello, world. I'm process 11 of 16 on p3w01
Hello, world. I'm process 13 of 16 on p3w01
Hello, world. I'm process 14 of 16 on p3w01
Hello, world. I'm process 15 of 16 on p3w01
Hello, world. I'm process 9 of 16 on p3w01
An interactive job provides users with direct access to compute nodes. The launch of an interactive job includes two steps:
1. resource allocation using the salloc command, which takes the same arguments as specified in the #SBATCH lines of a job script;
2. launching a shell on that allocation using srun.

The example procedure to allocate one whole node looks like this:
user@p3:~$ salloc --job-name='interactive job' --cpus-per-task=8 --mem-per-cpu=1G --time=24:00:00
salloc: Granted job allocation 24
user@p3:~$ srun --pty bash
user@p3w12:~$ hostname
p3w12
user@p3w12:~$ exit
exit
user@p3:~$ exit
exit
salloc: Relinquishing job allocation 24
user@p3:~$
The entire bash prompt is included in the log of this session to distinguish which command is being executed on which machine. As you can see, the allocation is created using salloc on p3 (the head node), and after srun launches the bash shell, the shell is executing on the worker node (in this case p3w12). The hostname command is called to demonstrate that the session is indeed on the worker node.
The first exit command terminates the bash session on the worker node, and the second exit terminates the shell started by the allocation, thus ending the interactive job.
Job arrays can be used when a number of similar runs needs to be submitted. The maximum number of jobs in an array is subject to partition and account limits. When an array of jobs is submitted, all of its jobs share the same array job ID, while each individual job also gets its own job ID and array task index.
Only batch jobs can be submitted as arrays, and they must have the same values for the options specified in the #SBATCH lines, i.e. job size, time limit, etc. The following example job script will print out the job ID and the job array index from every job in the array.
#!/bin/bash
#SBATCH --job-name=array-test
#SBATCH --time=00:00:30
echo "Job id: ${SLURM_JOB_ID}, array id: ${SLURM_ARRAY_TASK_ID}, task count: ${SLURM_ARRAY_TASK_COUNT}"
To launch the job, we specify the --array or -a parameter to the sbatch command. The value of the parameter specifies the range of job indices to create. In the example below, the specified range launches a job array with 11 jobs, with indices from 0 to 10.
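Assuming the script above is saved as array_job.sh (the file name here is only an example):

sbatch --array=0-10 array_job.sh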
The job submission above will yield the output similar to the following:
Job id: 37, array id: 0, task count: 11
Job id: 36, array id: 10, task count: 11
Job id: 38, array id: 1, task count: 11
Job id: 39, array id: 2, task count: 11
Job id: 40, array id: 3, task count: 11
Job id: 41, array id: 4, task count: 11
Job id: 42, array id: 5, task count: 11
Job id: 43, array id: 6, task count: 11
Job id: 44, array id: 7, task count: 11
Job id: 45, array id: 8, task count: 11
Job id: 46, array id: 9, task count: 11
More on job arrays in SLURM can be found on the official SLURM documentation page.
A high-level overview of running and pending jobs can be displayed using the squeue command. The syntax is the following:
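squeue [OPTIONS]

For example, squeue --user=$USER shows only your own jobs.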
Common command line options are:
Option | Description |
---|---|
--user=<USER_IDS> | Show data for a single user or a comma-separated list of users. |
--job=<JOB_IDS> | Show data for a specific job ID or a comma-separated list of job IDs. |
--partition=<PARTITIONS> | Show data for all jobs running on a single partition or a comma-separated list of partitions. |
--state=<STATES> | Show data on jobs that are in one of the job states given as a comma-separated list. |
Job states can have the following values:

- PD: Pending
- R: Running
- S: Suspended
- CA: Cancelled
- CG: Completing
- CD: Completed
- F: Failed

For more information on this command and available parameters, please consult the man page.
This command shows status information on partitions (i.e. queues, in PBS parlance). Its output looks similar to the following:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
standard* up infinite 5 down* p3w[03-04,11,13-14]
standard* up infinite 36 idle p3w[00-02,05-10,12,15-40]
The scancel command is used to stop a submitted job. It takes the job ID as its only parameter.
Also, this command can send a signal (such as SIGTERM or SIGKILL) to the job with the -s parameter:
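scancel -s SIGTERM 12345

Here 12345 is a placeholder job ID; without the -s option, scancel 12345 simply cancels the job.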
The scontrol command can be used to modify attributes of a submitted job, such as partition, job size, limits, etc.
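For example, to change the time limit of a job (again using a placeholder job ID):

scontrol update JobId=12345 TimeLimit=02:00:00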
This command can also be used to hold or release a job:
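scontrol hold 12345
scontrol release 12345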
More on the scontrol command can be found in the man page. Note that some operations with scontrol require super user access and cannot be executed by regular users.
This command displays more information on jobs and job steps. Use the --format option for a customized view. The output looks similar to the following:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
131 test standard devops 2 RUNNING 0:0
132 mpi-e standard devops 32 FAILED 1:0
Multiple programming environments are supported on PARADOX-III. This means users can choose between available compiler stacks, whether they will use GNU, Intel or PGI compilers, or between versions of various libraries. This flexibility is provided by the module command, which enables users to easily control their working environment.
The PARADOX cluster uses Lmod modules to set up user environment variables, making it easy to switch between and use various compiler toolchains, libraries, scientific applications, etc.
A toolchain is a collection of a compiler, an MPI library, and a number of numerical libraries, all optimized and built with the selected compiler. This simplifies the choice between various versions of libraries.
Available modules can be listed with the following command:
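module avail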
When a toolchain or compiler module is loaded, the modules for all libraries and tools built with that toolchain/compiler become visible in the available modules list.
The list of currently loaded modules can be brought up by typing:
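module list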
Each module can be loaded by executing:
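module load <module_name>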
Specific modules can be unloaded by calling:
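module unload <module_name>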
All modules can be unloaded by executing:
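module purge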
A short description of a module, known as “whatis”, can be displayed with:
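module whatis <module_name>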
To list all possible modules:
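module spider

(module spider is the Lmod command that lists every module, including those hidden by the toolchain hierarchy.)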
More detailed info about a particular module:
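module spider <module_name>

(module show <module_name> can also be used to display exactly which environment changes a module makes.)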