vignettes/working-with-slurm.Rmd
High-performance computing (HPC) clusters are now commonly available, both in and out of cloud settings. The Slurm Workload Manager (formerly Simple Linux Utility for Resource Management) is a program written in C that is used to efficiently manage resources in HPC clusters. The slurmR R package provides tools for using R in HPC settings that work with Slurm. It provides wrappers and functions that let users integrate their analysis pipeline with HPC clusters, with an emphasis on a family of functions similar to those in the parallel R package.
We start with some points about using Slurm with R that most users will find useful. Most of them concern options available in Slurm and, in particular, in the sbatch command, which is used to submit batch jobs to Slurm. Users who have used Slurm before may wish to skip ahead to the following section.
Node: A single computer in the HPC cluster. Jobs are often submitted to a single node. The simplest way of using R with Slurm is to submit a single job and request multiple CPUs to use with, for example, parallel::parLapply or parallel::mclapply. Users usually do not need to request a specific number of nodes, as Slurm will allocate resources as needed.
A common mistake of R users is to specify the number of nodes and expect that their script will be parallelized. This won’t happen unless the user explicitly writes a parallel computing script.
The relevant flag for sbatch is --nodes.
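As a minimal sketch of the single-node, multi-CPU pattern described above, the batch script below requests one node and four CPUs for a single task. The script name my-parallel-script.R and the CPU count are hypothetical, and the guard around Rscript is only there so the sketch can be tried on a machine without R.

```shell
#!/bin/bash
#SBATCH --nodes=1            # keep the whole job on a single node
#SBATCH --ntasks=1           # one task (one R process)...
#SBATCH --cpus-per-task=4    # ...with 4 CPUs for parLapply/mclapply workers

# Hypothetical R script; it must create its own workers, e.g. via
# parallel::mclapply(..., mc.cores = 4). Requesting CPUs alone does
# not parallelize anything.
r_script="my-parallel-script.R"

# Only launch R when both Rscript and the script are available.
if command -v Rscript >/dev/null 2>&1 && [ -f "$r_script" ]; then
  Rscript --vanilla "$r_script"
else
  echo "Skipping: Rscript or $r_script not found"
fi
```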
Partition: A group of nodes in the HPC cluster. Large clusters generally have multiple partitions, meaning that nodes may be grouped in various ways. For example, nodes belonging to a single group of users may be in one partition, while nodes dedicated to working with large data may be in another. Partitions are usually associated with account privileges, so users may need to specify which account they are using when telling Slurm which partition they plan to use.
The relevant flag for sbatch is --partition.
Account: Accounts may be associated with partitions. Accounts can have privileges to use a partition or set of nodes. Often, users need to specify the account when submitting jobs to a particular partition.
The relevant flag for sbatch is --account.
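For illustration, a batch script header that names both a partition and an account could look like the sketch below. The partition name bigmem and the account name my_lab are hypothetical placeholders; the names actually available depend on how each cluster is configured.

```shell
#!/bin/bash
#SBATCH --partition=bigmem   # hypothetical partition name
#SBATCH --account=my_lab     # hypothetical account with access to it

# Inside a running job, Slurm exports the partition the job landed on.
# Outside Slurm the variable is unset, so report a placeholder instead.
partition="${SLURM_JOB_PARTITION:-not-in-slurm}"
echo "Running in partition: ${partition}"
```

On the cluster itself, the sinfo command lists the partitions visible to the current user, which is a quick way to find valid values for --partition.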
Task: A step within a job. A particular job can have multiple tasks. Tasks may span multiple nodes, so if the user wants to submit a multicore job, this option may not be the right one.
The relevant flag for sbatch is --ntasks.
CPU: This generally refers to a core or a thread (which may differ on systems supporting multithreaded cores). Users may want to specify how many CPUs to use for a task. This is the relevant option when using tools like OpenMP or functions that create cluster objects in R (e.g. makePSOCKcluster, makeForkCluster).
The relevant option in sbatch is --cpus-per-task. More information regarding CPUs in Slurm can be found here. Information regarding how Slurm counts CPUs/cores/threads can be found here.
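One way to keep the requested CPU count and the number of workers in sync is to read it from the environment instead of hard-coding it. The sketch below assumes the job was submitted with --cpus-per-task, in which case Slurm sets SLURM_CPUS_PER_TASK; the fallback to 1 and the variable name ncpus are choices made for this example.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested;
# default to 1 so the script also runs outside of Slurm.
ncpus="${SLURM_CPUS_PER_TASK:-1}"

# OpenMP programs read the thread count from OMP_NUM_THREADS.
export OMP_NUM_THREADS="$ncpus"

echo "Using ${ncpus} CPU(s) for this task"
```

Within R, the same value can be read with Sys.getenv("SLURM_CPUS_PER_TASK"), converted to an integer, and passed to, for example, makePSOCKcluster.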
Job Array: Slurm supports job arrays. A job array is, in simple terms, a job that Slurm repeats multiple times; that is, it replicates a single job as many times as the user requests. In the case of R, when using this option a single R script is run as multiple jobs, so the user can take advantage of this to parallelize work across multiple nodes. Besides the fact that jobs within a job array may be spread across multiple nodes, each job in the array has a unique ID that is available to the user via environment variables, in particular SLURM_ARRAY_TASK_ID.
Within R, and hence within the R script submitted to Slurm, users can access this environment variable with Sys.getenv("SLURM_ARRAY_TASK_ID"). Some of the functionality of slurmR relies on job arrays.
More information on job arrays can be found here. The relevant option for this in sbatch is --array.
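To make the idea concrete, here is a minimal sketch of a job array in which each job uses its ID to pick its own input file. The file naming scheme (input-<id>.csv), the script name process-chunk.R, and the array range are hypothetical; the fallback to ID 1 only exists so the script can be tried outside of Slurm.

```shell
#!/bin/bash
#SBATCH --array=1-10           # ten copies of this script, with IDs 1 to 10
#SBATCH --output=chunk-%a.out  # %a expands to the array task ID

# Each job in the array sees a different SLURM_ARRAY_TASK_ID.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="input-${task_id}.csv"

echo "Job ${task_id} will process ${input_file}"

# Hypothetical R script that could read the same ID via
# Sys.getenv("SLURM_ARRAY_TASK_ID"):
# Rscript --vanilla process-chunk.R "${input_file}"
```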
More information about Slurm can be found on its official website here. A tutorial on how to use Slurm with R can be found here.
In general, users submit jobs to Slurm using the sbatch command-line tool. Its main argument is the name (path) of a bash script that holds the instructions (and sometimes options) associated with the program. Here is an example of a bash file to be submitted to Slurm:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name="A long job"
#SBATCH --mem=5GB
#SBATCH --output=long-job.out
cd /path/where/to/start/the/job
# This may vary per HPC system. At USC's hpc system
# we use: source /usr/usc/R/default/setup.sh
module load R
Rscript --vanilla long-job-rscript.R
This example bash file, which we name “long-job-rscript.slurm”, has the following components:
#!/bin/bash: The interpreter directive that is common to bash scripts. [1]
The #SBATCH lines specify options for scheduling the job. In order, these options set a maximum time of 1 hour, name the job "A long job", allocate 5GB of memory to the job, and write all the output (including Rscript's) to long-job.out.
The cd line changes the working directory to the place where the R script needs to be executed.
The module line loads R. There are various ways to do this, but users are commonly required to state that they will be using R.
Finally, Rscript executes the R script named long-job-rscript.R.
This batch script can be submitted to Slurm with the sbatch command-line tool by running sbatch long-job-rscript.slurm.
Overall, this is what happens under the hood in slurmR.
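When submitting from a script, it is often handy to keep the job ID that sbatch reports. The sketch below wraps the submission in a small helper, submit_job, which is a hypothetical name introduced here and not part of slurmR. It relies on sbatch's --parsable flag, which prints the job ID (optionally followed by a semicolon and the cluster name) instead of the usual message.

```shell
#!/bin/bash

# Submit a batch script and print only the job ID.
submit_job() {
  script="$1"
  if ! command -v sbatch >/dev/null 2>&1; then
    echo "sbatch not found; is Slurm available on this machine?" >&2
    return 1
  fi
  # --parsable output looks like "12345" or "12345;cluster".
  out="$(sbatch --parsable "$script")" || return 1
  # Drop the optional ";cluster" suffix.
  echo "${out%%;*}"
}

# Example (only runs where Slurm and the script are present):
if command -v sbatch >/dev/null 2>&1 && [ -f long-job-rscript.slurm ]; then
  job_id="$(submit_job long-job-rscript.slurm)"
  echo "Submitted job ${job_id}"
fi
```

The returned ID can then be passed to squeue -j to check on the job while it is queued or running.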
[1] For more on this, see this thread on StackExchange.