Create a Parallel Socket Cluster using Slurm — makeSlurmCluster • slurmR

This function is essentially a wrapper of the function parallel::makePSOCKcluster. The main feature of makeSlurmCluster is adding the node addresses.

makeSlurmCluster(
  n,
  job_name = random_job_name(),
  tmp_path = opts_slurmR$get_tmp_path(),
  cluster_opt = list(),
  max_wait = 300L,
  verb = TRUE,
  ...
)

# S3 method for slurm_cluster
stopCluster(cl)

Arguments

n: Integer scalar. Size of the cluster object (see details).
job_name: Character. Name of the job to be passed to Slurm.
tmp_path: Character. Path to the directory where all the data (including scripts) will be stored. Note that this path must be accessible by all the nodes in the network (see opts_slurmR).
cluster_opt: A list of arguments passed to parallel::makePSOCKcluster.
max_wait: Integer scalar. Wait time before exiting with error while trying to read the nodes information.
verb: Logical scalar. If TRUE, the function will print messages on screen reporting on the status of the job submission.
...: Further arguments passed to Slurm_EvalQ via sbatch_opt.
cl: An object of class slurm_cluster.

Value

An object of class c("slurm_cluster", "SOCKcluster", "cluster"). It is the same as what is returned by parallel::makePSOCKcluster with the main difference that it has two extra attributes:

SLURM_JOBID, which is the ID of the job that initialized the cluster.

Details

By default, if the time option is not specified via ..., it is set to the value 01:00:00, that is, 1 hour.

Once a job is submitted via Slurm, the user gets access to the nodes associated with it, which allows users to start new processes within those nodes. By means of this, we can create Socket, also known as "PSOCK", clusters across nodes in a Slurm environment. The names of the hosts are retrieved and later passed to parallel::makePSOCKcluster.
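The idea of passing host names to parallel::makePSOCKcluster can be sketched as follows. This is only an illustration under stated assumptions, not necessarily how slurmR does it internally: the helper spread_workers is hypothetical, and the part that queries Slurm (via scontrol show hostnames and the SLURM_JOB_NODELIST environment variable) is wrapped in if (FALSE) because it only works inside a running Slurm job.

```r
# Hypothetical helper (not part of slurmR): assign n workers to the
# allocated nodes in round-robin order.
spread_workers <- function(nodes, n) {
  stopifnot(length(nodes) > 0, n >= 1)
  rep_len(nodes, n)
}

if (FALSE) {
  # Inside a Slurm job: expand the compact node list into host names.
  nodes <- system2(
    "scontrol",
    c("show", "hostnames", Sys.getenv("SLURM_JOB_NODELIST")),
    stdout = TRUE
  )

  # One PSOCK worker is started per host name in the vector.
  cl <- parallel::makePSOCKcluster(spread_workers(nodes, 4))
  parallel::stopCluster(cl)
}
```

A host name that appears more than once in the vector simply gets more than one worker.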

It has been the case that R fails to create the cluster with the following message in the Slurm log file:

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive

In such cases, setting the memory upfront can solve the problem. For example:

cl <- makeSlurmCluster(20, mem = 20)

If the problem persists, i.e., the cluster still cannot be created, make sure that your Slurm cluster allows socket connections between nodes.

The method stopCluster for slurm_cluster stops the cluster by doing the following:

Closes the connection by calling the stopCluster method for PSOCK objects.
Cancels the Slurm job using scancel.
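The two steps above can be sketched as a plain function. This is a rough illustration, not the package's actual method: stop_slurm_cluster and its cancel argument are hypothetical names, and the cancel step is made injectable so that the sketch can be tried on a machine without Slurm.

```r
library(parallel)

# Hypothetical stand-in for the two steps described above.
stop_slurm_cluster <- function(
  cl,
  cancel = function(id) system2("scancel", as.character(id))
) {
  job_id <- attr(cl, "SLURM_JOBID")

  # Step 1: drop the extra class so that the regular parallel method
  # is dispatched and closes the socket connections.
  class(cl) <- setdiff(class(cl), "slurm_cluster")
  parallel::stopCluster(cl)

  # Step 2: cancel the Slurm job that hosts the workers.
  cancel(job_id)

  invisible(job_id)
}
```

Closing the connections first lets the workers shut down cleanly before the job that hosts them is cancelled.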

Maximum number of connections

By default, R limits the number of simultaneous connections (see this thread in R-sig-hpc: https://stat.ethz.ch/pipermail/r-sig-hpc/2012-May/001373.html). The current maximum is 128 (R version 3.6.1). To modify that limit, you would need to rebuild and reinstall R after updating the macro NCONNECTIONS in the file src/main/connections.c.

For now, if the user sets n above 128, they will get an immediate warning pointing to this issue, specifically stating that the cluster object may not be created.
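A check of this kind could look like the following minimal sketch. The function name warn_if_too_many and the wording of the message are assumptions made for illustration; only the threshold of 128 comes from the text above.

```r
# Hypothetical check (not the package's code): warn when the requested
# cluster size exceeds R's default limit on simultaneous connections.
warn_if_too_many <- function(n, max_connections = 128L) {
  if (n > max_connections) {
    warning(
      "n = ", n, " exceeds the connection limit (", max_connections,
      "); the cluster object may not be created.",
      call. = FALSE
    )
  }
  # Invisibly report whether n is within the limit.
  invisible(n <= max_connections)
}
```

Calling warn_if_too_many(200) raises the warning, while warn_if_too_many(100) returns silently.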

Examples

if (FALSE) {

# Creating a cluster with 100 workers/offspring/child R sessions
cl <- makeSlurmCluster(100)

# Computing the mean of 100 random uniforms within each worker;
# for this we can use any of the functions available in the parallel package.
ans <- parSapply(cl, 1:200, function(x) mean(runif(100)))

# We simply call stopCluster as we would do with any other cluster
# object
stopCluster(cl)

# We can also specify SBATCH options directly (...)
cl <- makeSlurmCluster(200, partition = "thomas", time = "02:00:00")
stopCluster(cl)

}