This function is essentially a wrapper around parallel::makePSOCKcluster. The main feature of makeSlurmCluster is adding node addresses.
makeSlurmCluster(
  n,
  job_name = random_job_name(),
  tmp_path = opts_slurmR$get_tmp_path(),
  cluster_opt = list(),
  max_wait = 300L,
  verb = TRUE,
  ...
)
# S3 method for slurm_cluster
stopCluster(cl)
n: Integer scalar. Size of the cluster object (see details).
job_name: Character. Name of the job to be passed to Slurm.
tmp_path: Character. Path to the directory where all the data (including scripts) will be stored. Note that this path must be accessible by all the nodes in the network (see opts_slurmR).
cluster_opt: A list of arguments passed to parallel::makePSOCKcluster.
max_wait: Integer scalar. Wait time before exiting with error while trying to read the nodes information.
verb: Logical scalar. If TRUE, the function will print messages on screen reporting on the status of the job submission.
...: Further arguments passed to Slurm_EvalQ via sbatch_opt.
cl: An object of class slurm_cluster.
An object of class c("slurm_cluster", "SOCKcluster", "cluster"). It is the same as what is returned by parallel::makePSOCKcluster, with the main difference that it has two extra attributes:
SLURM_JOBID: the ID of the job that initialized the cluster.
By default, if the time option is not specified via ..., it is set to 01:00:00, that is, 1 hour.
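As a rough illustration of this defaulting rule, the sketch below fills in `time` only when the caller did not supply it. The helper `with_default_time` and its `sbatch_opt` argument are hypothetical names used for illustration; this is not slurmR's actual code.

```r
# Hypothetical helper (not part of slurmR): add a default `time` entry
# to a list of sbatch options when the caller did not provide one.
with_default_time <- function(sbatch_opt = list(), default = "01:00:00") {
  if (is.null(sbatch_opt[["time"]])) {
    sbatch_opt[["time"]] <- default
  }
  sbatch_opt
}

# No `time` given: the one-hour default is added
opts_default <- with_default_time(list(partition = "general"))

# `time` given: the caller's value is kept
opts_custom <- with_default_time(list(time = "02:00:00"))
```

In practice this means a long-running cluster should pass `time` explicitly, as in the last call of the examples section.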
Once a job is submitted via Slurm, the user gets access to the nodes associated with it, which allows users to start new processes within those nodes. By means of this, we can create Socket clusters, also known as "PSOCK" clusters, across nodes in a Slurm environment. The names of the hosts are retrieved and later passed to parallel::makePSOCKcluster.
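To make the idea concrete, here is a minimal sketch of handing host names to parallel::makePSOCKcluster. The host names, the helper `assign_workers`, and the round-robin assignment are all assumptions for illustration; how slurmR actually retrieves and distributes hosts may differ.

```r
# Hypothetical helper: spread n workers over the allocated hosts, round-robin.
assign_workers <- function(hosts, n) {
  rep_len(hosts, n)
}

# Placeholder host names; in a real job these would come from Slurm
hosts <- c("node001", "node002")
worker_hosts <- assign_workers(hosts, 4L)

# The connection itself is not run here because the hosts are placeholders.
if (FALSE) {
  # makePSOCKcluster accepts a character vector of host names
  cl <- parallel::makePSOCKcluster(worker_hosts)
  parallel::stopCluster(cl)
}
```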
R has occasionally failed to create the cluster, reporting an error in the Slurm log file. In such cases, setting resources upfront, for example the memory, can solve the problem.
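A minimal sketch of requesting memory upfront, assuming that `mem` is forwarded to sbatch through `...` like other options. The value "4G" and the cluster size are illustrative only. The call is guarded so that it only runs where slurmR and sbatch are available.

```r
# Illustrative arguments; `mem = "4G"` is an assumed value, adjust to your site.
cluster_args <- list(n = 20L, mem = "4G")

# Only try to submit when both the package and the Slurm client are present
slurm_available <- requireNamespace("slurmR", quietly = TRUE) &&
  nzchar(Sys.which("sbatch"))

if (slurm_available) {
  cl <- do.call(slurmR::makeSlurmCluster, cluster_args)
  parallel::stopCluster(cl)
}
```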
If the problem persists, i.e., the cluster cannot be created, make sure that your Slurm cluster allows Socket connections between nodes.
The stopCluster method for slurm_cluster objects stops the cluster by doing the following:
1. Closes the connection by calling the stopCluster method for PSOCK objects.
2. Cancels the Slurm job using scancel.
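The two steps above can be sketched as follows. This is a hypothetical re-implementation for illustration, not slurmR's source: the function name and the `close_fn` and `cancel_fn` arguments are assumptions, introduced so that the logic can be tried without a running Slurm job.

```r
# Hypothetical sketch of the two-step shutdown (not slurmR's actual code).
stop_slurm_cluster_sketch <- function(
  cl,
  close_fn  = parallel::stopCluster,
  cancel_fn = function(id) system2("scancel", as.character(id))
) {
  job_id <- attr(cl, "SLURM_JOBID")

  # Step 1: drop the slurm_cluster class so the socket-cluster method is used
  class(cl) <- setdiff(class(cl), "slurm_cluster")
  close_fn(cl)

  # Step 2: cancel the Slurm job that hosts the workers
  cancel_fn(job_id)

  invisible(job_id)
}
```

Ordering matters in this sketch: the connections are closed first, so the workers shut down cleanly before the job holding them is cancelled.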
By default, R limits the number of simultaneous connections (see this thread in R-sig-hpc: https://stat.ethz.ch/pipermail/r-sig-hpc/2012-May/001373.html). The current maximum is 128 (as of R version 3.6.1). To modify that limit, you would need to reinstall R after updating the macro NCONNECTIONS in the file src/main/connections.c.
For now, if the user sets n above 128, the function issues an immediate warning pointing to this issue, specifically noting that the cluster object may not be created.
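A check of this kind might look like the sketch below. The helper `check_cluster_size` is hypothetical and the wording of the warning is invented; only the limit of 128 comes from the text above.

```r
# Hypothetical helper (not part of slurmR): warn when the requested number
# of workers exceeds R's connection limit as reported for R 3.6.1.
check_cluster_size <- function(n, max_connections = 128L) {
  if (n > max_connections) {
    warning(
      sprintf(
        "n = %d exceeds the connection limit of %d; the cluster may not be created.",
        as.integer(n), as.integer(max_connections)
      ),
      call. = FALSE
    )
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

ok_small <- check_cluster_size(100)
```

Calling such a helper before submitting the job lets the user split the work into smaller clusters instead of waiting for a submission that may fail.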
if (FALSE) {
  # Creating a cluster with 100 workers/offspring/child R sessions
  cl <- makeSlurmCluster(100)
  # Computing the mean of 100 random uniforms, 200 times, across the workers;
  # for this we can use any of the functions available in the parallel package.
  ans <- parSapply(cl, 1:200, function(x) mean(runif(100)))
  # We simply call stopCluster on the cluster object, as we would do with
  # any other cluster object
  stopCluster(cl)
  # We can also specify SBATCH options directly (...)
  cl <- makeSlurmCluster(200, partition = "thomas", time = "02:00:00")
  stopCluster(cl)
}