This is the same old example that lots of people (including me) have been using to illustrate parallel computing with R. The example is very simple: we want to approximate pi by doing some Monte Carlo simulations.
We know that the area of a circle is \(A = \pi r^2\), which is equivalent to saying \(\pi = A/r^2\); so, if we can approximate the area of a circle, then we can approximate \(\pi\). How do we do this?
Using Monte Carlo experiments: the probability that a random point \(x_i\), drawn uniformly from the square enclosing the circle, falls within the circle can be approximated using the following formula
\[ \hat p = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(x_i \in \mbox{Circle}) \]
This approximation, \(\hat p\), multiplied by the area of the circumscribed square, which equals \((2\times r)^2\), gives the area of the circle; thus, we can finally write
\[ \hat \pi = \hat p \times (2\times r)^2 / r^2 = 4 \hat p \]
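To make the formula concrete, here is a minimal interactive sketch (independent of the scripts below; the number of points is arbitrary):

set.seed(1)
p <- matrix(runif(2e5, -1, 1), ncol = 2) # 1e5 points uniform on the square [-1, 1]^2
mean(sqrt(rowSums(p^2)) <= 1) * 4        # share inside the unit circle, times 4, approximates pi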
The main way that we will be working is by submitting jobs using the sbatch function. This function takes as its main argument a bash file with the program to execute. In the case of R, a regular bash file looks something like this:
#!/bin/sh
#SBATCH --job-name=sapply
#SBATCH --time=00:10:00
module load usc r
Rscript --vanilla 01-sapply.R
This file has three components:
1. The Slurm flags (the lines starting with #SBATCH).
2. Loading the R module (module load usc r).
3. Executing the R script (Rscript --vanilla 01-sapply.R).
To submit a job to the queue, we need to enter the following:
sbatch 01-sapply.slurm
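Once submitted, Slurm assigns the job an ID and, by default, writes its output to slurm-<jobid>.out in the submission directory. You can keep an eye on the job from the shell, for example:

squeue -u $USER        # list your pending and running jobs
cat slurm-<jobid>.out  # inspect the output (replace <jobid> with the actual job ID)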
The following examples each consist of two files: a bash script and an R script to be called by Slurm.
The most basic way is submitting a job using the sbatch command. In this case you need to have two files: (1) an R script, and (2) a bash script, e.g.:
The contents of the R script (01-sapply.R) are:
# Model parameters
nsims <- 1e3
n     <- 1e4

# Function to simulate pi
simpi <- function(i) {
  p <- matrix(runif(n*2, -1, 1), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}

# Approximation
set.seed(12322)
ans <- sapply(1:nsims, simpi)
message("Pi: ", mean(ans))
saveRDS(ans, "01-sapply.rds")
The contents of the bash file (01-sapply.slurm) are:
#!/bin/sh
#SBATCH --job-name=sapply
#SBATCH --time=00:10:00
module load usc r
Rscript --vanilla 01-sapply.R
Now, imagine that we would like to use more than one processor for this job, using something like the parallel::mclapply
function from the parallel package. Then, besides adapting the code, we need to tell Slurm that we are using more than one core per task, as in the following example:
R script (02-mclapply.R):
# Model parameters
nsims  <- 1e3
n      <- 1e4
ncores <- 4L

# Function to simulate pi
simpi <- function(i) {
  p <- matrix(runif(n*2, -1, 1), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}

# Approximation
set.seed(12322)
ans <- parallel::mclapply(1:nsims, simpi, mc.cores = ncores)
ans <- unlist(ans)
message("Pi: ", mean(ans))
saveRDS(ans, "02-mclapply.rds")
Bash file (02-mclapply.slurm):
#!/bin/sh
#SBATCH --job-name=mclapply
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=4
module load usc r
Rscript --vanilla 02-mclapply.R
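A note on the hard-coded ncores: when Slurm grants the allocation, it also exposes it through the SLURM_CPUS_PER_TASK environment variable, so an optional tweak (not part of the original script) keeps the R code in sync with the #SBATCH flags:

# Read the number of allocated cores; fall back to 1 when running outside Slurm
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))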
In this case, there is no simple way to submit a multinode job to Slurm… unless you use the slurmR package (see installation instructions here).
Once you have the slurmR package on your system, you can proceed as follows.
R script (03-parsapply-slurmr.R):
# Model parameters
nsims  <- 1e3
n      <- 1e4
ncores <- 4L

# Function to simulate pi
simpi <- function(i) {
  p <- matrix(runif(n*2, -1, 1), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}

# Setting up slurmR
library(slurmR) # This also loads the parallel package

# Making the cluster, and exporting the variables
cl <- makeSlurmCluster(ncores)

# Approximation
clusterExport(cl, c("n", "simpi"))
ans <- parSapply(cl, 1:nsims, simpi)

# Closing connection
stopCluster(cl)
message("Pi: ", mean(ans))
saveRDS(ans, "03-parsapply-slurmr.rds")
Bash file (03-parsapply-slurmr.slurm):
#!/bin/sh
#SBATCH --job-name=parsapply
#SBATCH --time=00:10:00
module load usc r
Rscript --vanilla 03-parsapply-slurmr.R
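One caveat: calling set.seed() in the main session does not seed the workers that makeSlurmCluster() spawns. If you need reproducible draws across the SOCK cluster, one option is parallel's clusterSetRNGStream() (a sketch; the seed value is arbitrary, and cl is the cluster created above):

# Right after cl <- makeSlurmCluster(ncores):
clusterSetRNGStream(cl, 12322) # gives each worker a reproducible RNG stream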
Another way to submit jobs is using job arrays. A job array is essentially a job that is repeated njobs times with the same configuration. What differs between replicates is what you do with the SLURM_ARRAY_TASK_ID environment variable: it is defined within each replicate, so each "sub-job" can behave differently depending on its value.
Here is a quick example using R:
ID <- Sys.getenv("SLURM_ARRAY_TASK_ID")
if (ID == 1) {
  ...[do this]...
} else if (ID == 2) {
  ...[do that]...
}
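For reference, submitting such a script as a raw job array (without slurmR) only requires adding the --array flag to the bash file; a sketch, where 06-array.R stands for a hypothetical script that branches on SLURM_ARRAY_TASK_ID as above:

#!/bin/sh
#SBATCH --job-name=array-example
#SBATCH --time=00:10:00
#SBATCH --array=1-4
module load usc r
Rscript --vanilla 06-array.R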
The slurmR R package makes submitting job arrays easy. Again, with the simulation of pi, we can do it in the following way:
R script (04-slurm_sapply.R):
# Model parameters
nsims <- 1e3
n     <- 1e4
# ncores <- 4L
njobs <- 4L

# Function to simulate pi
simpi <- function(i, n.) {
  p <- matrix(runif(n.*2, -1, 1), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}

# Setting up slurmR
library(slurmR) # This also loads the parallel package

# Approximation
ans <- Slurm_sapply(
  1:nsims, simpi,
  n.       = n,
  njobs    = njobs,
  plan     = "collect",
  tmp_path = "/scratch/vegayon" # This is where all temp files will be exported
)
message("Pi: ", mean(ans))
saveRDS(ans, "04-slurm_sapply.rds")
Bash file (04-slurm_sapply.slurm):
#!/bin/sh
#SBATCH --job-name=slurm_sapply
#SBATCH --time=00:10:00
module load usc r
Rscript --vanilla 04-slurm_sapply.R
The main benefits of using this approach, instead of the makeSlurmCluster function (and thus working with a SOCK cluster), are the following:
The number of jobs is not limited by R (only by the cluster administrator).
If a job fails, we can re-run it using sbatch once again (see example here).
You can check the individual logs of each process using the Slurm_log() function.
You can submit the job and quit the R session without waiting for it to finalize; you can always read the job back using the function read_slurm_job([path-to-the-temp]). A sketch of this workflow follows this list.
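As a sketch of that last point (assuming the same nsims, n, njobs, and simpi as in 04-slurm_sapply.R, and that Slurm_collect() is used to retrieve the results of a submitted job):

library(slurmR)
job <- Slurm_sapply(
  1:nsims, simpi,
  n.       = n,
  njobs    = njobs,
  plan     = "submit",          # submit and return control right away
  tmp_path = "/scratch/vegayon"
)
# ... you can now quit R; later, collect the results once the jobs finish:
ans <- Slurm_collect(job)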
The slurmR package has a function named sourceSlurm that can be used to avoid creating the .slurm file. The user can add the SBATCH options to the top of the R script (including the #!/bin/sh line) and submit the job from within R as follows:
R script (05-sapply.R):
#!/bin/sh
#SBATCH --job-name=sapply-sourceSlurm
#SBATCH --time=00:10:00
# Model parameters
nsims <- 1e3
n     <- 1e4

# Function to simulate pi
simpi <- function(i) {
  p <- matrix(runif(n*2, -1, 1), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}

# Approximation
set.seed(12322)
ans <- sapply(1:nsims, simpi)
message("Pi: ", mean(ans))
saveRDS(ans, "05-sapply.rds")
From the R console (it is OK if you are on the head node):
slurmR::sourceSlurm("05-sapply.R")
And voilà! A temporary bash file will be generated and used to submit the R script to the queue.