Workshop: Introduction to R
(for HPC users)

George G. Vega Yon
University of Southern California
Department of Preventive Medicine


USC Integrative Methods of Analysis for Genomic Epidemiology (IMAGE)
Department of Preventive Medicine
July 7th, 2018

Agenda

  1. Part 1: Talk about the basics: What is R? How to get help?

  2. Part 2: R language fundamentals.

  3. Part 3: An extended example using HPCC

Part 1: Talk about the basics: What is R? How to get help?

Before we start

We will be using:

The R programming language

A little bit of history

Picture by the New York Times https://nyti.ms/2GC3ruP
From left to right: John Chambers, Robert Gentleman, and Ross Ihaka.

(Source: Wikipedia)

The first lesson: Getting help

in R

In R, if you want to:

On the web

Books that I recommend

The first lesson: Getting help (How to read it?)

The first lesson: Getting help (a mental model)

My personal approach to finding R-based solutions to my problems (in science… of course)

Questions

  1. Using the stats package, how can you estimate a generalized linear model in R?

  2. What is the command to transpose a matrix in R? What about the command for inverting a matrix?

  3. Looking at CRAN task Views, what R packages are available for convex optimization? What about stochastic optimization?

  4. Create a list of R packages that provide wrappers for working with Slurm.

  5. What does the function for fitting nonlinear least squares in the stats package return?

Part 2: R language fundamentals.

Creating objects
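
A minimal sketch of how objects are created in R (illustrative; the variable names here are made up):

```r
# Objects are created with the assignment operator <-
x <- 5                  # a numeric scalar (really a length-1 vector)
y <- c(1, 2, 3)         # a numeric vector built with c()
z <- "hello"            # a character string

# Inspect and clean up the workspace
ls()                    # list objects in the current environment
rm(z)                   # remove an object
```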

Data types
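
As a quick illustration of R's atomic data types (a sketch, not the original slide content):

```r
# R's basic atomic types, as reported by mode() and typeof()
mode(TRUE)     # "logical"
mode(1L)       # "numeric"  -- integers have mode numeric...
typeof(1L)     # "integer"  -- ...but integer storage type
typeof(1.5)    # "double"
mode("a")      # "character"
mode(1 + 2i)   # "complex"
```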

Attributes and Structure

Missing values

R has several types of missing values:

str(c(NA, 1L))        # Integers can have NAs
##  int [1:2] NA 1
str(c(NaN, 1L, Inf))  # But not NaN or Inf (automatic coercion)
##  num [1:3] NaN 1 Inf
str(c(-Inf, 1, NULL, +Inf)) # And Nulls are... of length 0!
##  num [1:3] -Inf 1 Inf

Questions

  1. What is the mode of the following vector myvector <- c(NA, NaN, Inf)? (try not to use the mode() function in R)

  2. The c() function can be used with other vectors, for example

    my_integer_vector <- c(1L, 2L, 3L)    
    my_string_vector  <- c("hello", "world")

    What is the mode of the vector c(my_integer_vector, my_string_vector)?

  3. What does each of the functions is.na(), is.null(), is.infinite(), is.finite(), and is.nan() return on the vector myvector?

  4. What are the attributes of the following object mymat <- matrix(runif(12), ncol=4)?

Linear Algebra
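
A brief sketch of the basic linear-algebra operators (relevant to the questions that follow; the matrix here is made up):

```r
A <- matrix(c(2, 0, 1, 3), ncol = 2)

t(A)            # transpose
solve(A)        # matrix inverse
A %*% solve(A)  # matrix product; should be ~ the identity
det(A)          # determinant
crossprod(A)    # t(A) %*% A, computed more efficiently
```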

Questions

Other fundamental types

Other fundamental types of objects in R are:
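
The slide's own list is not reproduced here; as an illustrative sketch, other fundamental object types commonly covered at this point include lists, factors, and data frames:

```r
# Lists: heterogeneous containers
l <- list(numbers = 1:3, label = "a")

# Factors: categorical data with a fixed set of levels
f <- factor(c("low", "high", "low"))
levels(f)              # levels are sorted: "high" "low"

# Data frames: tabular data (a list of equal-length vectors)
d <- data.frame(id = 1:2, name = c("a", "b"))
str(d)
```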

Statistical Functions

set.seed(12)
op <- par(mfrow = c(2,2))
hist(rnorm(1e5))
curve(qnorm)
curve(pnorm, xlim=c(-3, 3))
curve(dnorm, xlim=c(-3, 3))

par(op)
set.seed(12)
op <- par(mfrow = c(2,2))
hist(rexp(1e5))
curve(qexp)
curve(pexp, xlim=c(0, 6))
curve(dexp, xlim=c(0, 6))

par(op)

Questions

  1. Draw 1e5 samples from a \(\chi^2\) distribution with 2 degrees of freedom (hint: check ?Distributions).

  2. Draw 1e5 samples from a \(\chi^2\) distribution with 2 degrees of freedom using rnorm (hint: recall that if \(X\sim N(0,1)\), then \(X^2\sim\chi^2_1\), and if \(X, Y\sim N(0,1)\) independently, then \(X^2 + Y^2\sim\chi^2_2\)).

Control-flow statements
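
A minimal sketch of R's control-flow statements (the values used are illustrative):

```r
# if/else
x <- 5
if (x > 0) message("positive") else message("non-positive")

# for loop: sum the numbers 1 through 10
total <- 0
for (i in 1:10) total <- total + i
total   # 55

# while loop: double n until it reaches at least 100
n <- 1
while (n < 100) n <- n * 2
n       # 128
```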

Functions
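
A short sketch of how functions are defined (the function names here are made up for illustration):

```r
# Functions are created with function(); the value of the last
# evaluated expression is returned (or use return() explicitly)
square <- function(x) x^2
square(4)   # 16

# Default arguments and the ... (dots) pass-through
mymean <- function(x, na.rm = TRUE, ...) mean(x, na.rm = na.rm, ...)
mymean(c(1, 2, NA))   # 1.5
```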

Part 3: An extended example using HPCC

Agenda

  1. High-Performance Computing: An overview

  2. Parallel computing in R

  3. Extended examples

First: How to use R on HPC

There are two things you need to do to use R on HPC:

  1. Source the corresponding R version: For example, if you want to work with version 3.4, you could just type

    source /usr/usc/R/3.4.0/setup.sh

    You can also include that line in your ~/.bash_profile file so that it is done automatically when you log in.

  2. Specify the R library path: In order to use R packages that were installed in your user library while running Slurm jobs (for example), you have to specify the library path. There are a few ways of doing it:

    1. Use the .libPaths() command at the beginning of your R script
    2. Use the lib.loc option when calling library()
    3. Use the .Renviron file and set the R_LIBS value (see ?Renviron)

We have examples at the end of the presentation.
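
For instance, the three options above can be sketched as follows (the library path shown is just an example; adjust it to your own setup):

```r
# Option 1: point R at your user library at the top of the script
.libPaths("~/R/x86_64-pc-linux-gnu-library/3.4/")

# Option 2: give the path directly when loading a package
library(ABCoptim, lib.loc = "~/R/x86_64-pc-linux-gnu-library/3.4/")

# Option 3: set R_LIBS in your ~/.Renviron file instead, e.g.
# R_LIBS=~/R/x86_64-pc-linux-gnu-library/3.4/
```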

High-Performance Computing: An overview

Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:

  1. Big data: How to work with data that doesn’t fit your computer

  2. Parallel computing: How to take advantage of multiple core systems

  3. Compiled code: Write your own low-level code (if R doesn’t have it yet…)

(Checkout CRAN Task View on HPC)

Big Data

Parallel computing

Flynn’s Classical Taxonomy (Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Whatis)

GPU vs CPU

NVIDIA Blog (http://www.nvidia.com/object/what-is-gpu-computing.html)

When is it a good idea?

Ask yourself these questions before jumping into HPC!

Ask yourself these questions before jumping into HPC!

Parallel computing in R

While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R-packages for explicit parallelism:

Implicit parallelism, on the other hand, refers to out-of-the-box tools that allow the programmer not to worry about parallelization, such as gpuR for matrix manipulation using the GPU.

Parallel workflow

  1. Create a cluster:

    1. PSOCK Cluster: makePSOCKcluster: Creates brand-new R sessions (so nothing is inherited from the master), even on other computers!

    2. Fork Cluster: makeForkCluster: Using OS Forking, copies the current R session locally (so everything is inherited from the master up to that point). Not available on Windows.

    3. Other: makeCluster, which is passed on to the snow package

  2. Copy/prepare each R session:

    1. Copy objects with clusterExport

    2. Pass expressions with clusterEvalQ

    3. Set a seed

  3. Do your call:

    1. mclapply, mcmapply if you are using Fork

    2. parApply, parLapply, etc. if you are using PSOCK

  4. Stop the cluster with stopCluster

parallel example 1: Parallel RNG

# 1. CREATING A CLUSTER
library(parallel)
cl <- makePSOCKcluster(2)    

# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`

# 3. DO YOUR CALL
ans <- parSapply(cl, 1:2, function(x) runif(1e3))
(ans0 <- var(ans))
#               [,1]          [,2]
# [1,]  0.0861888293 -0.0001633431
# [2,] -0.0001633431  0.0853841838
# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:2, function(x) runif(1e3)))

all.equal(ans0, ans1) # All equal!
# [1] TRUE
# 4. STOP THE CLUSTER
stopCluster(cl)

parallel example 1: Parallel RNG (cont.)

In the case of makeForkCluster

# 1. CREATING A CLUSTER
library(parallel)

# The fork cluster will copy the -nsims- object
nsims <- 1e3
cl    <- makeForkCluster(2)    

# 2. PREPARING THE CLUSTER
RNGkind("L'Ecuyer-CMRG")
set.seed(123) 

# 3. DO YOUR CALL
ans <- do.call(cbind, mclapply(1:2, function(x) {
  runif(nsims) # Look! we use the nsims object!
               # This would have failed with makePSOCKcluster
               # if we hadn't copied -nsims- first.
  }))
(ans0 <- var(ans))

# Same sequence with same seed
set.seed(123) 
ans1 <- var(do.call(cbind, mclapply(1:2, function(x) runif(nsims))))

ans0 - ans1 # A matrix of zeros

# 4. STOP THE CLUSTER
stopCluster(cl)

parallel example 2: Simulating \(\pi\)

The R code to do this

pisim <- function(i, nsim) {  # Notice we don't use the -i-
  # Random points
  ans  <- matrix(runif(nsim*2), ncol=2)
  
  # Distance to the origin
  ans  <- sqrt(rowSums(ans^2))
  
  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}

parallel example 2: Simulating \(\pi\) (cont.)

# Setup
cl <- makePSOCKcluster(10)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
#       test replications elapsed relative
# 1 parallel            1   0.650    1.000
# 2   serial            1   1.166    1.794
ans_par <- parSapply(cl, 1:100, pisim, nsim=nsim)
ans_ser <- sapply(1:100, pisim, nsim=nsim)
stopCluster(cl)
#      par      ser        R 
# 3.141561 3.141677 3.141593

Slurm Example 1

# Include this to tell where everything will be living at
.libPaths("~/R/x86_64-pc-linux-gnu-library/3.4/")

# Default CRAN mirror from where to download R packages
options(repos = c(CRAN = "https://cloud.r-project.org/"))

# You need to have the ABCoptim R package
library(ABCoptim)

fun <- function(x) {
  -cos(x[1])*cos(x[2])*exp(-((x[1] - pi)^2 + (x[2] - pi)^2))
}

ans <- abc_optim(rep(0,2), fun, lb=-10, ub=10, criter=50)

saveRDS(
   ans,
   file = paste0(
      "~/hpc-with-r/examples/01-slurm-abcoptim-",
      Sys.getenv("SLURM_JOB_ID"),                 # SLURM ENV VAR
      "-",
      Sys.getenv("SLURM_ARRAY_TASK_ID"),          # SLURM ENV VAR
      ".rds"
))

Now you try it!

RcppArmadillo + OpenMP + Slurm: Using the rslurm package

Now you try it!

Thanks!

# R version 3.4.4 (2018-03-15)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.5 LTS
# 
# Matrix products: default
# BLAS: /usr/lib/libblas/libblas.so.3.0
# LAPACK: /usr/lib/lapack/liblapack.so.3.0
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] parallel  stats     graphics  grDevices utils     datasets  methods  
# [8] base     
# 
# loaded via a namespace (and not attached):
#  [1] compiler_3.4.4  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
#  [5] tools_3.4.4     htmltools_0.3.6 yaml_2.1.19     Rcpp_0.12.17   
#  [9] stringi_1.2.3   rmarkdown_1.10  jpeg_0.1-8      highr_0.7      
# [13] knitr_1.20      stringr_1.3.1   digest_0.6.15   evaluate_0.10.1

See also

For more, checkout the CRAN Task View on HPC

References

Jones, O., R. Maillardet, and A. Robinson. 2009. Introduction to Scientific Programming and Simulation Using R. Chapman & Hall/Crc the R Series. CRC Press. https://books.google.com/books?id=gnZC525wnzIC.

Marchand, Philippe. 2017. Rslurm: Submit R Calculations to a Slurm Cluster. https://CRAN.R-project.org/package=rslurm.

Matloff, N. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press Series. No Starch Press. https://books.google.com/books?id=o2aLBAAAQBAJ.

Peng, R. 2012. R Programming for Data Science. Lulu.com. https://books.google.com/books?id=GSePDAEACAAJ.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Vega Yon, George, and Enyelbert Muñoz. 2017. ABCoptim: An Implementation of the Artificial Bee Colony (ABC) Algorithm. https://github.com/gvegayon/ABCoptim.

Wickham, H. 2015. Advanced R. Chapman & Hall/Crc the R Series. CRC Press. https://books.google.com/books?id=FfsYCwAAQBAJ.

Wickham, H., and G. Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. https://books.google.com/books?id=vfi3DQAAQBAJ.