fun1 <- function(n = 100, k = 4, lambda = 4) {
x <- NULL
for (i in 1:n){
x <- rbind(x, rpois(k, lambda))
}
return(x)
}
fun1alt <- function(n = 100, k = 4, lambda = 4) {
# YOUR CODE HERE
}Lab 8 - Efficiency and Parallelization
Learning goals
In this lab, you are expected to learn/put in practice the following skills:
- Evaluate whether a problem can be parallelized or not.
- Practice with the
parallelpackage.
Problem 1: Vectorization
The following functions can be written to be more efficient without using parallelization. Write a faster version of each function and show that (1) the outputs are the same as the slow version, and (2) your version is faster.
- This function generates an
n x kdataset with all its entries drawn from a Poission distribution with meanlambda.
Show that fun1alt generates a matrix with the same dimensions as fun1 and that the values inside the two matrices follow similar distributions. Then check the speed of the two functions with the following code:
# Benchmarking
microbenchmark::microbenchmark(
fun1(),
fun1alt()
)- This function finds the maximum value of each column of a matrix.
# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
x <- matrix(rnorm(1e4), nrow=10)
# Find each column's max value
fun2 <- function(x) {
apply(x, 2, max)
}
fun2alt <- function(x) {
# YOUR CODE HERE
}Show that both functions return the same output for a given input matrix, x. Then check the speed of the two functions, as you did above.
Problem 3: Parallelization
We will now turn our attention to the statistical concept of bootstrapping. Among its many uses, non-parametric bootstrapping allows us to obtain confidence intervals for parameter estimates without relying on parametric assumptions. Don’t worry if these concepts are unfamiliar, we only care about the computation methods in this lab, not the statistics.
The main assumption is that we can approximate the results of many repeated experiments by resampling observations from our original dataset, which reflects the population.
- This function implements a serial version of the bootstrap. Edit this function to parallelize the
lapplyloop, using whichever method you prefer. Rather than specifying the number of cores to use, use the number given by thencpusargument, so that we can test it with different numbers of cores later.
my_boot <- function(dat, stat, R, ncpus = 1L) {
# Getting the random indices
n <- nrow(dat)
idx <- matrix(sample.int(n, n*R, TRUE), nrow=n, ncol=R)
# THIS FUNCTION NEEDS TO BE PARALELLIZED
# EDIT THIS CODE:
ans <- lapply(seq_len(R), function(i) {
stat(dat[idx[,i], , drop=FALSE])
})
# Converting the list into a matrix
ans <- do.call(rbind, ans)
return(ans)
}- Once you have a version of the
my_boot()function that runs on multiple cores, check that it provides accurate results by comparing it to a parametric model:
# Bootstrap of an OLS
my_stat <- function(d) coef(lm(y ~ x, data=d))
# DATA SIM
set.seed(1)
n <- 500; R <- 1e4
x <- cbind(rnorm(n)); y <- x*5 + rnorm(n)
# Checking if we get something similar as lm
ans0 <- confint(lm(y~x))
ans1 <- my_boot(dat = data.frame(x, y), my_stat, R = R, ncpus = 2L)
# You should get something like this
t(apply(ans1, 2, quantile, c(.025,.975)))
## 2.5% 97.5%
## (Intercept) -0.1372435 0.05074397
## x 4.8680977 5.04539763
ans0
## 2.5 % 97.5 %
## (Intercept) -0.1379033 0.04797344
## x 4.8650100 5.04883353- Check whether your version actually goes faster when it’s run on multiple cores (since this might take a little while to run, we’ll use
system.timeand just run each version once, rather thanmicrobenchmark, which would run each version 100 times, by default):
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 1L))
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 2L))