<- function(n = 100, k = 4, lambda = 4) {
fun1 <- NULL
x
for (i in 1:n){
<- rbind(x, rpois(k, lambda))
x
}
return(x)
}
<- function(n = 100, k = 4, lambda = 4) {
fun1alt # YOUR CODE HERE
}
Lab 8 - Efficiency and Parallelization
Learning goals
In this lab, you are expected to learn/put in practice the following skills:
- Evaluate whether a problem can be parallelized or not.
- Practice with the
parallel
package.
Problem 1: Vectorization
The following functions can be written to be more efficient without using parallelization. Write a faster version of each function and show that (1) the outputs are the same as the slow version, and (2) your version is faster.
- This function generates an
n x k
dataset with all its entries drawn from a Poission distribution with meanlambda
.
Show that fun1alt
generates a matrix with the same dimensions as fun1
and that the values inside the two matrices follow similar distributions. Then check the speed of the two functions with the following code:
# Benchmarking
::microbenchmark(
microbenchmarkfun1(),
fun1alt()
)
- This function finds the maximum value of each column of a matrix.
# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
<- matrix(rnorm(1e4), nrow=10)
x
# Find each column's max value
<- function(x) {
fun2 apply(x, 2, max)
}
<- function(x) {
fun2alt # YOUR CODE HERE
}
Show that both functions return the same output for a given input matrix, x
. Then check the speed of the two functions, as you did above.
Problem 3: Parallelization
We will now turn our attention to the statistical concept of bootstrapping. Among its many uses, non-parametric bootstrapping allows us to obtain confidence intervals for parameter estimates without relying on parametric assumptions. Don’t worry if these concepts are unfamiliar, we only care about the computation methods in this lab, not the statistics.
The main assumption is that we can approximate the results of many repeated experiments by resampling observations from our original dataset, which reflects the population.
- This function implements a serial version of the bootstrap. Edit this function to parallelize the
lapply
loop, using whichever method you prefer. Rather than specifying the number of cores to use, use the number given by thencpus
argument, so that we can test it with different numbers of cores later.
<- function(dat, stat, R, ncpus = 1L) {
my_boot
# Getting the random indices
<- nrow(dat)
n <- matrix(sample.int(n, n*R, TRUE), nrow=n, ncol=R)
idx
# THIS FUNCTION NEEDS TO BE PARALELLIZED
# EDIT THIS CODE:
<- lapply(seq_len(R), function(i) {
ans stat(dat[idx[,i], , drop=FALSE])
})
# Converting the list into a matrix
<- do.call(rbind, ans)
ans
return(ans)
}
- Once you have a version of the
my_boot()
function that runs on multiple cores, check that it provides accurate results by comparing it to a parametric model:
# Bootstrap of an OLS
<- function(d) coef(lm(y ~ x, data=d))
my_stat
# DATA SIM
set.seed(1)
<- 500; R <- 1e4
n
<- cbind(rnorm(n)); y <- x*5 + rnorm(n)
x
# Checking if we get something similar as lm
<- confint(lm(y~x))
ans0 <- my_boot(dat = data.frame(x, y), my_stat, R = R, ncpus = 2L)
ans1
# You should get something like this
t(apply(ans1, 2, quantile, c(.025,.975)))
## 2.5% 97.5%
## (Intercept) -0.1372435 0.05074397
## x 4.8680977 5.04539763
ans0## 2.5 % 97.5 %
## (Intercept) -0.1379033 0.04797344
## x 4.8650100 5.04883353
- Check whether your version actually goes faster when it’s run on multiple cores (since this might take a little while to run, we’ll use
system.time
and just run each version once, rather thanmicrobenchmark
, which would run each version 100 times, by default):
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 1L))
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 2L))