Assignment 4 - HPC and SQL

Due Date

This assignment is due by 11:59pm Pacific Time on Friday, November 21st, 2025.

The learning objectives are to write faster code for a computational task requiring a loop and to implement some queries and basic data wrangling in SQL.

HPC

For Questions 1 and 2, rewrite the provided R functions to make them faster. It is OK (and recommended) to take a look at StackOverflow and Google.

Using the synthetic dataset provided, show that your new function returns the same output as the original function. Then use microbenchmark to show that your version is faster and provide a visualization of the runtimes.

Question 1 (20 points)

  • fun1: Find the total for each row of a numeric matrix. Output should be a vector with length equal to the number of rows of the input matrix.
# Total for each row
fun1 <- function(mat) {
  n <- nrow(mat)
  ans <- double(n) 
  for (i in 1:n) {
    ans[i] <- sum(mat[i, ])
  }
  ans
}

fun1alt <- function(mat) {
  # YOUR CODE HERE
}

Generate synthetic data (dat):

# Use the functions with this data
set.seed(2315)
dat <- matrix(rnorm(200 * 100), nrow = 200)

Question 2 (20 points)

  • fun2: Calculate the running (cumulative) total along each row of a numeric matrix. Output should be a matrix with dimensions equal to the input matrix.
# running cumulative sum by row
fun2 <- function(mat) {
  n <- nrow(mat)
  k <- ncol(mat)
  ans <- mat
  for (i in 1:n) {
    for (j in 2:k) {
      ans[i,j] <- mat[i, j] + ans[i, j - 1]
    }
  }
  ans
}

fun2alt <- function(mat) {
  # YOUR CODE HERE
}

Use the same synthetic datset as above.

Question 3 (20 points)

For this question, we will ask you to parallelize a process for computationally estimating the value of \(\pi\).

The following function allows us to estimate \(\pi\) through a simple simulation:

sim_pi <- function(n = 1000, i = NULL) {
  p <- matrix(runif(n*2), ncol = 2)
  mean(rowSums(p^2) < 1) * 4
}

# Here is an example of the run
set.seed(156)
sim_pi(1000) # 3.132

In order to get more accurate estimates, we can run this function multiple times and take the mean:

# This runs the simulation a 4,000 times, each with 10,000 points
set.seed(1231)
system.time({
  ans <- unlist(lapply(1:4000, sim_pi, n = 10000))
  print(mean(ans))
})

Rewrite the previous code using parLapply() (or your parallelization method of choice) to parallelize it. Run the code once, using system.time(), to show that your version is faster.

# YOUR CODE HERE
system.time({
  # YOUR CODE HERE
  ans <- # YOUR CODE HERE
  print(mean(ans))
  # YOUR CODE HERE
})

Visualize your results in two ways:

  1. Show that both versions produce similar ranges of estimates for \(\pi\).
  2. Show that the parallel version runs faster.

SQL

Setup a temporary database of movie data by running the following code chunk:

# install.packages(c("RSQLite", "DBI"))

library(RSQLite)
library(DBI)

# Initialize a temporary in memory database
con <- dbConnect(SQLite(), ":memory:")

# Download tables
film <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/film.csv")
film_category <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/film_category.csv")
category <- read.csv("https://raw.githubusercontent.com/ivanceras/sakila/master/csv-sakila-db/category.csv")

# Copy data.frames to database
dbWriteTable(con, "film", film)
dbWriteTable(con, "film_category", film_category)
dbWriteTable(con, "category", category)

When you write a new chunk, remember to replace the r with sql, connection=con. Some of these questions will require you to use an inner join. Read more about them here https://www.w3schools.com/sql/sql_join_inner.asp

Answer each of the questions below with SQL queries.

Question 4 (10 points)

How many movies are available in each rating category?

Question 5 (10 points)

What is the average replacement cost and rental rate for each rating category?

Question 6 (10 points)

Use table film_category together with film to find how many films there are with each category ID.

Question 7 (10 points)

Incorporate the category table into the answer to the previous question to find the name of the most popular category.