Lab 5 - Data Wrangling

Learning goals

  • Use the merge() function to join two datasets.
  • Deal with missingness and impute data.
  • Identify relevant observations using quantile().

Lab description

For this lab we will again work with the met dataset, the meteorological data downloaded from NOAA.

Setup

  1. Load the data.table package (and the dtplyr and dplyr packages if you plan to work with those).

  2. Load the met data from https://raw.githubusercontent.com/USCbiostats/data-science-data/master/02_met/met_all.gz, and also the station data. For the latter, you can use the code we used during lecture to pre-process the stations data:

# Download the data
stations <- read.csv("https://noaa-isd-pds.s3.amazonaws.com/isd-history.csv")
stations$USAF <- as.integer(stations$USAF)

# Dealing with blanks and 999999
stations$USAF[stations$USAF == 999999] <- NA
stations$CTRY[stations$CTRY == ""] <- NA
stations$STATE[stations$STATE == ""] <- NA

# Selecting the three relevant columns, and keeping unique records
stations <- unique(stations[, c('USAF', 'CTRY', 'STATE')])

# Dropping NAs
stations <- stations[!is.na(stations$USAF), ]

# Removing duplicates
stations <- stations[!duplicated(stations$USAF), ]

  3. Merge the data as we did during the lecture.
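
As a reminder of how merge() behaves, here is a minimal sketch on toy data. The tables met_toy and stations_toy, their values, and the key column name USAFID are assumptions made for illustration; check the column names in your own met table before adapting it.

```r
# Toy stand-ins for the real tables (all values invented for illustration)
met_toy <- data.frame(
  USAFID = c(690150L, 690150L, 720110L, 999001L),  # assumed key name in met
  temp   = c(37.2, 35.6, 28.1, 20.0)
)
stations_toy <- data.frame(
  USAF  = c(690150L, 720110L),
  CTRY  = c("US", "US"),
  STATE = c("CA", "TX")
)

# Left join: keep every met record and attach station info where it exists
merged_toy <- merge(
  x = met_toy, y = stations_toy,
  by.x = "USAFID", by.y = "USAF",
  all.x = TRUE, all.y = FALSE
)

# Records without a matching station get NA in the station columns
print(merged_toy)
print(sum(is.na(merged_toy$STATE)))
```

With all.x = TRUE the number of met records should not change after the merge, which is a quick sanity check worth running on the real data as well.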

Question 1: Representative station for the US

What is the median station in terms of temperature, wind speed, and atmospheric pressure? Look for the three weather stations that best represent the continental US using the quantile() function. Do these three coincide?
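
One possible approach, sketched here on invented per-station averages for a single variable: compute the median with quantile() and pick the station whose average is closest to it. The table station_avg and its values are hypothetical, and type = 1 is only one of several reasonable choices.

```r
# Toy per-station averages (invented values)
station_avg <- data.frame(
  USAFID = c(1L, 2L, 3L, 4L, 5L),
  temp   = c(18.0, 22.5, 25.1, 19.7, 30.2)
)

# Median of the per-station averages; type = 1 returns an observed value
# instead of interpolating between two observations
med_temp <- quantile(station_avg$temp, probs = 0.5, na.rm = TRUE, type = 1)

# Station whose average is closest to that median
closest <- station_avg[which.min(abs(station_avg$temp - med_temp)), ]
print(closest)
```

Repeating the same steps for wind speed and atmospheric pressure gives the other two stations to compare.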

Question 2: Representative station per state

As in the previous question, calculate the median latitude and longitude for stations in each state.
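
A grouped median can be sketched in base R with aggregate(). The table stations_geo and the column names lat and lon are assumptions made for this toy example.

```r
# Toy station coordinates (invented values)
stations_geo <- data.frame(
  STATE = c("CA", "CA", "CA", "TX", "TX"),
  lat   = c(34.0, 36.5, 38.0, 29.5, 32.0),
  lon   = c(-118.2, -119.8, -121.5, -98.5, -97.0)
)

# Median latitude and longitude within each state
state_median <- aggregate(
  cbind(lat, lon) ~ STATE,
  data = stations_geo,
  FUN  = median
)
print(state_median)
```

Note that the formula interface of aggregate() drops rows with missing values by default, so stations without coordinates do not contribute to the medians.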

Question 3: In the middle?

For each state, identify the station that is closest to the mean latitude and longitude for stations in that state. Combining these with the coordinates you identified in the previous question, use leaflet() to visualize all ~100 points in the same figure, applying different colors for those identified in this question.
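
The distance step could look like the following sketch, which uses plain Euclidean distance in degrees on invented coordinates. The table stations_geo and its column names are hypothetical, and the leaflet() map itself is not shown here.

```r
# Toy station coordinates (invented values)
stations_geo <- data.frame(
  USAFID = c(1L, 2L, 3L, 4L, 5L, 6L),
  STATE  = c("CA", "CA", "CA", "TX", "TX", "TX"),
  lat    = c(34.0, 36.5, 38.0, 29.5, 31.0, 32.0),
  lon    = c(-118.0, -120.0, -121.0, -98.5, -97.5, -97.0)
)

# Per-state mean coordinates, repeated on every row of that state
stations_geo$lat_mid <- ave(stations_geo$lat, stations_geo$STATE, FUN = mean)
stations_geo$lon_mid <- ave(stations_geo$lon, stations_geo$STATE, FUN = mean)

# Euclidean distance (in degrees) from each station to its state's mean point
stations_geo$dist <- sqrt(
  (stations_geo$lat - stations_geo$lat_mid)^2 +
  (stations_geo$lon - stations_geo$lon_mid)^2
)

# Sort by state and distance, then keep the first (closest) row per state
ordered <- stations_geo[order(stations_geo$STATE, stations_geo$dist), ]
closest <- ordered[!duplicated(ordered$STATE), c("USAFID", "STATE", "lat", "lon")]
print(closest)
```

Stacking this result with the coordinates from the previous question, plus a column that records which question each row came from, gives a single table to color by when mapping.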

Question 4: Means of means

Compute each state’s average temperature and use it to classify the states according to the following criteria:

  • Low: temp < 20
  • Mid: temp >= 20 and temp < 25
  • High: temp >= 25

Now generate a large summary table with states/observations split into the three categories listed above. For each group, compute the following:

  • Number of states
  • Number of entries (records)
  • Number of NA entries
  • Number of stations
  • Mean temperature
  • Mean wind-speed
  • Mean atmospheric pressure
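
The classification and summary could be sketched as follows on invented records. The table obs, its column names, and the reading of "NA entries" as missing temperature values are assumptions; wind speed and atmospheric pressure are left out to keep the example short and would follow the same pattern as mean_temp.

```r
# Toy observations (invented values); one NA to exercise the missing-value count
obs <- data.frame(
  STATE  = c("WA", "WA", "CA", "CA", "TX", "TX", "TX"),
  USAFID = c(1L, 1L, 2L, 3L, 4L, 4L, 5L),
  temp   = c(17.0, 19.0, 22.0, NA, 28.0, 30.0, 26.0)
)

# Per-state mean temperature, repeated on every row of that state
obs$state_temp <- ave(obs$temp, obs$STATE,
                      FUN = function(x) mean(x, na.rm = TRUE))

# right = FALSE closes intervals on the left: [20, 25) is Mid, [25, Inf) is High
obs$temp_level <- cut(
  obs$state_temp,
  breaks = c(-Inf, 20, 25, Inf),
  labels = c("Low", "Mid", "High"),
  right  = FALSE
)

# One summary row per category
summary_tab <- do.call(rbind, lapply(split(obs, obs$temp_level), function(d) {
  data.frame(
    level      = as.character(d$temp_level[1]),
    n_states   = length(unique(d$STATE)),
    n_entries  = nrow(d),
    n_na       = sum(is.na(d$temp)),  # assumption: NA counted on temp only
    n_stations = length(unique(d$USAFID)),
    mean_temp  = mean(d$temp, na.rm = TRUE)
  )
}))
print(summary_tab)
```

Using cut() with right = FALSE keeps the boundaries consistent with the criteria above, where 20 belongs to Mid and 25 belongs to High.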