# Download the data
stations <- read.csv("https://noaa-isd-pds.s3.amazonaws.com/isd-history.csv")
# Non-numeric IDs become NA (with a coercion warning)
stations$USAF <- as.integer(stations$USAF)

# Dealing with blanks and 999999
stations$USAF[stations$USAF == 999999] <- NA
stations$CTRY[stations$CTRY == ""] <- NA
stations$STATE[stations$STATE == ""] <- NA

# Selecting the three relevant columns, and keeping unique records
stations <- unique(stations[, c('USAF', 'CTRY', 'STATE')])

# Dropping NAs
stations <- stations[!is.na(stations$USAF), ]

# Removing duplicates
stations <- stations[!duplicated(stations$USAF), ]
Lab 5 - Data Wrangling
Learning goals
- Use the merge() function to join two datasets.
- Deal with missingness and impute data.
- Identify relevant observations using quantile().
Lab description
For this lab we will again work with the meteorological dataset downloaded from NOAA, the met dataset.
Setup
- Load the data.table package (and the dtplyr and dplyr packages if you plan to work with those).
- Load the met data from https://raw.githubusercontent.com/USCbiostats/data-science-data/master/02_met/met_all.gz, and also the station data. For the latter, you can use the code we used during lecture to pre-process the stations data (the snippet at the top of this page).
- Merge the data as we did during the lecture.
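As a rough sketch of the merge step, the example below joins two toy tables with base R's merge(). The key column names (USAFID in the met data, USAF in the stations data) are assumptions based on the lecture data, and all values are made up for illustration.

```r
# Toy stand-ins for the real tables. Column names are assumed; values are invented.
met <- data.frame(
  USAFID = c(690150, 690150, 720110, 999999),
  temp   = c(30.1, 31.4, 22.0, 18.5)
)
stations <- data.frame(
  USAF  = c(690150, 720110),
  CTRY  = c("US", "US"),
  STATE = c("CA", "TX")
)

# Left join: keep every met record and attach station info where a match exists
met_merged <- merge(
  x = met, y = stations,
  by.x = "USAFID", by.y = "USAF",
  all.x = TRUE, all.y = FALSE
)
print(met_merged)
```

With all.x = TRUE, met records whose station is missing from the stations table are kept, and their CTRY and STATE are set to NA rather than the rows being dropped.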
Question 1: Representative station for the US
What is the median station in terms of temperature, wind speed, and atmospheric pressure? Look for the three weather stations that best represent continental US using the quantile() function. Do these three coincide?
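One possible way to approach this, sketched on made-up per-station averages: compute the median of each variable with quantile() and pick the station whose value is closest to it. The helper closest_to_median() and the column names are hypothetical, not part of the lab materials.

```r
# Hypothetical per-station averages (invented values)
station_avg <- data.frame(
  USAFID = c(1, 2, 3, 4, 5),
  temp   = c(18.2, 21.5, 23.7, 25.1, 29.8),
  wind   = c(2.5, 1.9, 3.3, 2.4, 4.0),
  press  = c(1010.2, 1014.6, 1012.5, 1015.1, 1018.9)
)

# Hypothetical helper: ID of the station whose value is closest to the median of x
closest_to_median <- function(id, x) {
  med <- quantile(x, probs = 0.5, na.rm = TRUE)
  id[which.min(abs(x - med))]  # which.min() skips NA values
}

median_ids <- c(
  temp  = closest_to_median(station_avg$USAFID, station_avg$temp),
  wind  = closest_to_median(station_avg$USAFID, station_avg$wind),
  press = closest_to_median(station_avg$USAFID, station_avg$press)
)
print(median_ids)
```

In this toy data the three variables point to three different stations, which is the kind of disagreement the question asks you to check for.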
Question 2: Representative station per state
As in the previous question, calculate the median latitude and longitude for stations in each state.
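A minimal sketch of a per-state median using base R's aggregate(), assuming the station coordinates are stored in columns named lat and lon (an assumption; the values are invented):

```r
# Toy station coordinates (invented values, assumed column names)
stations_geo <- data.frame(
  STATE = c("CA", "CA", "CA", "TX", "TX"),
  lat   = c(32.7, 34.1, 37.8, 29.8, 32.8),
  lon   = c(-117.2, -118.2, -122.4, -95.4, -96.8)
)

# Median latitude and longitude per state.
# The formula interface drops rows with missing values by default.
state_median <- aggregate(
  cbind(lat, lon) ~ STATE,
  data = stations_geo,
  FUN  = median
)
print(state_median)
```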
Question 3: In the middle?
For each state, identify the station that is closest to the mean latitude and longitude for stations in that state. Combining these with the coordinates you identified in the previous question, use leaflet() to visualize all ~100 points in the same figure, applying different colors for those identified in this question.
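The sketch below shows one way to find the station nearest to each state's mean coordinates, using plain Euclidean distance in degrees (a simplification that ignores the Earth's curvature). Column names and values are assumptions. The mapping step is only a minimal illustration and runs only if the leaflet package is installed.

```r
# Toy station coordinates (invented values, assumed column names)
stations_geo <- data.frame(
  USAFID = c(1, 2, 3, 4, 5, 6),
  STATE  = c("CA", "CA", "CA", "TX", "TX", "TX"),
  lat    = c(32.7, 34.1, 37.8, 29.8, 32.8, 30.3),
  lon    = c(-117.2, -118.2, -122.4, -95.4, -96.8, -97.7)
)

# Mean coordinates per state, repeated on every station row
stations_geo$lat_mean <- ave(stations_geo$lat, stations_geo$STATE, FUN = mean)
stations_geo$lon_mean <- ave(stations_geo$lon, stations_geo$STATE, FUN = mean)

# Euclidean distance (in degrees) from each station to its state's mean point
stations_geo$dist <- sqrt(
  (stations_geo$lat - stations_geo$lat_mean)^2 +
  (stations_geo$lon - stations_geo$lon_mean)^2
)

# Sort by state and distance, then keep the first (closest) row per state
ord     <- stations_geo[order(stations_geo$STATE, stations_geo$dist), ]
closest <- ord[!duplicated(ord$STATE), ]
print(closest[, c("STATE", "USAFID", "lat", "lon")])

# Optional map: one red marker per selected station
if (requireNamespace("leaflet", quietly = TRUE)) {
  m <- leaflet::leaflet(closest)
  m <- leaflet::addTiles(m)
  m <- leaflet::addCircleMarkers(m, lng = ~lon, lat = ~lat, color = "red")
  # Printing `m` in an interactive session displays the map
}
```

To overlay a second set of points, such as those from the previous question, another addCircleMarkers() call with a different color and its own data argument could be added to the same map object.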
Question 4: Means of means
Compute each state’s average temperature and use that to classify the states according to the following criteria:
- Low: temp < 20
- Mid: temp >= 20 and temp < 25
- High: temp >= 25
Now generate a large summary table with states/observations split into the three categories listed above. For each group, compute the following:
- Number of states
- Number of entries (records)
- Number of NA entries
- Number of stations
- Mean temperature
- Mean wind-speed
- Mean atmospheric pressure
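The classification and summary could be sketched as follows, using cut() for the three levels and a split/apply step for the table. The column names (USAFID, STATE, temp, wind.sp, atm.press) are assumptions, the values are invented, and "Number of NA entries" is interpreted here as records with a missing temperature, which is only one possible reading.

```r
# Toy merged data (invented values, assumed column names)
met <- data.frame(
  USAFID    = c(1, 1, 2, 3, 4, 4),
  STATE     = c("WA", "WA", "CO", "CO", "TX", "TX"),
  temp      = c(17, 19, 22, NA, 27, 29),
  wind.sp   = c(2, 3, 4, 5, NA, 3),
  atm.press = c(1012, 1014, 1010, 1011, 1009, NA)
)

# State-level mean temperature (rows with missing temp are dropped by the formula interface)
state_temp <- aggregate(temp ~ STATE, data = met, FUN = mean)
names(state_temp)[2] <- "state_temp"

# right = FALSE gives intervals [a, b): 20 falls in Mid and 25 falls in High
state_temp$level <- cut(
  state_temp$state_temp,
  breaks = c(-Inf, 20, 25, Inf),
  labels = c("Low", "Mid", "High"),
  right  = FALSE
)

# Attach each state's level to its records
met <- merge(met, state_temp, by = "STATE")

# One summary row per temperature level
summary_tab <- do.call(rbind, lapply(split(met, met$level), function(d) {
  data.frame(
    level      = d$level[1],
    n_states   = length(unique(d$STATE)),
    n_entries  = nrow(d),
    n_na_temp  = sum(is.na(d$temp)),  # assumed meaning of "NA entries"
    n_stations = length(unique(d$USAFID)),
    mean_temp  = mean(d$temp, na.rm = TRUE),
    mean_wind  = mean(d$wind.sp, na.rm = TRUE),
    mean_press = mean(d$atm.press, na.rm = TRUE)
  )
}))
print(summary_tab)
```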