class: center, middle, title-slide

.title[
# Data Visualization
]
.subtitle[
## PM 566: Introduction to Health Data Science
]

---

<style type="text/css">
pre{ font-size:20px; }
code.r,code.cpp{ font-size:large }
</style>

## Acknowledgment

These slides were originally developed by Meredith Franklin (and Paul Marjoram) and modified by George G. Vega Yon.

---

## Background

<img src="img/ggplot2.png" width="25%" style="display: block; margin: auto;" />

This lecture introduces ggplot2, an R package that offers far more flexible graphics options than R's default plots, histograms, etc.

This section is based on chapter 3 of ["R for Data Science"](https://r4ds.had.co.nz/).

---

## Background

`ggplot2` is part of the Tidyverse.

The tidyverse is "an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures." (https://www.tidyverse.org/)

```r
library(tidyverse)
library(data.table)
```

---

## ggplot2

`ggplot2` is designed on the principle of adding layers.

<img src="img/layers.png" width="40%" style="display: block; margin: auto;" />

---

## ggplot2

* With ggplot2 a plot is initiated with the function `ggplot()`
* The first argument of `ggplot()` is the dataset to use in the graph
* Layers are added to `ggplot()` with `+`
* Layers include `geom` functions such as points, lines, etc.
* Each `geom` function takes a `mapping` argument, which is always paired with `aes()`
* The `aes()` mapping specifies which variables go on the x and y axes of the plot

```r
# Template: replace data, geom_function, and mappings with your own
ggplot(data = data) +
  geom_function(mapping = aes(mappings))
```

---

## Data

Continuing with the weather data from last week, let's take the daily averages at each site, keeping some of the variables. Let's also create a new variable for region (east and west), categorize elevation, and create a multi-category variable for visibility for exploratory purposes.
```r
# Reading the data, filtering, and replacing NAs
met <- fread("met_all.gz")
met <- met[temp > -10][elev == 9999.0, elev := NA]

# Creating an aggregated version of the dataset
met_avg <- met[,.(
  temp     = mean(temp, na.rm = TRUE),
  rh       = mean(rh, na.rm = TRUE),
  wind.sp  = mean(wind.sp, na.rm = TRUE),
  vis.dist = mean(vis.dist, na.rm = TRUE),
  lat      = mean(lat),
  lon      = mean(lon),
  elev     = mean(elev, na.rm = TRUE)
), by = c("USAFID", "day")]
```

---

```r
# New sets of variables using "fast" ifelse from data.table
met_avg[, region   := fifelse(lon > -98, "east", "west")]
met_avg[, elev_cat := fifelse(elev > 252, "high", "low")]

# Using the cut() function to create categories within the ranges
met_avg[, vis_cat := cut(
  x      = vis.dist,
  breaks = c(0, 1000, 6000, 10000, Inf),
  labels = c("fog", "mist", "haze", "clear"),
  right  = FALSE
)]
```

The variables we will focus on for this example are `temp` and `rh` (temperature in °C and relative humidity in %).

---

## Basic Scatterplot

Here's how to create a basic plot in ggplot2:

```r
ggplot(data = met_avg) +
  geom_point(mapping = aes(x = temp, y = rh))
```

<img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="40%" style="display: block; margin: auto;" />

We see that as temperature increases, relative humidity decreases.

---

## Basic Scatterplot 2

`geom_point()` adds a layer of points to your plot, to create a scatterplot.

--

`ggplot2` comes with many geom functions that each add a different type of layer to a plot.

--

Each geom function in `ggplot2` takes a mapping argument. This defines how variables in your dataset are mapped to visual properties.

--

The mapping argument is always paired with `aes()`, and the x and y arguments of `aes()` specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case `met_avg`.

--

One common problem when creating ggplot2 graphics is to put the `+` in the wrong place: it has to come at the end of the line, not the start.
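
The `+` rule can be sketched as follows. This is a minimal illustration that assumes a small made-up data frame (`toy` is hypothetical and only stands in for the weather data):

```r
library(ggplot2)

# Hypothetical toy data (made-up values, not the weather data)
toy <- data.frame(
  temp = c(10, 15, 20, 25, 30),
  rh   = c(90, 80, 65, 50, 40)
)

# Correct: the line ends with `+`, so R knows the expression continues
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = temp, y = rh))

# Incorrect (left commented out): the first line is already a complete
# expression, so the `+` on the next line is not attached to the plot
# ggplot(data = toy)
#   + geom_point(mapping = aes(x = temp, y = rh))
```

In the commented-out version R evaluates `ggplot(data = toy)` by itself, which gives an empty plot, and then typically stops with an error on the dangling `+` line.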
---

## Coloring by a variable - using aesthetics

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the `region` variable to show which region (west or east) each observation comes from. `ggplot2` chooses colors, and adds a legend, automatically.

```r
ggplot(data = met_avg) +
  geom_point(mapping = aes(x = temp, y = rh, color = region))
```

<img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="40%" height="40%" style="display: block; margin: auto;" />

We see that humidity in the east is generally higher than in the west and that the hottest temperatures are in the west.

---

## Controlling point transparency using the "alpha" aesthetic

```r
ggplot(data = met_avg) +
  geom_point(mapping = aes(x = temp, y = rh, alpha = region))
```

<img src="slides_files/figure-html/unnamed-chunk-11-1.png" width="50%" style="display: block; margin: auto;" />

---

## Controlling point shape

```r
ggplot(data = met_avg) +
  geom_point(mapping = aes(x = temp, y = rh, shape = region))
```

<img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="50%" style="display: block; margin: auto;" />

Note that, by default, ggplot2 uses at most 6 shapes. If a variable has more than 6 categories, the extra groups are not plotted (ggplot2 does at least warn you).

---

## Manual control of aesthetics

To control an aesthetic manually, set it by name as an argument of your geom function, i.e. outside of `aes()`.

```r
ggplot(data = met_avg) +
  geom_point(mapping = aes(x = temp, y = rh), color = "blue")
```

<img src="slides_files/figure-html/unnamed-chunk-13-1.png" width="50%" style="display: block; margin: auto;" />

---

## Summary of aesthetics

The various aesthetics...
|code         | description               |
|-------------|:-------------------------:|
| x           | position on x-axis        |
| y           | position on y-axis        |
| shape       | shape                     |
| color       | color of element borders  |
| fill        | color inside of elements  |
| size        | size                      |
| alpha       | transparency              |
| linetype    | type of line              |

---

## Facets 1

Facets split a plot into panels, one per level of a variable, which makes them particularly useful for categorical variables.

```r
met_avg[!is.na(region)] %>%
  ggplot() +
  geom_point(mapping = aes(x = temp, y = rh, color = region)) +
  facet_wrap(~ region, nrow = 1)
```

<img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="40%" style="display: block; margin: auto;" />

---

## Facets 2

Or you can facet on two variables...

```r
met_avg[!is.na(region) & !is.na(elev_cat)] %>%
  ggplot() +
  geom_point(mapping = aes(x = temp, y = rh)) +
  facet_grid(region ~ elev_cat)
```

<img src="slides_files/figure-html/unnamed-chunk-15-1.png" width="40%" style="display: block; margin: auto;" />

---

## Geometric objects 1

Geometric objects control the type of plot you draw. So far we have used scatterplots (via `geom_point`). Now let's plot a smoothed line fitted to the data (and note how we make side-by-side plots).

```r
library(cowplot)

scatterplot <- ggplot(data = met_avg) +
  geom_point(mapping = aes(x = temp, y = rh))

lineplot <- ggplot(data = met_avg) +
  geom_smooth(mapping = aes(x = temp, y = rh))

# Arrange the two plots side by side, labelled A and B
plot_grid(scatterplot, lineplot, labels = "AUTO")
```

<img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="40%" style="display: block; margin: auto;" />

---

## Geometric objects 1

`cowplot` is a package by Claus Wilke. It "... is a simple add-on to `ggplot`. It provides various features that help with creating publication-quality figures, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots and or mix plots with images."

---

## Geometric objects 2

Note that not every aesthetic works with every geom function.
But now there are some new ones we can use.

```r
ggplot(data = met_avg) +
  geom_smooth(mapping = aes(x = temp, y = rh, linetype = region))
```

<img src="slides_files/figure-html/unnamed-chunk-17-1.png" width="30%" style="display: block; margin: auto;" />

Here the line type depends on the region. We clearly see that the east has higher rh than the west, and that rh generally decreases as temperature increases in both regions.

---

## Geometric objects 3

Histograms

```r
ggplot(met_avg) +
  geom_histogram(mapping = aes(x = temp))
```

<img src="slides_files/figure-html/unnamed-chunk-18-1.png" width="40%" style="display: block; margin: auto;" />

---

## Geometric objects 4

Boxplots

```r
met_avg[!is.na(elev_cat)] %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = elev_cat, y = temp, fill = elev_cat))
```

<img src="slides_files/figure-html/unnamed-chunk-19-1.png" width="40%" style="display: block; margin: auto;" />

---

## Geometric objects 5

Lineplots

```r
# Daily temperature for the station(s) at a single elevation
ggplot(data = met_avg[elev == 4113]) +
  geom_line(mapping = aes(x = day, y = temp))
```

<img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="40%" style="display: block; margin: auto;" />

---

## Geometric objects 6

Polygons

```r
world_map <- map_data("world")
ggplot(data = world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "darkgray", colour = "white")
```

<img src="slides_files/figure-html/unnamed-chunk-21-1.png" width="40%" style="display: block; margin: auto;" />

---

## Geometric objects 7

Polygons

```r
us_map <- map_data("state")
# group = group keeps the separate polygons of each state apart
ggplot(data = us_map, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(colour = "white")
```

<img src="slides_files/figure-html/unnamed-chunk-22-1.png" width="40%" style="display: block; margin: auto;" />

---

## Geoms - reference

ggplot2 provides over 40 geoms, and extension packages provide even more (see https://ggplot2.tidyverse.org/reference/ for a sampling).
The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

<img src="img/geoms.png" width="50%" style="display: block; margin: auto;" />

---

## Multiple geoms 1

Let's layer geoms:

```r
met_avg[!is.na(region)] %>%
  ggplot() +
  geom_point(mapping = aes(x = temp, y = rh, color = region)) +
  geom_smooth(mapping = aes(x = temp, y = rh, linetype = region))
```

<img src="slides_files/figure-html/unnamed-chunk-24-1.png" width="40%" style="display: block; margin: auto;" />

---

## Multiple geoms 2

We can avoid repeating aesthetics by passing a set of mappings to `ggplot()`. ggplot2 treats these as global mappings that apply to each geom in the graph.

```r
met_avg[!is.na(region)] %>%
  ggplot(mapping = aes(x = temp, y = rh, color = region, linetype = region)) +
  geom_point() +
  geom_smooth()
```

<img src="slides_files/figure-html/unnamed-chunk-25-1.png" width="30%" style="display: block; margin: auto;" />

---

## Multiple geoms 2

`geom_smooth()` has options. For example, if we want a linear regression line we add `method=lm`.

```r
met_avg[!is.na(region)] %>%
  ggplot(mapping = aes(x = temp, y = rh, color = region, linetype = region)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, col = "black")
```

<img src="slides_files/figure-html/unnamed-chunk-26-1.png" width="40%" style="display: block; margin: auto;" />

---

## Multiple geoms 3

If you place mappings in a geom function, `ggplot2` will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.
```r
met_avg[!is.na(region)] %>%
  ggplot(mapping = aes(x = temp, y = rh)) +
  geom_point(mapping = aes(color = region)) +
  geom_smooth()
```

<img src="slides_files/figure-html/unnamed-chunk-27-1.png" width="30%" style="display: block; margin: auto;" />

---

## Multiple geoms 4

You can use the same idea to make layers differ in other ways, for example by giving a geom its own `data` argument. Here the points are colored by visibility category, while a single smooth line is fitted to all of the plotted observations.

```r
met_avg[!is.na(vis_cat)] %>%
  ggplot(mapping = aes(x = temp, y = rh)) +
  # alpha is a fixed value, so it is set outside aes()
  geom_point(mapping = aes(color = vis_cat), alpha = 0.5) +
  geom_smooth(se = FALSE)
```

<img src="slides_files/figure-html/unnamed-chunk-28-1.png" width="30%" style="display: block; margin: auto;" />

---

## Statistical transformations - e.g. Bar charts

```r
met_avg %>%
  filter(!(vis_cat %in% NA)) %>%
  ggplot() +
  geom_bar(mapping = aes(x = vis_cat))
```

<img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="40%" style="display: block; margin: auto;" />

`geom_bar()` uses a built-in statistical transformation, called a "stat", to calculate the counts.

---

## Bar charts 2

You can override the stat a geom uses to construct its plot, e.g. if we want to plot proportions rather than counts:

```r
met_avg[!is.na(vis_cat)] %>%
  ggplot() +
  # after_stat() replaces the deprecated stat() helper
  geom_bar(mapping = aes(x = vis_cat, y = after_stat(prop), group = 1))
```

<img src="slides_files/figure-html/unnamed-chunk-30-1.png" width="40%" style="display: block; margin: auto;" />

---

## Coloring barcharts

You can colour a bar chart using either the colour aesthetic, or, more usefully, fill:

```r
met_avg[!is.na(vis_cat)] %>%
  ggplot() +
  geom_bar(mapping = aes(x = vis_cat, colour = vis_cat, fill = vis_cat))
```

<img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="40%" style="display: block; margin: auto;" />

---

## Coloring barcharts

More interestingly, you can fill by another variable (here, `region`). We also show that we can change the color scale.
```r
met_avg[!is.na(vis_cat) & vis_cat != "clear"] %>%
  ggplot() +
  geom_bar(mapping = aes(x = vis_cat, fill = region)) +
  scale_fill_viridis_d()
```

<img src="slides_files/figure-html/unnamed-chunk-32-1.png" width="40%" style="display: block; margin: auto;" />

---

## Coloring barcharts

`position = "dodge"` places overlapping objects directly beside one another. This makes it easier to compare individual values.

```r
met_avg[!is.na(vis_cat) & vis_cat != "clear"] %>%
  ggplot() +
  geom_bar(mapping = aes(x = elev_cat, fill = vis_cat), position = "dodge")
```

<img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="40%" style="display: block; margin: auto;" />

---

## Statistical transformations - another example

You might want to draw greater attention to the statistical transformation in your code. For example, you might use `stat_summary()`, which summarises the y values for each unique x value, to make the summary you are computing explicit:

```r
l <- met_avg[!is.na(vis_cat) & vis_cat != "clear"] %>%
  ggplot() +
  stat_summary(mapping = aes(x = vis_cat, y = temp),
    fun.min = min,
    fun.max = max,
    fun = median)
```

---

## Statistical transformations - another example

```r
l
```

<img src="slides_files/figure-html/unnamed-chunk-35-1.png" width="700px" style="display: block; margin: auto;" />

---

## Position adjustments

An option that can be very useful is `position = "jitter"`. It adds a small amount of random noise to each point, which spreads out points that would otherwise overlap.
```r
nojitter <- ggplot(data = met_avg[1:1000,]) +
  geom_point(mapping = aes(x = vis_cat, y = temp))

jitter <- ggplot(data = met_avg[1:1000,]) +
  geom_point(mapping = aes(x = vis_cat, y = temp), position = "jitter")
```

---

## Position adjustments

```r
plot_grid(nojitter, jitter, labels = "AUTO")
```

<img src="slides_files/figure-html/unnamed-chunk-38-1.png" width="700px" style="display: block; margin: auto;" />

---

## Coordinate systems

Coordinate systems are one of the more complicated corners of ggplot. To start with something simple, here's how to flip axes:

```r
unflipped <- ggplot(data = met_avg) +
  geom_boxplot(mapping = aes(x = vis_cat, y = temp))

flipped <- ggplot(data = met_avg) +
  geom_boxplot(mapping = aes(x = vis_cat, y = temp)) +
  coord_flip()
```

---

## Coordinate systems

```r
plot_grid(unflipped, flipped, labels = "AUTO")
```

<img src="slides_files/figure-html/unnamed-chunk-40-1.png" width="700px" style="display: block; margin: auto;" />

---

## Coordinate systems

You can also set a map-appropriate aspect ratio with `coord_quickmap()` and use polar coordinates with `coord_polar()`.
```r
bar <- ggplot(data = met_avg) +
  geom_bar(
    mapping = aes(x = elev_cat, fill = elev_cat),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()
```

<img src="slides_files/figure-html/unnamed-chunk-41-1.png" width="30%" style="display: block; margin: auto;" />

```r
bar + coord_polar()
```

<img src="slides_files/figure-html/unnamed-chunk-41-2.png" width="30%" style="display: block; margin: auto;" />

---

## Modifying labels

```r
ggplot(met_avg[!is.na(region)]) +
  geom_point(aes(temp, rh, color = region)) +
  labs(title = "Weather Station Data") +
  labs(x = expression("Temperature" *~ degree * C), y = "Relative Humidity")
```

<img src="slides_files/figure-html/unnamed-chunk-43-1.png" width="40%" style="display: block; margin: auto;" />

---

## Changing the Theme

```r
ggplot(met_avg[!is.na(region)]) +
  geom_point(aes(temp, rh, color = region)) +
  labs(title = "Weather Station Data") +
  labs(x = expression("Temperature" *~ degree * C), y = "Relative Humidity") +
  theme_bw(base_family = "Times")
```

<img src="slides_files/figure-html/unnamed-chunk-44-1.png" width="40%" style="display: block; margin: auto;" />

---

## Changing the Legend

```r
ggplot(met_avg[!is.na(region)]) +
  geom_point(aes(temp, rh, color = region)) +
  labs(title = "Weather Station Data",
       x = expression("Temperature" *~ degree * C),
       y = "Relative Humidity") +
  scale_color_manual(name = "Region",
                     labels = c("East", "West"),
                     values = c("east" = "lightblue", "west" = "purple")) +
  theme_bw(base_family = "Times")
```

<img src="slides_files/figure-html/unnamed-chunk-45-1.png" width="40%" style="display: block; margin: auto;" />

---

## Changing Colorscales

```r
ggplot(data = met_avg) +
  geom_point(mapping = aes(x = temp, y = rh, color = elev)) +
  scale_color_gradient(low = "blue", high = "red")
```

<img src="slides_files/figure-html/unnamed-chunk-46-1.png" width="50%" style="display: block; margin: auto;" />

---

## Changing Colorscales

```r
ggplot(data = met_avg) +
  # cut() bins elevation into 5 equal-width intervals
  geom_point(mapping = aes(x = temp, y = rh, color = cut(elev, breaks = 5))) +
  scale_color_manual(values = viridis::viridis(6))
```

<img src="slides_files/figure-html/unnamed-chunk-47-1.png" width="50%" style="display: block; margin: auto;" />

---

## A Great reference

A great (comprehensive) reference for everything you can do with ggplot2 is the R Graphics Cookbook:

https://r-graphics.org/

---

## Reminder - the ggplot2 cheatsheet

A briefer summary can be found here:

https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

RStudio has a variety of other great cheatsheets.

---

## Maps with leaflet

Let's create a map of monthly average temperatures at each of the weather stations and colour the points by a temperature gradient. We need to create a colour palette and we can add a legend.
```r
library(leaflet)

# Average temperature and location per station
met_avg2 <- met[,.(
  temp = mean(temp, na.rm = TRUE),
  lat  = mean(lat),
  lon  = mean(lon)
), by = c("USAFID")]
met_avg2 <- met_avg2[!is.na(temp)]

# Generating a color palette
temp.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=met_avg2$temp)
temp.pal
```

```
## function (x) 
## {
##     if (length(x) == 0 || all(is.na(x))) {
##         return(pf(x))
##     }
##     if (is.null(rng)) 
##         rng <- range(x, na.rm = TRUE)
##     rescaled <- scales::rescale(x, from = rng)
##     if (any(rescaled < 0 | rescaled > 1, na.rm = TRUE)) 
##         warning("Some values were outside the color scale and will be treated as NA")
##     if (reverse) {
##         rescaled <- 1 - rescaled
##     }
##     pf(rescaled)
## }
## <bytecode: 0x7fceccceea70>
## <environment: 0x7fcecccf04f8>
## attr(,"colorType")
## [1] "numeric"
## attr(,"colorArgs")
## attr(,"colorArgs")$na.color
## [1] "#808080"
```

---

## Maps with leaflet

For the tile providers, take a look at this site: https://leaflet-extras.github.io/leaflet-providers/preview/

```r
tempmap <- leaflet(met_avg2) %>%
  # The looks of the Map
  addProviderTiles('CartoDB.Positron') %>%
  # Some circles
  addCircles(
    lat = ~lat, lng = ~lon,
    label = ~paste0(round(temp, 2), ' C'),
    # HERE IS OUR PAL!
    color = ~ temp.pal(temp),
    opacity = 1, fillOpacity = 1, radius = 500
  ) %>%
  # And a pretty legend
  addLegend('bottomleft', pal = temp.pal, values = met_avg2$temp,
            title = 'Temperature, C', opacity = 1)
```

---

## Maps with leaflet

```r
tempmap
```
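
---

## Maps with leaflet

To see what the palette object does, here is a small sketch on made-up numbers. The vector `temps` and the name `pal` are hypothetical and only stand in for `met_avg2$temp` and `temp.pal`:

```r
library(leaflet)

# Hypothetical temperatures in C (made-up values, not station data)
temps <- c(5, 15, 25, 35)

# Same three-color ramp as temp.pal, with the domain set to the toy values
pal <- colorNumeric(c('darkgreen', 'goldenrod', 'brown'), domain = temps)

# The palette is a function: numbers in, one hex color string per value out
cols <- pal(temps)
cols
```

This is why `color = ~ temp.pal(temp)` works inside `addCircles()`: the formula is evaluated on the data and returns one color per station.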