Data Visualization

PM 566: Introduction to Health Data Science

Acknowledgment

These slides were originally developed by Meredith Franklin (and Paul Marjoram) and modified by George G. Vega Yon and Kelly Street.

Background

This lecture provides an introduction to R’s basic plotting functions as well as the ggplot2 package.

This section is based on chapter 3 of “R for Data Science”.

Background

ggplot2 is part of the Tidyverse. The tidyverse is…“an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” (https://www.tidyverse.org/)

library(tidyverse)
library(data.table)

ggplot2

ggplot2 is designed on the principle of adding layers.

ggplot2

  • With ggplot2 a plot is initiated with the function ggplot()
  • The first argument of ggplot() is the dataset to use in the graph
  • Layers are added to ggplot() with +
  • Layers include geom functions such as geom_point(), geom_line(), etc.
  • Each geom function takes a mapping argument, which is always paired with aes()
  • Inside aes(), x and y specify which variables are mapped to the x and y axes of the plot
ggplot(data = data) +
    geom_function(mapping = aes(mappings))

Data

Continuing with the weather data from last week, let’s take the daily averages at each site, keeping some of the variables.

# Reading the data, filtering, and replacing NAs
met <- fread(file.path('..','03-exploratory','met_all.gz'))
met <- met[met$temp > -10][elev == 9999.0, elev := NA]

# Creating a smaller version of the dataset, averaged by day and weather station
met_avg <- met[,.(
  temp     = mean(temp,na.rm=TRUE),
  rh       = mean(rh,na.rm=TRUE),
  wind.sp  = mean(wind.sp,na.rm=TRUE),
  vis.dist = mean(vis.dist,na.rm=TRUE),
  lat      = mean(lat),
  lon      = mean(lon), 
  elev     = mean(elev,na.rm=TRUE)
), by=c("USAFID", "day")]

Let’s also create a new variable for region (east and west), categorize elevation, and create a multi-category variable for visibility for exploratory purposes.

# New sets of variables using "fast" ifelse from data.table
met_avg[, region   := fifelse(lon > -98, "east", "west")]
met_avg[, elev_cat := fifelse(elev > 252, "high", "low")]

# Using the CUT function to create categories within the ranges
met_avg[, vis_cat  := cut(
  x      = vis.dist,
  breaks = c(0, 1000, 6000, 10000, Inf),
  labels = c("fog", "mist", "haze", "clear"),
  right  = FALSE
)]
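
To see how breaks and right = FALSE interact, here is a small sketch on made-up visibility distances (the values are illustrative and not taken from the met data). With right = FALSE each interval is closed on the left, so a value that sits exactly on a break falls into the higher category.

```r
# Hypothetical distances, chosen to sit on and around the breaks
vis_example <- c(0, 999, 1000, 5999, 6000, 10000, 20000)

vis_example_cat <- cut(
  x      = vis_example,
  breaks = c(0, 1000, 6000, 10000, Inf),
  labels = c("fog", "mist", "haze", "clear"),
  right  = FALSE   # intervals are [0, 1000), [1000, 6000), [6000, 10000), [10000, Inf)
)

# Show each value next to the category it was assigned
data.frame(vis_example, vis_example_cat)
```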

The variables we will focus on for this example are temp and rh (temperature in °C and relative humidity in %).

Basic Scatterplot

plot(met_avg$temp, met_avg$rh)

Basic Scatterplot

Here’s how to create a basic plot in ggplot2

ggplot(data = met_avg) + 
  geom_point(mapping = aes(x = temp, y = rh))

We see that as temperature increases, relative humidity decreases.

Basic Scatterplot 2

  • geom_point() adds a layer of points to your plot, to create a scatterplot.
  • ggplot2 comes with many geom functions that each add a different type of layer to a plot.
  • Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties.
  • The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, met_avg
  • One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start.
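
The point about where the + goes can be sketched on a tiny made-up data frame (toy stands in for met_avg, and its values are invented). The incorrect version is left commented out so the block still runs.

```r
library(ggplot2)

# Toy data standing in for met_avg (values are made up)
toy <- data.frame(temp = c(10, 15, 20, 25, 30), rh = c(90, 80, 65, 50, 40))

# Correct: the first line ends with +, so R knows the expression continues
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = temp, y = rh))

# Incorrect (commented out): the first line is already a complete expression,
# so the second line is evaluated on its own and fails
# ggplot(data = toy)
#   + geom_point(mapping = aes(x = temp, y = rh))

p
```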

Coloring by a variable

You can map the colors of your points to the region variable to show which region (west or east) each point comes from. In base plot, we need to convert the “character” variable into a “factor” variable in order to color by it.

plot(met_avg$temp, met_avg$rh, col = factor(met_avg$region))

We see that humidity in the east is generally higher than in the west and that the hottest temperatures are in the west.

Coloring by a variable - using aesthetics

Alternatively, ggplot2 can color by a “character” variable and adds a legend automatically.

ggplot(data = met_avg) + 
  geom_point(mapping = aes(x = temp, y = rh, color = region))

Controlling point transparency using the “alpha” aesthetic

ggplot(data = met_avg) + 
  # A constant is set outside aes(); inside aes() it would be treated as a mapped variable
  geom_point(mapping = aes(x = temp, y = rh), alpha = 0.3)

Controlling point transparency using the “alpha” aesthetic

ggplot(data = met_avg) + 
  geom_point(mapping = aes(x = temp, y = rh, alpha = rh))

The base plot alternative is to use the alpha function from the scales package.
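
A minimal sketch of that base plot alternative, assuming the scales package is installed (it comes along with ggplot2). The toy data frame is made up and only stands in for met_avg.

```r
# Toy data standing in for met_avg (values are made up)
toy <- data.frame(temp = c(10, 15, 20, 25, 30), rh = c(90, 80, 65, 50, 40))

# scales::alpha() returns the color with a transparency channel added
semi_black <- scales::alpha("black", 0.3)

# Any base plot color argument accepts the semi-transparent color
plot(toy$temp, toy$rh, pch = 16, col = semi_black)
```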

Controlling point shape:

ggplot(data = met_avg) + 
  geom_point(mapping = aes(x = temp, y = rh, shape = region))

Note that, by default, ggplot uses up to 6 shapes. If there are more, some of your data is not plotted!! (At least it warns you.) In base plot, point shape is controlled by the pch (“plotting character”) argument.
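
In base plot, one way to get a different shape per group is to look up a pch value for each row. This is only a sketch on made-up data; the shapes vector is a hypothetical helper, not something from the original slides.

```r
# Toy data standing in for met_avg (values are made up)
toy <- data.frame(
  temp   = c(10, 15, 20, 25, 30, 35),
  rh     = c(90, 80, 65, 50, 40, 30),
  region = c("east", "east", "east", "west", "west", "west")
)

# One plotting character per region: 16 = filled circle, 17 = filled triangle
shapes <- c(east = 16, west = 17)

# Index the named vector by region to get one pch value per point
plot(toy$temp, toy$rh, pch = shapes[toy$region])
legend("topright", legend = names(shapes), pch = shapes)
```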

Manual control of aesthetics

To control aesthetics manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes().

ggplot(data = met_avg) + 
  geom_point(mapping = aes(x = temp, y = rh), color = "blue")

Equivalent to col = "blue" in base plot.

Summary of aesthetics

code        description
x           position on x-axis
y           position on y-axis
shape       shape
color       color of element borders
fill        color inside of elements
size        size
alpha       transparency
linetype    type of line

Base plot equivalents

code              description
first arg / x     position on x-axis
second arg / y    position on y-axis
pch               shape
col               color of element borders
bg                color inside of elements (for pch 21 to 25)
cex               size
scales::alpha     transparency
lty               type of line

Add points to plot

With base plot, you can add points to an existing plot with points(), which takes the same arguments as plot() for plotting points.

plot(1:10, pch = 16)
points(10:1, pch = 16, col = 2)

Facets 1

Facets are particularly useful for categorical variables and ggplot makes them quite easy.

met_avg[!is.na(region)] |> 
  ggplot() + 
  geom_point(mapping = aes(x = temp, y = rh, color=region)) + 
  facet_wrap(~ region, nrow = 1)

Facets 2

Or you can facet on two variables…

met_avg[!is.na(region) & !is.na(elev_cat)] |> 
  ggplot() + 
  geom_point(mapping = aes(x = temp, y = rh)) + 
  facet_grid(region ~ elev_cat)

Facets 3

Base plot is not good at this! You can make multiple plots within a single plotting window by utilizing the layout() function, but you will still have to make each plot manually.

layout(matrix(1:2, nrow=1))
plot(met_avg$temp[which(met_avg$region == 'east')], met_avg$rh[which(met_avg$region == 'east')], pch = 16, col = 2)
plot(met_avg$temp[which(met_avg$region == 'west')], met_avg$rh[which(met_avg$region == 'west')], pch = 16, col = 4)
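
If there are more than a couple of groups, a loop saves some typing, although each panel is still drawn by its own plot() call. A rough sketch on made-up data (toy stands in for met_avg):

```r
# Toy data standing in for met_avg (values are made up)
toy <- data.frame(
  temp   = c(10, 15, 20, 25, 30, 35),
  rh     = c(90, 80, 65, 50, 40, 30),
  region = c("east", "east", "east", "west", "west", "west")
)
regions <- sort(unique(toy$region))

# One panel per region, side by side
layout(matrix(seq_along(regions), nrow = 1))
for (r in regions) {
  keep <- toy$region == r
  # Shared axis limits keep the panels comparable, as facets would
  plot(toy$temp[keep], toy$rh[keep], pch = 16, main = r,
       xlim = range(toy$temp), ylim = range(toy$rh),
       xlab = "temp", ylab = "rh")
}
layout(1)  # reset to a single panel
```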

Geometric objects 1

Geometric objects are used to control the type of plot you draw. So far we have used scatterplots (via geom_point). Now let’s try plotting a smoothed line fitted to the data (note how we arrange the two plots side by side).

library(cowplot)

scatterplot <- ggplot(data = met_avg) + geom_point(mapping = aes(x = temp, y = rh))
lineplot    <- ggplot(data = met_avg) + geom_smooth(mapping = aes(x = temp, y = rh))

plot_grid(scatterplot, lineplot, labels = "AUTO")

Geometric objects 1

cowplot is a package by Claus Wilke. It “… is a simple add-on to ggplot. It provides various features that help with creating publication-quality figures, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots and or mix plots with images.”

Geometric objects 1

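Although ggplot glosses over it, geom_smooth() picks a smoothing method for you: by default it uses LOESS when there are fewer than 1,000 observations and a GAM otherwise. To draw a LOESS curve in base plot, we have to fit the model ourselves beforehand: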

smooth <- loess(met_avg$rh ~ met_avg$temp)

plot(met_avg$temp, met_avg$rh)
lines(smooth$x[order(smooth$x)], smooth$fitted[order(smooth$x)], col = 2, lwd = 2)

Note the lines function, which adds lines to an existing plot. We have to order the values, because otherwise they remain in the same (unsorted) order as the original dataset.

Geometric objects 2

Note that not every aesthetic works with every geom function. But now there are some new ones we can use.

ggplot(data = met_avg) + 
  geom_smooth(mapping = aes(x = temp, y = rh, linetype = region))

Here the line type depends on the region. The east clearly has higher rh than the west, but in both regions humidity generally decreases as temperature increases.

Geometric objects 3

Histograms

hist(met_avg$temp)

Geometric objects 3

Histograms

ggplot(met_avg) + 
  geom_histogram(mapping = aes(x = temp))

Geometric objects 4

Boxplots

boxplot(met_avg$temp ~ met_avg$elev_cat)

Geometric objects 4

Boxplots

met_avg[!is.na(elev_cat)] |> 
  ggplot()+
  geom_boxplot(mapping=aes(x=elev_cat, y=temp, fill=elev_cat)) 

Geometric objects 5

Lineplots

plot(met_avg$day[met_avg$elev==4113], met_avg$temp[met_avg$elev==4113], type = 'l')

Geometric objects 5

Just as you can add points to an existing plot, you can also add lines with lines().

plot(1:10, pch = 16)
lines(10:1, col = 2, lwd = 3) # add a thick red line

Geometric objects 5

Lineplots

ggplot(data = met_avg[elev==4113])+
 geom_line(mapping=aes(x=day, y=temp))

Geometric objects 5

Polygons

world_map <- map_data("world")
ggplot(data = world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "darkgray", colour = "white")

Geometric objects 5

Polygons

us_map <- map_data("state")
ggplot(data = us_map, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(colour = "white")

Geometric objects 5

To plot this map data with base plot, we first have to set up an empty plot that spans the range of the coordinates, and then draw each polygon onto it:

plot(range(us_map$long), range(us_map$lat), asp = 1, col = 'white')
for(i in unique(us_map$group)){
  subset <- us_map[us_map$group == i, ]
  polygon(subset$long, subset$lat)
}

Note the argument asp = 1 which sets the correct aspect ratio of 1:1.

Geoms - reference

ggplot2 provides over 40 geoms, and extension packages provide even more (see https://ggplot2.tidyverse.org/reference/ for a sampling).

The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

Multiple geoms 1

Let’s layer geoms

met_avg[!is.na(region)] |>
  ggplot() + 
  geom_point(mapping = aes(x = temp, y = rh, color = region))+
  geom_smooth(mapping = aes(x = temp, y = rh, linetype = region))

Multiple geoms 2

We can avoid repetition of aesthetics by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph.

met_avg[!is.na(region)] |>
  ggplot(mapping = aes(x = temp, y = rh, color=region, linetype=region)) +
  geom_point() + 
  geom_smooth()

Multiple geoms 2

geom_smooth() has options. For example, if we want a linear regression line, we add method = lm.

met_avg[!is.na(region)] |>
  ggplot(mapping = aes(x = temp, y = rh, color = region, linetype = region)) +
  geom_point() + 
  geom_smooth(method = lm, se = FALSE, col = "black")

Multiple geoms 3

If you place mappings in a geom function, ggplot2 will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.

met_avg[!is.na(region)] |>
  ggplot(mapping = aes(x = temp, y = rh)) + 
  geom_point(mapping = aes(color = region)) + 
  geom_smooth()

Multiple geoms 4

You can use the same idea to give each layer its own aesthetics, or even its own data through the layer's data argument. Here, the smooth line is fit to all of the plotted points together, while the points are colored by visibility.

met_avg[!is.na(vis_cat)] |>
  ggplot(mapping = aes(x = temp, y = rh)) + 
  geom_point(mapping = aes(color = vis_cat), alpha = 0.5) + 
  geom_smooth(se = FALSE)
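
To give one layer genuinely different data, pass a data argument to that geom. The sketch below uses simulated data (toy is made up, not the met data) and a linear fit, purely to keep the example small: the points show both regions, while the line is fit to the east rows only.

```r
library(ggplot2)

# Simulated stand-in for met_avg (values are made up)
set.seed(1)
toy <- data.frame(
  temp   = runif(60, 5, 35),
  region = rep(c("east", "west"), each = 30)
)
toy$rh <- 100 - 2 * toy$temp + rnorm(60, sd = 5)

p <- ggplot(data = toy, mapping = aes(x = temp, y = rh)) +
  # This layer inherits the full toy data from ggplot()
  geom_point(mapping = aes(color = region)) +
  # This layer gets its own data: the line is fit to the east rows only
  geom_smooth(data = toy[toy$region == "east", ],
              method = "lm", formula = y ~ x, se = FALSE)

p
```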

Statistical transformations - e.g. Bar charts

Let’s say we want to know the frequencies of the different visibility categories.

tab <- table(met_avg$vis_cat)
barplot(tab)

Statistical transformations - e.g. Bar charts

met_avg |>
  filter(!is.na(vis_cat)) |> 
  ggplot() + 
  geom_bar(mapping = aes(x = vis_cat))

The algorithm uses a built-in statistical transformation, called a “stat”, to calculate the counts.
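
One way to see what the stat computed is layer_data(), which returns a layer's data after the statistical transformation has run. A small sketch on made-up categories (toy is invented for illustration):

```r
library(ggplot2)

# Toy categories: 1 fog, 2 mist, 3 haze (made up)
toy <- data.frame(
  vis_cat = factor(c("fog", "mist", "mist", "haze", "haze", "haze"),
                   levels = c("fog", "mist", "haze"))
)

p <- ggplot(toy) +
  geom_bar(mapping = aes(x = vis_cat))

# The count column holds the bar heights that the stat calculated
bar_counts <- layer_data(p, 1)$count
bar_counts

# The same numbers computed by hand, as in the base plot version
table(toy$vis_cat)
```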

Bar charts 2

You can override the stat a geom uses to construct its plot. For example, if we want to plot proportions rather than counts:

met_avg[!is.na(vis_cat)] |>
  ggplot() + 
  geom_bar(mapping = aes(x = vis_cat, y = after_stat(prop), group = 1))
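
Why group = 1? Proportions are computed within groups, and by default each bar is its own group, so every proportion would come out as 1. The sketch below compares the two on made-up data. It is written with after_stat(), which recent versions of ggplot2 use in place of stat().

```r
library(ggplot2)

# Toy categories: 1 fog, 2 mist, 3 haze (made up)
toy <- data.frame(
  vis_cat = factor(c("fog", "mist", "mist", "haze", "haze", "haze"),
                   levels = c("fog", "mist", "haze"))
)

# With group = 1, all bars share one group, so proportions sum to 1
with_group <- ggplot(toy) +
  geom_bar(mapping = aes(x = vis_cat, y = after_stat(prop), group = 1))

# Without it, each bar is its own group and every proportion is 1
without_group <- ggplot(toy) +
  geom_bar(mapping = aes(x = vis_cat, y = after_stat(prop)))

prop_with    <- layer_data(with_group, 1)$prop
prop_without <- layer_data(without_group, 1)$prop
prop_with
prop_without
```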

Coloring barcharts

You can colour a bar chart using either the colour aesthetic, or, more usefully, fill:

met_avg[!is.na(vis_cat)] |>
  ggplot() + 
  geom_bar(mapping = aes(x = vis_cat, colour = vis_cat, fill=vis_cat))

Coloring barcharts

More interestingly, you can fill by another variable (here, ‘region’). We also show that we can change the color scale.

met_avg[!is.na(vis_cat) & vis_cat != "clear"] |>
  ggplot() + 
  geom_bar(mapping = aes(x = vis_cat, fill = region))+
  scale_fill_viridis_d() 

Coloring barcharts

tab <- table(met_avg$region[!is.na(met_avg$vis_cat) & met_avg$vis_cat != "clear"],
             met_avg$vis_cat[!is.na(met_avg$vis_cat) & met_avg$vis_cat != "clear"])
barplot(tab, col = c(2,4))

Coloring barcharts

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

met_avg[!is.na(vis_cat) & vis_cat != "clear"] |>
  ggplot() + 
  geom_bar(mapping = aes(x = elev_cat, fill = vis_cat), position = "dodge")

Statistical transformations - another example

You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:

l <- met_avg[!is.na(vis_cat) & vis_cat != "clear"] |>
  ggplot() + 
  stat_summary(
    mapping = aes(x = vis_cat, y = temp),
    fun.min = min,
    fun.max = max,
    fun     = median
  )

Statistical transformations - another example

l

Position adjustments

An option that can be very useful is position = "jitter". This adds a small amount of random noise to each point. This spreads out points that might otherwise be overlapping.

nojitter <- ggplot(data = met_avg[1:1000,]) + 
  geom_point(mapping = aes(x = vis_cat, y = temp))

jitter <- ggplot(data = met_avg[1:1000,]) + 
  geom_point(mapping = aes(x = vis_cat, y = temp), position = "jitter")
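
If you want more control, position = "jitter" is shorthand for position_jitter(), which lets you choose how much noise to add and fix a seed so that the plot looks the same every time. A sketch on made-up data (the seed value is arbitrary):

```r
library(ggplot2)

# Toy data with heavy overplotting: five identical points per category
toy <- data.frame(
  vis_cat = rep(c("fog", "mist"), each = 5),
  temp    = rep(c(10, 20), each = 5)
)

p <- ggplot(data = toy) +
  geom_point(
    mapping  = aes(x = vis_cat, y = temp),
    # Jitter horizontally only, by at most 0.2, and reproducibly
    position = position_jitter(width = 0.2, height = 0, seed = 566)
  )

# Positions after jittering: x moved a little, y untouched
jittered <- layer_data(p, 1)
jittered[, c("x", "y")]
```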

Position adjustments

plot_grid(nojitter, jitter, labels = "AUTO")

Coordinate systems

Coordinate systems are one of the more complicated corners of ggplot. To start with something simple, here’s how to flip axes:

unflipped <- ggplot(data = met_avg) + 
  geom_boxplot(mapping = aes(x = vis_cat, y = temp))

flipped <- ggplot(data = met_avg) + 
  geom_boxplot(mapping = aes(x = vis_cat, y = temp)) +
  coord_flip()

Coordinate systems

plot_grid(unflipped, flipped, labels = "AUTO")

Coordinate systems

You can also set an appropriate aspect ratio for maps with coord_quickmap() and use polar coordinates with coord_polar().

bar <- ggplot(data = met_avg) + 
  geom_bar(mapping = aes(x = elev_cat, fill = elev_cat), show.legend = FALSE, width = 1) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

Coordinate systems

bar + coord_flip()

bar + coord_polar()

Modifying labels

ggplot(met_avg[!is.na(region)]) +
  geom_point(aes(temp, rh, color = region)) + 
  labs(title = "Weather Station Data") + 
  labs(x = expression("Temperature" *~ degree * C), y = "Relative Humidity")

Changing the Theme

ggplot(met_avg[!is.na(region)]) +
  geom_point(aes(temp, rh, color = region)) + 
  labs(title = "Weather Station Data") + 
  labs(x = expression("Temperature"*~degree*C), y = "Relative Humidity")+
  theme_bw(base_family = "Times")

Changing the Legend

ggplot(met_avg[!is.na(region)]) +
  geom_point(aes(temp, rh, color = region)) + 
  labs(title = "Weather Station Data",x = expression("Temperature"*~degree*C), y = "Relative Humidity")+
  scale_color_manual(name="Region", labels=c("East", "West"), values=c("east"="lightblue", "west"="purple"))+
  theme_bw(base_family = "Times")

Changing Colorscales

ggplot(data = met_avg) + 
  geom_point(mapping=aes(x=temp, y=rh, color=elev))+
  scale_color_gradient(low="blue", high="red")

Changing Colorscales

ggplot(data=met_avg) + 
  geom_point(mapping= aes(x=temp, y=rh, color = cut(elev, breaks = 5))) + 
  scale_color_manual(values = viridis::viridis(6))

A Great reference

A great (comprehensive) reference for everything you can do with ggplot2 is the R Graphics Cookbook:

https://r-graphics.org/

Reminder - the ggplot2 cheatsheet

A briefer summary can be found here:

https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

RStudio has a variety of other great cheatsheets.

Maps with leaflet

Let’s create a map of monthly average temperatures at each of the weather stations and colour the points by a temperature gradient. We need to create a colour palette and we can add a legend.

library(leaflet)
met_avg2 <- met[,.(temp = mean(temp,na.rm=TRUE), lat = mean(lat), lon = mean(lon)),  by=c("USAFID")]
met_avg2 <- met_avg2[!is.na(temp)]

# Generating a color palette
temp.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=met_avg2$temp)
temp.pal
function (x) 
{
    if (length(x) == 0 || all(is.na(x))) {
        return(pf(x))
    }
    if (is.null(rng)) 
        rng <- range(x, na.rm = TRUE)
    rescaled <- scales::rescale(x, from = rng)
    if (any(rescaled < 0 | rescaled > 1, na.rm = TRUE)) 
        warning("Some values were outside the color scale and will be treated as NA")
    if (reverse) {
        rescaled <- 1 - rescaled
    }
    pf(rescaled)
}
<bytecode: 0x7fc652072ac8>
<environment: 0x7fc6520750b0>
attr(,"colorType")
[1] "numeric"
attr(,"colorArgs")
attr(,"colorArgs")$na.color
[1] "#808080"

Maps with leaflet

For the tile providers, take a look at this site: https://leaflet-extras.github.io/leaflet-providers/preview/

tempmap <- leaflet(met_avg2) |> 
  # The looks of the Map
  addProviderTiles('CartoDB.Positron') |> 
  # Some circles
  addCircles(
    lat = ~lat, lng = ~lon,
    label = ~paste0(round(temp, 2), ' C'),
    color = ~temp.pal(temp),  # here is our palette
    opacity = 1, fillOpacity = 1, radius = 500
  ) |>
  # And a pretty legend
  addLegend('bottomleft', pal=temp.pal, values=met_avg2$temp,
          title='Temperature, C', opacity=1)

Maps with leaflet

tempmap