Lab 7 - Web Scraping and Regular Expressions

Learning goals

  • Use real-time data pulled from the internet.
  • Use regular expressions to parse the information.
  • Practice your GitHub skills.

Lab description

In this lab, we will be working with the NCBI API to make queries and extract information using XML and regular expressions. For this lab, we will be using the httr, xml2, and stringr R packages.
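If you have not installed these packages yet, install.packages(c("httr", "xml2", "stringr")) will do so; then load them at the top of your document:

library(httr)    # HTTP requests (for querying the NCBI API)
library(xml2)    # parsing HTML/XML and running XPath queries
library(stringr) # regular-expression helpers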

Question 1: How many SARS-CoV-2 papers?

We will build an automatic counter of SARS-CoV-2 papers available on PubMed. You will need to apply XPath, as we did during the lecture, to extract the number of results returned by PubMed when you search for “sars-cov-2”.

The following URL will perform the search: https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2

You can find the total number of results in the top-left corner of the search results page.

Complete the lines of code:

# Downloading the website
website <- xml2::read_html("[URL]")

# Finding the counts
counts <- xml2::xml_find_first(website, "[XPath]")

# Turning it into text
counts <- as.character(counts)

# Extracting the data using regex
totalcount <- stringr::str_extract(counts, "[REGEX FOR NUMBERS WITH COMMAS/DOTS]")

# Removing any commas/dots so that we can convert to numeric
totalcount <- gsub('[REGEX FOR COMMAS/DOTS]', '', totalcount)

totalcount <- as.numeric(totalcount)
print(totalcount)
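For reference, here is one way the blanks could be filled in. Treat this as a sketch: the XPath assumes the result count sits in a span with class “value”, which was true when this lab was written, but PubMed’s markup changes over time, so inspect the page source to confirm before relying on it.

# Downloading the website
website <- xml2::read_html("https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2")

# Finding the counts (assumes the count sits in <span class="value">)
counts <- xml2::xml_find_first(website, "//span[@class='value']")

# Turning it into text
counts <- as.character(counts)

# Extracting the data using regex: digits, possibly grouped by commas or dots
totalcount <- stringr::str_extract(counts, "[0-9][0-9,.]*")

# Removing any commas/dots so that we can convert to numeric
totalcount <- gsub('[,.]', '', totalcount)

totalcount <- as.numeric(totalcount)
print(totalcount)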

Question 2: Get article abstracts and authors

That’s quite a few articles, so let’s narrow our focus by including “california” in our search:

https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2%20california

In your web browser, use the slider on the left to narrow your search down to just the years 2020 and 2021.

Now we will download the abstracts and author information for all of these articles. Under the search bar, click “Save,” set the Selection to “All results,” set the Format to “Abstract (text),” and click “Create file.” This should start downloading a large-ish text file (abstract-sars-cov-2-set.txt).

We can read this into R with the following code. We need to do some formatting because the abstracts are separated by two blank lines.

# read in the text file; each line is a separate string in a character vector
abstracts <- readLines('~/Downloads/abstract-sars-cov-2-set.txt', warn = FALSE)
# combine all lines into one character string
abstracts <- paste(abstracts, collapse = '\n')
# split the text whenever 3 new lines occur in a row (indicating two blank lines)
abstracts <- unlist(strsplit(abstracts, split = '\n\n\n'))
# replace any remaining "\n" symbols with spaces
abstracts <- gsub("\n", " ", abstracts)
# replace multiple spaces with single space
abstracts <- gsub(" +", " ", abstracts)

Question 3: Distribution of universities

Look through the first couple abstracts and see how the author affiliations are formatted.

Using the function stringr::str_extract_all() applied to abstracts, capture all the terms of the form:

  1. “… University”
  2. “University of …”
  3. “… Institute of …”

Write a regular expression that captures all such instances.

library(stringr)
institution <- str_extract_all(
  abstracts,
  "[REGEX FOR INSTITUTIONS]"
  ) 
institution <- unlist(institution)
table(institution)
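As a starting point, here is a deliberately simple sketch of such a pattern:

institution <- str_extract_all(
  abstracts,
  "[A-Za-z]+ University|University of [A-Za-z]+|[A-Za-z]+ Institute of [A-Za-z]+"
  )
institution <- unlist(institution)
table(institution)

Note its limitations: it captures only one word on either side, so “University of Southern California” is truncated to “University of Southern”, while allowing spaces in the character class (e.g. [A-Za-z ]+) fixes that at the cost of over-matching elsewhere. That trade-off is exactly what the discussion below should address.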

What are the 10 most common institutions? Discuss how you could improve these results.

Question 4: Make a tidy dataset

We want to build a dataset that includes the journal, article title, authors, and affiliations for each paper. To do this, we will go back to the original input, but skip the last two formatting steps from before so that the newline structure within each entry is preserved.

# read in the text file; each line is a separate string in a character vector
abstracts <- readLines('~/Downloads/abstract-sars-cov-2-set.txt', warn = FALSE)
# combine all lines into one character string
abstracts <- paste(abstracts, collapse = '\n')
# split the text whenever 3 new lines occur in a row (indicating two blank lines)
abstracts <- unlist(strsplit(abstracts, split = '\n\n\n'))

Now extract the journal title for each article. This is in the first line of each entry, immediately after the citation number (i.e., “1.”). Notice that we are using str_extract rather than str_extract_all because we only want (at most) one result per entry.

journal <- str_extract(abstracts, "[YOUR REGULAR EXPRESSION]")
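One pattern that works on this format is sketched below; it assumes the journal abbreviation contains no internal periods, which is generally true of PubMed’s NLM-style citation lines.

journal <- str_extract(abstracts, "(?<=\\d\\. )[^.]+")

The look-behind skips past the citation number, and [^.]+ then captures everything up to the period that ends the journal name.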

Now we’re going to extract the title of each article. This is the second block of text in each entry, with blank lines before and after. We could do this with a regular expression, but instead we’ll use the strsplit() function, as we did above.

titles <- sapply(abstracts, function(x){
  unlist(strsplit(x, split = "\n\n"))[2]
}, USE.NAMES = FALSE)

Note that each title could still contain a single \n symbol, in the case of a particularly long title. We could remove those as we did the first time we read in the data.
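For example:

titles <- gsub("\n", " ", titles)
titles <- gsub(" +", " ", titles)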

Use the same technique to extract the list of authors and call this object authors.
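A sketch, assuming the author list is always the third blank-line-separated block in each entry (skim a few entries to confirm this holds):

authors <- sapply(abstracts, function(x){
  unlist(strsplit(x, split = "\n\n"))[3]
}, USE.NAMES = FALSE)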

Now use a regular expression to extract the author affiliations section for each abstract. This always starts with a blank line followed by “Author information:” and ends with another blank line.

affiliations <- str_extract(abstracts, "[YOUR REGULAR EXPRESSION]")
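One way to write this (a sketch) is to use stringr::regex() with dotall = TRUE, so that . also matches newlines, together with a lazy quantifier so the match stops at the first blank line after the section begins:

affiliations <- str_extract(
  abstracts,
  regex("\\n\\nAuthor information:.*?\\n\\n", dotall = TRUE)
)

The match still carries the surrounding newlines; stringr::str_squish() is a convenient way to clean those up.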

Finally, put everything together into a single data.frame and use knitr::kable() to print the first five results.

papers <- data.frame(
  [DATA TO COMBINE]
)
knitr::kable(papers[1:5, ])
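With the objects built above, the completed call looks like this:

papers <- data.frame(
  journal      = journal,
  title        = titles,
  authors      = authors,
  affiliations = affiliations
)
knitr::kable(papers[1:5, ])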

Done! Render the document, commit, and push.