Lab 7 - Web Scraping and Regular Expressions
Learning goals
- Use real-time data pulled from the internet.
- Use regular expressions to parse the information.
- Practice your GitHub skills.
Lab description
In this lab, we will be working with the NCBI API to make queries and extract information using XML and regular expressions. For this lab, we will be using the httr, xml2, and stringr R packages.
Question 1: How many Sars-Cov-2 papers?
We will build an automatic counter of Sars-Cov-2 papers available on PubMed. You will need to apply XPath as we did during the lecture to extract the number of results returned by PubMed when you search for “sars-cov-2.”
The following URL will perform the search: https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2
And you can find the total number of results in the top left corner of the search results.
Complete the lines of code:

# Downloading the website
website <- xml2::read_html("[URL]")
# Finding the counts
counts <- xml2::xml_find_first(website, "[XPath]")
# Turning it into text
counts <- as.character(counts)
# Extracting the data using regex
totalcount <- stringr::str_extract(counts, "[REGEX FOR NUMBERS WITH COMMAS/DOTS]")
# Removing any commas/dots so that we can convert to numeric
totalcount <- gsub('[REGEX FOR COMMAS/DOTS]', '', totalcount)
totalcount <- as.numeric(totalcount)
print(totalcount)
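For reference, one possible way to fill in the placeholders is sketched below. The XPath and regular expressions are assumptions, not the only valid answers, and PubMed's markup may change over time, so check the page source yourself.

website <- xml2::read_html("https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2")
# assumption: the result count is rendered in a <span class="value"> element
counts <- xml2::xml_find_first(website, "//span[@class='value']")
counts <- as.character(counts)
# grab a run of digits that may contain commas or dots
totalcount <- stringr::str_extract(counts, "[0-9,.]+")
# strip the commas/dots before converting to numeric
totalcount <- gsub("[,.]", "", totalcount)
totalcount <- as.numeric(totalcount)
print(totalcount)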
Question 3: Distribution of universities
Look through the first couple of abstracts and see how the author affiliations are formatted.
Using the function stringr::str_extract_all() applied to abstracts, capture all the terms of the form:
- “… University”
- “University of …”
- “… Institute of …”
Write a regular expression that captures all such instances.
library(stringr)
institution <- str_extract_all(
  abstracts, "[REGEX FOR INSTITUTIONS]"
)
institution <- unlist(institution)
table(institution)
What are the 10 most common institutions? Discuss how you could improve these results.
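As a sketch, one regular expression covering the three patterns above, together with a quick way to tabulate the most frequent matches, could look like the following. The pattern is an assumption rather than the only valid answer, and it will pick up some noise you can discuss.

# assumption: a single capitalized word on either side of "University"/"Institute of"
institution <- stringr::str_extract_all(
  abstracts,
  "[A-Z][A-Za-z-]+ University|University of [A-Z][A-Za-z-]+|[A-Z][A-Za-z-]+ Institute of [A-Z][A-Za-z-]+"
)
institution <- unlist(institution)
# 10 most common matches
head(sort(table(institution), decreasing = TRUE), 10)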
Question 4: Make a tidy dataset
We want to build a dataset that includes the journal, article title, authors, and affiliations for each paper. To do this, we will go back to the original input, but without doing all of the formatting we did previously.
# read in the text; each line becomes a separate character string
abstracts <- readLines('~/Downloads/abstract-sars-cov-2-set.txt', warn = FALSE)
# combine all text into one character
abstracts <- paste(abstracts, collapse = '\n')
# split the text whenever 3 new lines occur in a row (indicating two blank lines)
abstracts <- unlist(strsplit(abstracts, split = '\n\n\n'))
Now extract the journal title for each article. This is in the first line of each entry, immediately after the citation number (i.e., “1.”). Notice that we are using str_extract rather than str_extract_all, because we only want (at most) one result per entry.
journal <- str_extract(abstracts, "[YOUR REGULAR EXPRESSION]")
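One possible pattern, assuming each entry begins with a citation number such as “1. ” followed by the journal name and a period (adjust it if your export is formatted differently):

# skip the citation number and keep everything up to the next period
journal <- stringr::str_extract(abstracts, "(?<=[0-9]\\. )[^.]+")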
Now we’re going to extract the title of each article. This is the second non-empty line in each entry, with blank lines before and after. We could do this with a regular expression, but instead, we’ll use the strsplit() function, as we did above.
titles <- sapply(abstracts, function(x){
  unlist(strsplit(x, split = "\n\n"))[2]
}, USE.NAMES = FALSE)
Note that each title could still contain a single \n symbol, in the case of a particularly long title. We could remove those as we did the first time we read in the data.
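For example, a minimal cleanup step (assuming you simply want to replace any embedded newline with a space) could be:

# replace any leftover newline inside a title with a single space
titles <- stringr::str_replace_all(titles, "\n", " ")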
Use the same technique to extract the list of authors and call this object authors.
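A sketch of the same approach, assuming the author list is the third block of each entry (the block right after the title); check a few entries to confirm this holds for your export:

authors <- sapply(abstracts, function(x){
  unlist(strsplit(x, split = "\n\n"))[3]
}, USE.NAMES = FALSE)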
Now use a regular expression to extract the author affiliations section for each abstract. This always starts with a blank line followed by “Author information:” and ends with another blank line.
affiliations <- str_extract(abstracts, "[YOUR REGULAR EXPRESSION]")
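One way to write this, as a sketch, is to match from “Author information:” up to (but not including) the next blank line, letting the dot match newlines:

affiliations <- stringr::str_extract(
  abstracts,
  stringr::regex("\\n\\nAuthor information:.*?(?=\\n\\n)", dotall = TRUE)
)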
Finally, put everything together into a single data.frame and use knitr::kable to print the first five results.
papers <- data.frame(
  [DATA TO COMBINE]
)
knitr::kable(papers[1:5, ])
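For instance, assuming the four objects created above (journal, titles, authors, affiliations), the combination step might look like this:

papers <- data.frame(
  journal      = journal,
  title        = titles,
  authors      = authors,
  affiliations = affiliations
)
knitr::kable(papers[1:5, ])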
Done! Render the document, commit, and push.