A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. – Wikipedia
Regular Expressions: Why should you care?
We can use Regular Expressions for:
Validating data fields, email address, numbers, etc.
Searching text in various formats, e.g., addresses, there are many ways to write an address.
Replace text, e.g., different spellings, Storm, Stôrm, Stórm to Storm.
Remove text, e.g., tags from an HTML text, <name>George</name> to George.
Regex 101: Metacharacters
What makes regex special is metacharacters. While we can always use regex to match literals like dog, human, 1999, we only make use of all regex power when using metacharacters:
. Any character except new line
^ beginning of the text
$ end of the text
[expression] Match any single character in “expression”, e.g.
[0123456789] Any digit
[0-9] Any digit in the range 0-9
[a-z] Lower-case letters
[A-Z] Upper-case letters
[a-zA-Z] Lower or upper case letters.
[a-zA-Z0-9] Any alpha-numeric
Regex 101: Metacharacters (cont. 1)
[^regex] Match any except those in regex, e.g.
[^0123456789] Match any except a number
[^0-9] Match anything except in the range 0-9
[^./ ] any except dot, slash, and space.
Regex 101: Metacharacters (cont. 2)
Ranges, e.g., 0-9 or a-z, are locale- and implementation-dependent, meaning that the range of lower case letters may vary depending on the OS’s language. To solve for this problem, you could use Character classes. Some examples:
[[:lower:]] lower case letters in the current locale, could be [a-z]
[[:upper:]] upper case letters in the current locale, could be [A-Z]
[[:alpha:]] upper and lower case letters in the current locale, could be [a-zA-Z]
[[:digit:]] Digits: 0 1 2 3 4 5 6 7 8 9
[[:alnum:]] Alpha numeric characters [[:alpha:]] and [[:digit:]].
.{5} Match any character (except line end) five times.
n{2} Match the letter n twice.
[0-9] Match any number once.
[0-9]+ Match any number at least once.
[a-zA-Z]+ Match any lower or upper case letter at least once.
\\s[a-zA-Z]+\\s Match a space, any lower or upper case letter at least once, and a space.
\\s[[:alpha:]]+\\s Same as before but this time with the [[:alpha:]] character class.
Regex 101: Examples (cont. 1)
Here we are extracting the first occurrence of the following regular expressions (using stringr::str_extract()):
regex
Hanna Perez [name]
The 年 year was 1999
HaHa, @abc said that
GoGo trojans #2025!
.{5}
Hanna
The 年
HaHa,
GoGo
n{2}
nn
[0-9]
1
2
[0-9]+
1999
2025
[a-zA-Z]+
Hanna
The
HaHa
GoGo
\\s[a-zA-Z]+\\s
Perez
year
said
trojans
\\s[[:alpha:]]+\\s
Perez
年
said
trojans
Regex 101: Examples (cont. 3)
[a-zA-Z]+ [a-zA-Z]+ Match two sets of letters separated by one space.
[a-zA-Z]+\\s? Match one set of letters, maybe followed by a white space.
([a-zA-Z]+)\\1 Match any lower or upper case letter at least once and then match the same pattern again.
(@|#)[a-z0-9]+ Match either the @ or # symbol, followed by one or more lower case letter or number.
(?<=#|@)[a-z0-9]+ Match one or more lower case letter or number that follows either the @ or # symbol.
\\[[a-z]+\\] Match the symbol [, at least one lower case letter, then symbol ].
Regex 101: Examples (cont. 2)
regex
Hanna Perez [name]
The 年 year was 1999
HaHa, @abc said that
GoGo trojans #2025!
[a-zA-Z]+ [a-zA-Z]+
Hanna Perez
year was
abc said
GoGo trojans
[a-zA-Z]+\\s?
Hanna
The
HaHa
GoGo
([a-zA-Z]+)\\1
nn
HaHa
GoGo
(@|#)[a-z0-9]+
@abc
#2025
(?<=#|@)[a-z0-9]+
abc
2025
Regex 101: Examples (cont. 3)
Regex 101: Functions in R
Find in text: grepl(), stringr::str_detect().
Similar to which(), which elements are TRUEgrep(), stringr::str_which()
Replace the first instance: sub(), stringr::str_replace()
Replace all instances: gsub(), stringr::str_replace_all()
Extract text: regmatches() plus regexpr(), stringr::str_extract() and stringr::str_extract_all().
Regex 101: Functions in R (cont. 1)
For example, like in Twitter, let’s create a regex that matches usernames or hashtags with the following pattern:
(@|#)([[:alnum:]]+)
Code
@Hanna Perez [name] #html
The @年 year was 1999
HaHa, @abc said that @z
str_detect(text, pattern) or grepl(pattern, text)
TRUE
TRUE
TRUE
str_extract(text, pattern) or regmatches(...)
@Hanna
@年
@abc
str_extract_all(text, pattern)
@Hanna, #html
@年
@abc, @z
Regex 101: Functions in R (cont. 2)
Pattern: (@|#)([[:alnum:]]+)
Code
@Hanna Perez [name] #html
The @年 year was 1999
HaHa, @abc said that @z
str_replace(text, pattern, "\\1justinbieber") or sub(...)
@justinbieber Perez [name] #html
The @justinbieber year was 1999
HaHa, @justinbieber said that @z
str_replace_all(text, pattern, "\\1justinbieber") or gsub(...)
@justinbieber Perez [name] #justinbieber
The @justinbieber year was 1999
HaHa, @justinbieber said that @justinbieber
Note: It renders oddly in the table, but there is no space or line break in the group replacement, \\1.
Data
This week we will continue using our text mining dataset:
# Where are we getting the data frommts_url <-"https://github.com/USCbiostats/data-science-data/raw/master/00_mtsamples/mtsamples.csv"# Downloading the data to a tempfile (so it is destroyed afterwards)tmp <-tempfile(pattern ="mtsamples", fileext =".csv")# We should be downloading this, ONLY IF this was not downloaded already.# otherwise is just a waste of time.if (!file.exists(tmp)) {download.file(url = mts_url,destfile = tmp,# method = "libcurl", timeout = 1000 (you may need this option) )}# read the filemtsamples <-read.csv(tmp, header =TRUE, row.names =1)
Regex Lookup Text: Tumor
For each entry, we want to know if it is tumor-related. For that we can use the following code:
# How many entries contain the word tumor?sum(grepl("tumor", mtsamples$description, ignore.case =TRUE))
[1] 80
# Generating a column tagging tumormtsamples$tumor_related <-grepl("tumor", mtsamples$description, ignore.case =TRUE)# Taking a look at a few examplesmtsamples$description[mtsamples$tumor_related ==TRUE][1:3]
[1] " Transurethral resection of a medium bladder tumor (TURBT), left lateral wall."
[2] " Transurethral resection of the bladder tumor (TURBT), large."
[3] " Cystoscopy, transurethral resection of medium bladder tumor (4.0 cm in diameter), and direct bladder biopsy."
Notice the ignore.case = TRUE. This is equivalent to transforming the text to lower case using tolower() before passing the text to the regular expression function.
Regex Lookup text: Pronoun of the patient
Now, let’s try to guess the pronoun of the patient. To do so, we could tag by using the words he, his, him, they, them, theirs, ze, hir, hirs, she, hers, her (see this article on sexist text):
Regex Lookup text: Pronoun of the patient (cont. 1)
mtsamples$transcription[1]
[1] "SUBJECTIVE:, This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals: Weight was 130 pounds and blood pressure 124/78.,HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.,Neck: Supple without adenopathy.,Lungs: Clear.,ASSESSMENT:, Allergic rhinitis.,PLAN:,1. She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. She does not think she has prescription coverage so that might be cheaper.,2. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well."
mtsamples$pronoun[1]
[1] "his"
Regex Lookup text: Pronoun of the patient (cont. 1)
To correct this issue, we can make our regular expression more precise:
anal cancer bladder cancer breast cancer colon cancer
1 8 21 14
endometrial cancer esophageal cancer lung cancer ovarian cancer
5 2 13 1
papillary cancer prostate cancer uterine cancer
3 17 7
Web Scraping
Fundamentals of Web Scraping
What?
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites – Wikipedia
How?
The rvest R package provides various tools for reading and processing web data.
Under the hood, rvest is a wrapper of the xml2 and httr R packages.
We would like to capture the table of COVID-19 death rates per country directly from Wikipedia.
library(rvest)library(xml2)# Reading the HTML table with the function xml2::read_htmlcovid <-read_html(x ="https://en.wikipedia.org/wiki/COVID-19_pandemic_death_rates_by_country" )# Let's look at the the outputcovid
You can request data, the GET method, post data, the POST method, and do many other things using the HTTP protocol.
How in R?
For this part, we will be using the httr() package, which is a wrapper of the curl() package, which in turn provides access to the curl library that is used to communicate with APIs.
We want to know what genes (human or not) are involved in the function antiviral innate immune response (go term GO:0140374), looking only at those annotations that have evidence code ECO:0000006 (experimental evidence):
library(httr)go_query <-GET(url ="http://api.geneontology.org/",path ="api/bioentity/function/GO:0140374/genes",query =list(evidence ="ECO:0000006",relationship_type ="involved_in" ), # May need to pass this option to curl to allow to wait for at least# 60 seconds before returning error.config =config(connecttimeout =60 ))
We could have also passed the full URL directly…
* Web API Example 1: Gene Ontology (cont. 1)
Let’s take a look at the curl call:
curl-X GET "http://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in"-H"accept: application/json"
The structure of the result will depend on the API. In this case, the output was a JSON file, so the content function returns a list in R. In other scenarios it could return an XML object.
knitr::kable(head(dat),caption ="Genes experimentally annotated with the function\ **antiviral innate immune response** (GO:0140374)" )
Genes experimentally annotated with the function antiviral innate immune response (GO:0140374)
Gene
taxon_id
taxon_label
UniProtKB:H9GKP0
NCBITaxon:28377
Anolis carolinensis
UniProtKB:H9GSI3
NCBITaxon:28377
Anolis carolinensis
UniProtKB:A0A287AMJ0
NCBITaxon:9823
Sus scrofa
UniProtKB:A0A287AKR1
NCBITaxon:9823
Sus scrofa
UniProtKB:F7EI59
NCBITaxon:9544
Macaca mulatta
UniProtKB:A0A6I8NTG1
NCBITaxon:9258
Ornithorhynchus anatinus
* Web API Example 2: Using Tokens
Sometimes, APIs are not completely open, you need to register.
The API may require to login (user+password), or pass a token.
In this example, I’m using a token which I obtained here
Here is a last example. We will use the Department of Health and Human Services API for “[…] demographic-specific health recommendations” (details at health.gov)
* Web API Example 3: HHS health recommendation (cont. 1)
Let’s see the response
health_advises
Response [https://odphp.health.gov/myhealthfinder/api/v3/myhealthfinder.json?lang=en&age=32&sex=male&tobaccoUse=0]
Date: 2025-08-29 14:55
Status: 200
Content-Type: application/json
Size: 310 kB
{
"Result": {
"Error": "MyHealthfinder API version 3 is no longer supported. Please...
"Total": 18,
"Query": {
"ApiVersion": "3",
"ApiType": "myhealthfinder",
"TopicId": null,
"CategoryId": null,
"Keyword": null,
...
* Web API Example 3: HHS health recommendation (cont. 2)
# Extracting the contenthealth_advises_ans <-content(health_advises)# Getting the titlestxt <-with(health_advises_ans$Result$Resources, c(sapply(all$Resource, "[[", "Title"),sapply(some$Resource, "[[", "Title"),sapply(`You may also be interested in these health topics:`$Resource, "[[", "Title")))cat(txt, sep ="; ")
Quit Smoking; Hepatitis C Screening: Questions for the Doctor; Protect Yourself from Seasonal Flu; Talk with Your Doctor About Depression; Get Your Blood Pressure Checked; Get Tested for HIV; Get Vaccines to Protect Your Health (Adults Ages 19 to 49 Years); Drink Alcohol Only in Moderation; Talk with Your Doctor About Drug Misuse and Substance Use Disorder; Aim for a Healthy Weight; Testing for Syphilis: Questions for the Doctor; Eat Healthy; Protect Yourself from Hepatitis B; Testing for Latent Tuberculosis: Questions for the Doctor; Manage Stress; Alcohol Use: Conversation Starters; Get Active; Quitting Smoking: Conversation Starters
Summary
We learned about regular expressions with the package stringr (a wrapper of stringi)
We can use regular expressions to detect (str_detect()), replace (str_replace()), and extract (str_extract()) expressions.
We looked at web scraping using the rvest package (a wrapper of xml2).
We extracted elements from the HTML/XML using xml_find_all() with XPath expressions.
We also used the html_table() function from rvest to extract tables from HTML documents.
We took a quick review on Web APIs and the Hyper-text-transfer-protocol (HTTP).
We used the httr R package (wrapper of curl) to make GET requests to various APIs
We even showed an example using a token passed via the header.
Once we got the responses, we used the content() function to extract the message of the response.
Detour on CURL options
Sometimes you will need to change the default set of options in CURL. You can checkout the list of options in curl::curl_options(). A common hack is to extend the time-limit before dropping the conection, e.g.:
Using the Health IT API from the US government, we can obtain the Electronic Prescribing Adoption and Use by County (see docs here)
The problem is that it usually takes longer to get the data, so we pass the config option connecttimeout (which corresponds to the flag --connect-timeout) in the curl call (see next slide)