class: center, middle, title-slide .title[ # Week 7: Scraping, APIs, and Regular Expressions ] .subtitle[ ## PM 566: Introduction to Health Data Science ] .author[ ### George G. Vega Yon ] --- ## Today's goals - Introduction to Regular Expressions - Understand the fundamentals of Web Scraping - Learn how to use an API --- ## Regular Expressions: What is it? > A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. -- [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression) <img src="https://imgs.xkcd.com/comics/regular_expressions.png" width="450px"> --- ## Regular Expressions: Why should you care? We can use Regular Expressions for: - Validating data fields, email address, numbers, etc. - Searching text in various formats, e.g., addresses, there are many ways to write an address. - Replace text, e.g., different spellings, `Storm`, `Stôrm`, `Stórm` to `Storm`. - Remove text, e.g., tags from an HTML text, `<name>George</name>` to `George`. --- ## Regular Expressions 101: Metacharacters What makes *regex* special is metacharacters. While we can always use *regex* to match literals like `dog`, `human`, `1999`, we only make use of all *regex* power when using metacharacters: - `.` Any character except new line - `^` beginning of the text - `$` end of the text - `[regex]` Match a single character in `regex`, e.g. - `[0123456789]` Any number - `[0-9]` Any number in the range 0-9 - `[a-z]` Lower-case letters - `[A-Z]` Upper-case letters - `[a-zA-Z]` Lower or upper case letters. - `[a-zA-Z0-9]` Any alpha-numeric - `[^regex]` Match any except those in `regex`, e.g. - `[^0123456789]` Match any except a number - `[^0-9]` Match anything except in the range 0-9 - `[^./ ]` any except dot, slash, and space. --- ## Regular Expressions 101: Metacharacters (cont. 1) Ranges, e.g., `0-9` or `a-z`, are locale- and implementation-dependent, meaning that the range of lower case letters may vary depending on the OS's language. To solve for this problem, you could use [Character classes](https://en.wikipedia.org/wiki/Regular_expression#Character_classes). Some examples: - `[:lower:]` lower case letters in the current locale, could be `[a-z]` - `[:upper:]` upper case letters in the current locale, could be `[A-Z]` - `[:alpha:]` upper and lower case letters in the current locale, could be `[a-zA-Z]` - `[:digit:]` Digits: 0 1 2 3 4 5 6 7 8 9 - `[:alnum:]` Alpha numeric characters `[:alpha:]` and `[:digit:]`. - `[:punct:]` Punctuation characters: ! " \# $ % & ' ( ) * + , - . / : ; < = > ? @ [ \\ ] ^ _ ` \{ | \} ~. For example, in the locale `en_US`, the word `Ḧóla` IS NOT fully matched by `[a-zA-Z]+`, but IT IS fully matched by `[[:alpha:]]+`. Other important Metacharacters: - `\s` white space, equivalent to `[\r\n\t\f\v ]` - `|` or (logical or). --- ## Regular Expressions 101: Metacharacters (cont. 2) These usually come together with specifying how many times (repetition): - `regex?` Zero or one match. - `regex*` Zero or more matches - `regex+` One or more matches - `regex{n,}` At least `n` matches - `regex{,m}` at most `m` matches - `regex{n,m}` Between `n` and `m` matches. Where `regex` is a regular expression --- ## Regular Expressions 101: Metacharacters (cont. 3) There are other operators that can be very useful, - `(regex)` Group capture. - `(?:regex)` Group operation without capture. - `(?=regex)` Look ahead (match) - `(?!regex)` Look ahead (don't match) - `(?<=regex)` Look behind (match) - `(?<!regex)` Look behind (don't match) Group captures can be reused with `\1`, `\2`, ..., `\n`. More (great) information here https://regex101.com/ --- ## Regular Expressions 101: Examples Here we are extracting the first occurrence of the following regular expressions (using `stringr::str_extract()`): <table> <thead> <tr> <th style="text-align:left;"> regex </th> <th style="text-align:left;"> Hanna Perez [name] </th> <th style="text-align:left;"> The 年 year was 1999 </th> <th style="text-align:left;"> HaHa, @abc said that </th> <th style="text-align:left;"> GoGo trojans #2020! </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> .{5} </td> <td style="text-align:left;"> Hanna </td> <td style="text-align:left;"> The 年 </td> <td style="text-align:left;"> HaHa, </td> <td style="text-align:left;"> GoGo </td> </tr> <tr> <td style="text-align:left;"> n{2} </td> <td style="text-align:left;"> nn </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> [0-9]+ </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 2020 </td> </tr> <tr> <td style="text-align:left;"> \s[a-zA-Z]+\s </td> <td style="text-align:left;"> Perez </td> <td style="text-align:left;"> year </td> <td style="text-align:left;"> said </td> <td style="text-align:left;"> trojans </td> </tr> <tr> <td style="text-align:left;"> \s[[:alpha:]]+\s </td> <td style="text-align:left;"> Perez </td> <td style="text-align:left;"> 年 </td> <td style="text-align:left;"> said </td> <td style="text-align:left;"> trojans </td> </tr> <tr> <td style="text-align:left;"> [a-zA-Z]+ [a-zA-Z]+ </td> <td style="text-align:left;"> Hanna Perez </td> <td style="text-align:left;"> year was </td> <td style="text-align:left;"> abc said </td> <td style="text-align:left;"> GoGo trojans </td> </tr> <tr> <td style="text-align:left;"> ([a-zA-Z]+\s?){2} </td> <td style="text-align:left;"> Hanna Perez </td> <td style="text-align:left;"> The </td> <td style="text-align:left;"> HaHa </td> <td style="text-align:left;"> GoGo trojans </td> </tr> <tr> <td style="text-align:left;"> ([a-zA-Z]+)\1 </td> <td style="text-align:left;"> nn </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> HaHa </td> <td style="text-align:left;"> GoGo </td> </tr> <tr> <td style="text-align:left;"> (@|#)[a-z0-9]+ </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> @abc </td> <td style="text-align:left;"> #2020 </td> </tr> <tr> <td style="text-align:left;"> (?<=#|@)[a-z0-9]+ </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> abc </td> <td style="text-align:left;"> 2020 </td> </tr> <tr> <td style="text-align:left;"> \[[a-z]+\] </td> <td style="text-align:left;"> [name] </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- ## Regular Expressions 101: Examples (cont. 1) 1. .{5} Match **any character** (except line end) five times. 2. n{2} Match the letter **n** twice. 3. [0-9]+ Match **any number** at least once 4. \s[a-zA-Z]+\s Match a **space**, **any lower or upper case letter** at least once, and a **space**. 5. \s[[:alpha:]]+\s Same as before but this time . 6. [a-zA-Z]+ [a-zA-Z]+ Match two sets of letters separated by one space. 7. ([a-zA-Z]+\s?){2} Match **any lower or upper case letter** at least once, maybe followed by a white space, twice. 8. ([a-zA-Z]+)\1 Match **any lower or upper case letter** at least once, and then match the same pattern again. 9. (@|#)[a-z0-9]+ Match either the `@` or `#` symbol, followed by one or more **lower case letter** or **number**. 10. (?<=#|@)[a-z0-9]+ Match one or more **lower case letter** or **number** that follows either the `@` or `#` symbol. 11. \\ [[a-z]+\\ ] Match the symbol `[`, at least one **lower case letter**, and the symbol `]`. --- ## Regular Expressions 101: Functions in R 1. Lookup text: `base::grepl()`, `stringr::str_detect()`. 2. Similar to `which()`, which elements are `TRUE` `base::grep()`, `stringr::str_which()` 3. Replace the first instance: `base::sub()`, `stringr::str_replace()` 4. Replace all instances: `base::gsub()`, `stringr::str_replace_all()` 5. Extract text: `base::regmatches()`, `stringr::str_extract()` and `stringr::str_extract_all()`. --- ## Regular Expressions 101: Functions in R (cont.) For example, like in Twitter, let's create a regex that matches usernames or hashtags with the following pattern: `(@|#)([[:alnum:]]+)` |Code |@Hanna Perez [name] #html |The @年 year was 1999 |HaHa, @abc said that @z | |:-----------------------------------------------------|:----------------------------------------|:-------------------------------|:-------------------------------------------| |`str_detect(text, pattern)` or `grepl(pattern, text)` |TRUE |TRUE |TRUE | |`str_extract(text, pattern)` |@Hanna |@年 |@abc | |`str_extract_all(text, pattern)` |[@Hanna, #html] |[@年] |[@abc, @z] | |`str_replace(text, pattern, "\1justinbieber")` |@justinbieber Perez [name] #html |The @justinbieber year was 1999 |HaHa, @justinbieber said that @z | |`str_replace_all(text, pattern, "\1justinbieber")` |@justinbieber Perez [name] #justinbieber |The @justinbieber year was 1999 |HaHa, @justinbieber said that @justinbieber | **Note**: While it is not showing in the table, the group replacement was scaped, i.e., `\\1` instead of `\1` in the code. --- ## Data This week we will continue using Textmining dataset (together with the `data.table` and `stringr` packages) ```r library(data.table) library(stringr) fn <- "mtsamples.csv" if (!file.exists(fn)) download.file( url = "https://github.com/USCbiostats/data-science-data/raw/master/00_mtsamples/mtsamples.csv", destfile = fn ) mtsamples <- fread(fn, sep = ",", header = TRUE) ``` --- ## Regex Lookup Text: Tumor We would like to see if this is tumor related entry. For that we can simply use the following code: ```r # How many entries contain the word tumor mtsamples[grepl("tumor", description, ignore.case = TRUE), .N] ``` ``` ## [1] 80 ``` ```r # Generating a column tagging tumor mtsamples[, tumor_related := grepl("tumor", description, ignore.case = TRUE)] # Taking a look at a few examples mtsamples[tumor_related == TRUE, .(description)][1:3,] ``` ``` ## description ## 1: Transurethral resection of a medium bladder tumor (TURBT), left lateral wall. ## 2: Transurethral resection of the bladder tumor (TURBT), large. ## 3: Cystoscopy, transurethral resection of medium bladder tumor (4.0 cm in diameter), and direct bladder biopsy. ``` Notice the `ignore.case = TRUE`. This is equivalent to transforming the text to lower case using `tolower()` before passing the text to the regular expression function. --- ## Regex Lookup text: Pronoun of the patient Now, let's try to guess the pronoun of the patient. To do so, we could tag by using the words *he, his, him, they, them, theirs, ze, hir, hirs, she, hers, her* (see [this article on sexist text](https://dictionary.cambridge.org/grammar/british-grammar/sexist-language?q=He%2C+she%2C+him%2C+her%2C+his%2C+hers)): ```r mtsamples[, pronoun := str_extract( string = tolower(transcription), pattern = "he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her" )] ``` What is the problem with this approach? --- ## Regex Lookup text: Pronoun of the patient (cont. 1) For this we use the following regular expression: `(?<=\W|^)(he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her)(?=\W|$)` Bit by bit this is: - `(?<=regex)` lookback search. - `\W` any non-alphanumeric character, this is equivalent to `[^[:alnum:]]`, `|` or - `^` the beginning of text, - `he|his|him...` any of these words, - `(?=regex)` followed by, - `\W` any non-alphanumeric character, this is equivalent to `[^[:alnum:]]`, `|` or - `$` the end of the text. ```r mtsamples[, pronoun := str_extract( string = tolower(transcription), pattern = "(?<=\\W|^)(he|his|him|they|them|theirs|ze|hir|hirs|she|hers|her)(?=\\W|$)" )] mtsamples[1:10, pronoun] ``` ``` ## [1] "she" "he" "he" NA NA "she" "she" NA NA NA ``` --- ## Regex Lookup text: Pronoun of the patient (cont. 2) ```r mtsamples[, table(pronoun, useNA = "always")] ``` ``` ## pronoun ## he her him his she them they <NA> ## 1073 554 39 524 1261 23 96 1429 ``` --- ## Regex Extract Text: Type of Cancer - Imagine now that you need to see the types of Cancer mentioned in the data. - For simplicity, let's assume that, if specified, it is in the form of `TYPE cancer`, i.e. single word. - We are interested in the word before cancer, how can we capture this? --- ## Regex Extract Text: Type of Cancer (cont 1.) We can just try to **extract** the phrase `"[some word] cancer"`, in particular, we could use the following regular expression `[[:alnum:]-_]{4,}\s*cancer` Where - `[[:alnum:]-_]{4,}` captures any alphanumeric character, including `-` and `_`. Furthermore, for this match to work there must be at least 4 characters, - `\s*` captures 0 or more white-spaces, and - `cancer` captures the word cancer: ```r mtsamples[, cancer_type := str_extract(tolower(keywords), "[[:alnum:]-_]{4,}\\s*cancer")] mtsamples[, table(cancer_type)] ``` ``` ## cancer_type ## anal cancer bladder cancer breast cancer colon cancer ## 1 8 21 14 ## endometrial cancer esophageal cancer lung cancer ovarian cancer ## 5 2 13 1 ## papillary cancer prostate cancer uterine cancer ## 3 17 7 ``` --- ## Fundamentals of Web Scraping **What?** > Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites -- [Wikipedia](https://en.wikipedia.org/wiki/Web_scraping) **How?** - The [`rvest`](https://cran.r-project.org/package=rvest) R package provides various tools for reading and processing web data. - Under the hood, `rvest` is a wrapper of the [`xml2`](https://cran.r-project.org/package=xml2) and [`httr`](https://cran.r-project.org/package=httr) R packages. (in the case of [dynamic websites](https://en.wikipedia.org/wiki/Dynamic_web_page), take a look at [selenium](https://en.wikipedia.org/wiki/Selenium_(software))) --- ## Web scraping raw HTML: Example We would like to capture the table of COVID-19 death rates per country directly from Wikipedia. ```r library(rvest) library(xml2) # Reading the HTML table with the function xml2::read_html covid <- read_html( x = "https://en.wikipedia.org/wiki/COVID-19_pandemic_death_rates_by_country" ) # Let's look at the the output covid ``` ``` ## {html_document} ## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled" lang="en" dir="ltr"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ... ``` --- ## Web scraping raw HTML: Example (cont 1.) - We want to get the HTML table that shows up in the doc. To do so, we can use the function `xml2::xml_find_all()` and `rvest::html_table()` - The first will locate the place in the document that matches a given **XPath** expression. - [XPath](https://en.wikipedia.org/wiki/XPath), XML Path Language, is a query language to select nodes in a XML document. - A nice tutorial can be found [here](https://www.w3schools.com/xml/xpath_intro.asp) - Modern Web browsers make it easy to use XPath! Live Example! (inspect elements in [Google Chrome](https://developers.google.com/web/tools/chrome-devtools/open), [Mozilla Firefox](https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector/How_to/Open_the_Inspector), [Internet Explorer](https://docs.microsoft.com/en-us/microsoft-edge/devtools-guide-chromium/ie-mode), and [Safari](https://developer.apple.com/library/archive/documentation/NetworkingInternetWeb/Conceptual/Web_Inspector_Tutorial/EditingCode/EditingCode.html#//apple_ref/doc/uid/TP40017576-CH4-DontLinkElementID_25)) --- ## Web scraping with `xml2` and the `rvest` package (cont. 2) Now that we know the path, let's use that and extract ```r table <- xml_find_all(covid, xpath = '//*[@id="covid-19-pandemic-cases-and-mortality-by-country"]/div[5]/table') table <- html_table(table) # This returns a list of tables head(table[[1]]) ``` ``` ## # A tibble: 6 × 4 ## Country `Deaths / million` Deaths Cases ## <chr> <chr> <chr> <chr> ## 1 World[a] 872 6,960,770 771,150,460 ## 2 Peru 6,511 221,704 4,520,727 ## 3 Bulgaria 5,664 38,414 1,302,188 ## 4 Bosnia and Herzegovina 5,057 16,354 403,155 ## 5 Hungary 4,896 48,807 2,206,311 ## 6 North Macedonia 4,750 9,946 349,104 ``` --- ## Web APIs **What?** > A Web API is an application programming interface for either a web server or a web browser. -- [Wikipedia](https://en.wikipedia.org/wiki/Web_API) Some examples include: [twitter API](https://developer.twitter.com/en), [facebook API](https://developers.facebook.com/), [Gene Ontology API](http://api.geneontology.org/api) **How?** You can request data, the **GET method**, post data, the **POST method**, and do many other things using the [HTTP protocol](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol). **How in R?** For this part, we will be using the `httr()` package, which is a wrapper of the `curl()` package, which in turn provides access to the `curl` library that is used to communicate with APIs. --- ## Web APIs with curl <div align="center"> <img src="https://cdn.tutsplus.com/net/authors/jeremymcpeak/http1-url-structure.png" width="700px"> <br> Structure of a URL (source: <a href="https://code.tutsplus.com/tutorials/http-the-protocol-every-web-developer-must-know-part-1--net-31177" target="_blank">"HTTP: The Protocol Every Web Developer Must Know - Part 1"</a>) </div> --- ## Web APIs with curl Under the hood, the `httr` (and thus `curl`) sends request somewhat like this ```bash curl -X GET https://google.com -w "%{content_type}\n%{http_code}\n" ``` A get request (`-X GET`) to `https://google.com`, which also includes (`-w`) the following: `content_type` and `http_code`: ```html <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8"> <TITLE>301 Moved</TITLE></HEAD><BODY> <H1>301 Moved</H1> The document has moved <A HREF="https://www.google.com/">here</A>. </BODY></HTML> text/html; charset=UTF-8 301 ``` We use the `httr` R package to make life easier. --- ## Web API Example 1: Gene Ontology - We will make use of the [Gene Ontology API](http://api.geneontology.org/api). - We want to know what genes (human or not) are **involved in** the function **antiviral innate immune response** (go term [GO:0140374](http://amigo.geneontology.org/amigo/term/GO:0140374)), looking only at those annotations that have evidence code [ECO:0000006](https://evidenceontology.org/browse/#ECO_0000006) (experimental evidence): ```r library(httr) go_query <- GET( url = "http://api.geneontology.org/", path = "api/bioentity/function/GO:0140374/genes", query = list( evidence = "ECO:0000006", relationship_type = "involved_in" ), # May need to pass this option to curl to allow to wait for at least # 60 seconds before returning error. config = config( connecttimeout = 60 ) ) ``` We could have also passed the full URL directly... --- ## Web API Example 1: Gene Ontology (cont. 1) Let's take a look at the curl call: ```bash curl -X GET "http://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in" -H "accept: application/json" ``` What `httr::GET()` does: ```r > go_query$request ## <request> ## GET http://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in ## Output: write_memory ## Options: ## * useragent: libcurl/7.58.0 r-curl/4.3 httr/1.4.1 ## * connecttimeout: 60 ## * httpget: TRUE ## Headers: ## * Accept: application/json, text/xml, application/xml, */* ``` --- ## Web API Example 1: Gene Ontology (cont. 2) Let's take a look at the response: ```r go_query ``` ``` ## Response [https://api.geneontology.org/api/bioentity/function/GO:0140374/genes?evidence=ECO%3A0000006&relationship_type=involved_in] ## Date: 2023-10-06 02:48 ## Status: 200 ## Content-Type: application/json ## Size: 418 kB ``` Remember the codes: - 1xx: Information message - 2xx: Success - 3xx: Redirection - 4xx: Client error - 5xx: Server error --- ## Web API Example 1: Gene Ontology (cont. 3) We can extract the results using the `httr::content()` function ```r dat <- content(go_query) dat <- lapply(dat$associations, function(a) { data.frame( Gene = a$subject$id, taxon_id = a$subject$taxon$id, taxon_label = a$subject$taxon$label ) }) dat <- do.call(rbind, dat) str(dat) ``` ``` ## 'data.frame': 390 obs. of 3 variables: ## $ Gene : chr "UniProtKB:A0A5F8GBV6" "UniProtKB:H2XPN7" "UniProtKB:E2QXT4" "UniProtKB:Q9GLV6" ... ## $ taxon_id : chr "NCBITaxon:13616" "NCBITaxon:7719" "NCBITaxon:9615" "NCBITaxon:9823" ... ## $ taxon_label: chr "Monodelphis domestica" "Ciona intestinalis" "Canis lupus familiaris" "Sus scrofa" ... ``` --- ## Web API Example 1: Gene Ontology (cont. 4) The structure of the result will depend on the API. In this case, the output was a JSON file, so the content function returns a list in R. In other scenarios it could return an XML object (we will see more in the lab) ```r knitr::kable(head(dat), caption = "Genes experimentally annotated with the function\ **antiviral innate immune response** (GO:0140374)" ) ``` Table: Genes experimentally annotated with the function **antiviral innate immune response** (GO:0140374) |Gene |taxon_id |taxon_label | |:--------------------|:---------------|:----------------------| |UniProtKB:A0A5F8GBV6 |NCBITaxon:13616 |Monodelphis domestica | |UniProtKB:H2XPN7 |NCBITaxon:7719 |Ciona intestinalis | |UniProtKB:E2QXT4 |NCBITaxon:9615 |Canis lupus familiaris | |UniProtKB:Q9GLV6 |NCBITaxon:9823 |Sus scrofa | |UniProtKB:Q5QSL2 |NCBITaxon:8022 |Oncorhynchus mykiss | |UniProtKB:B5XBW9 |NCBITaxon:8030 |Salmo salar | --- ## Web API Example 2: Using Tokens - Sometimes, APIs are not completely open, you need to register. - The API may require to login (user+password), or pass a token. - In this example, I'm using a token which I obtained [here](https://www.ncdc.noaa.gov/cdo-web/token) - You can find information about the [National Centers for Environmental Information](https://www.ncdc.noaa.gov/) API [here](https://www.ncdc.noaa.gov/cdo-web/webservices/v2) --- ## Web API Example 2: Using Tokens (cont. 1) - The way to pass the token will depend on the API service. - Some require authentication, others need you to pass it as an argument of the query, i.e., directly in the URL. - In this case, we pass it on the header. ```r stations_api <- GET( url = "https://www.ncdc.noaa.gov", path = "cdo-web/api/v2/stations", config = add_headers( token = "[YOUR TOKEN HERE]" ), query = list(limit = 1000) ) ``` This is equivalent to using the following query ```bash curl --header "token: [YOUR TOKEN HERE]" \ https://www.ncdc.noaa.gov/cdo-web/api/v2/stations?limit=1000 ``` **Note**: This won't run, you need to get your own token --- ## Web API Example 2: Using Tokens (cont. 2) Again, we can recover the data using the `content()` function: ```r ans <- content(stations_api) ans$results[[1]] ## $elevation ## [1] 139 ## ## $mindate ## [1] "1948-01-01" ## ## $maxdate ## [1] "2014-01-01" ## ## $latitude ## [1] 31.5702 ## ## $name ## [1] "ABBEVILLE, AL US" ## ## $datacoverage ## [1] 0.8813 ## ## $id ## [1] "COOP:010008" ``` --- ## Web API Example 3: HHS health recommendation Here is a last example. We will use the Department of Health and Human Services API for "[...] demographic-specific health recommendations" (details at [health.gov](https://health.gov/our-work/health-literacy/consumer-health-content/free-web-content/apis-developers/documentation)) ```r health_advises <- GET( url = "https://health.gov/", path = "myhealthfinder/api/v3/myhealthfinder.json", query = list( lang = "en", age = "32", sex = "male", tobaccoUse = 0 ), config = c( add_headers(accept = "application/json"), config(connecttimeout = 60) ) ) ``` --- ## Web API Example 3: HHS health recommendation (cont. 1) Let's see the response ```r health_advises ``` ``` ## Response [https://health.gov/myhealthfinder/api/v3/myhealthfinder.json?lang=en&age=32&sex=male&tobaccoUse=0] ## Date: 2023-10-06 02:49 ## Status: 200 ## Content-Type: application/json ## Size: 365 kB ## { ## "Result": { ## "Error": "False", ## "Total": 18, ## "Query": { ## "ApiVersion": "3", ## "ApiType": "myhealthfinder", ## "TopicId": null, ## "ToolId": null, ## "CategoryId": null, ## ... ``` --- ## Web API Example 3: HHS health recommendation (cont. 2) ```r # Extracting the content health_advises_ans <- content(health_advises) # Getting the titles txt <- with(health_advises_ans$Result$Resources, c( sapply(all$Resource, "[[", "Title"), sapply(some$Resource, "[[", "Title"), sapply(`You may also be interested in these health topics:`$Resource, "[[", "Title") )) cat(txt, sep = "; ") ``` Hepatitis C Screening: Questions for the Doctor; Quit Smoking; Protect Yourself from Seasonal Flu; Talk with Your Doctor About Depression; Drink Alcohol Only in Moderation; Get Vaccines to Protect Your Health (Adults Ages 19 to 49); Get Tested for HIV; Get Your Blood Pressure Checked; Talk with Your Doctor About Drug Misuse; Aim for a Healthy Weight; Eat Healthy; Testing for Syphilis: Questions for the Doctor; Protect Yourself from Hepatitis B; Testing for Latent Tuberculosis: Questions for the Doctor; Manage Stress; Alcohol Use: Conversation Starters; Get Active; Quitting Smoking: Conversation Starters --- ## Summary - We learned about regular expressions with the package **stringr** (a wrapper of **stringi**) - We can use regular expressions to detect (`str_detect()`), replace (`str_replace()`), and extract (`str_extract()`) expressions. - We looked at web scraping using the **rvest** package (a wrapper of **xml2**). - We extracted elements from the HTML/XML using `xml_find_all()` with XPath expressions. - We also used the `html_table()` function from rvest to extract tables from HTML documents. - We took a quick review on Web APIs and the Hyper-text-transfer-protocol (HTTP). - We used the **httr** R package (wrapper of **curl**) to make `GET` requests to various APIs - We even showed an example using a token passed via the `header`. - Once we got the responses, we used the `content()` function to extract the message of the response. --- ## Detour on CURL options Sometimes you will need to change the default set of options in CURL. You can checkout the list of options in `curl::curl_options()`. A common hack is to extend the time-limit before dropping the conection, e.g.: Using the **Health IT** API from the US government, we can obtain the **Electronic Prescribing Adoption and Use by County** (see docs [here](https://dashboard.healthit.gov/datadashboard/documentation/electronic-prescribing-adoption-use-data-documentation-county.php)) The problem is that it usually takes longer to get the data, so we pass the config option `connecttimeout` (which corresponds to the flag `--connect-timeout`) in the curl call (see next slide) --- ## Detour on CURL options (cont.) ```r ans <- httr::GET( url = "https://dashboard.healthit.gov/api/open-api.php", query = list( source = "AHA_2008-2015.csv", region = "California", period = 2015 ), config = config( connecttimeout = 60 ) ) ``` ```r > ans$request # <request> # GET https://dashboard.healthit.gov/api/open-api.php?source=AHA_2008-2015.csv®ion=California&period=2015 # Output: write_memory # Options: # * useragent: libcurl/7.58.0 r-curl/4.3 httr/1.4.1 # * connecttimeout: 60 # * httpget: TRUE # Headers: # * Accept: application/json, text/xml, application/xml, */* ``` --- ## Regular Expressions: Email validation This is the official regex for email validation implemented by [RCF 5322](http://www.ietf.org/rfc/rfc5322.txt) ``` (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08 \x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(? :[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[ 0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0 -9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\ x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]) ``` See the corresponding post in [StackOverflow](https://stackoverflow.com/a/201378/2097171)