knitr::opts_chunk$set(eval = FALSE, include = TRUE)
Lab 6 - Text Mining
Learning goals
- Use unnest_tokens() and unnest_ngrams() to extract tokens and n-grams from text.
- Use dplyr and ggplot2 to analyze and visualize text data.
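As a quick reminder of the two tokenizers before we start, here is a minimal sketch on a toy tibble (the column names `text`, `word`, and `bigram` are just illustrative choices, not required names):

```r
library(dplyr)
library(tidytext)

toy <- tibble(text = "the patient reports mild chest pain")

# one row per word
toy |> unnest_tokens(word, text)

# one row per bigram (consecutive pair of words)
toy |> unnest_ngrams(bigram, text, n = 2)
```

Both functions take the output column name first and the input column second, and lowercase the text by default.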
Lab description
For this lab we will be working with a new text-based dataset consisting of transcribed medical reports. The dataset contains transcription samples from https://www.mtsamples.com/. We have created a (somewhat) cleaned version of the data, which can be found at https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv.
Setup packages
You should load in dplyr, ggplot2, and tidytext. If you don't already have tidytext, you can install it with:

install.packages("tidytext")
Read in Medical Transcriptions
Loading in the cleaned data from the USCbiostats/data-science-data
repository:
library(readr)
library(dplyr)
mt_samples <- read_csv("https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv")
mt_samples <- mt_samples |>
  select(description, medical_specialty, transcription)
head(mt_samples)
Question 1: What specialties do we have?
Use the count() function from dplyr to figure out how many different categories we have in the data. Are these categories related? Overlapping? Evenly distributed?
mt_samples %>%
  count(___, sort = TRUE)
Question 2
- Tokenize the words in the transcription column
- Count the number of times each token appears
- Visualize the top 20 most frequent words

Explain what we see from this result. Does it make sense? What insights (if any) do we get?
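One possible way to chain these three steps together (an untested sketch; the intermediate name `tokens` is our own choice):

```r
library(dplyr)
library(ggplot2)
library(tidytext)

tokens <- mt_samples |>
  unnest_tokens(word, transcription) |>  # one row per word
  count(word, sort = TRUE)               # frequency of each token

tokens |>
  slice_head(n = 20) |>                  # keep the 20 most frequent
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col()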
Question 3
- Re-do the visualization but remove stop words before making it
- Bonus points if you remove numbers as well
What do we see now that we have removed stop words? Does it give us a better idea of what the text is about?
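A sketch of the stop-word removal using the stop_words table that ships with tidytext, plus the bonus number filter (the regex and names here are one possible choice, not the only one):

```r
library(dplyr)
library(ggplot2)
library(stringr)
library(tidytext)

tokens_clean <- mt_samples |>
  unnest_tokens(word, transcription) |>
  anti_join(stop_words, by = "word") |>      # drop common stop words
  filter(!str_detect(word, "^[0-9]+$")) |>   # bonus: drop pure numbers
  count(word, sort = TRUE)

tokens_clean |>
  slice_head(n = 20) |>
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col()
```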
Question 4
Repeat question 2, but this time tokenize into bi-grams. How does the result change if you look at tri-grams?
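The tokenization step is the only part that changes; a hedged sketch using unnest_ngrams() with n = 2 and n = 3:

```r
library(dplyr)
library(tidytext)

bigrams <- mt_samples |>
  unnest_ngrams(ngram, transcription, n = 2) |>
  count(ngram, sort = TRUE)

trigrams <- mt_samples |>
  unnest_ngrams(ngram, transcription, n = 3) |>
  count(ngram, sort = TRUE)
```

Note that stop words are harder to handle for n-grams: you would need to split each n-gram into its component words before filtering.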
Question 5
Using the results you got from Question 4, pick a word and count the words that appear before and after it.
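One way to do this is to split each bigram into its two words with tidyr::separate() and then filter on either position. Here "blood" is only an example word; pick whatever word interests you:

```r
library(dplyr)
library(tidyr)
library(tidytext)

bigrams_sep <- mt_samples |>
  unnest_ngrams(ngram, transcription, n = 2) |>
  separate(ngram, into = c("word1", "word2"), sep = " ")

# words that appear AFTER the chosen word
bigrams_sep |>
  filter(word1 == "blood") |>
  count(word2, sort = TRUE)

# words that appear BEFORE the chosen word
bigrams_sep |>
  filter(word2 == "blood") |>
  count(word1, sort = TRUE)
```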
Question 6
Which words are most used in each of the specialties? You can use group_by() and top_n() from dplyr so that the calculations are done within each specialty. Remember to remove stop words. What are the 5 most-used words in each specialty?
Question 7 - extra
Find your own insight in the data:
Ideas:
- Interesting n-grams
- See if certain words are used more in some specialties than others
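For the second idea, one standard way to find words that are distinctive to a specialty (rather than merely frequent) is tf-idf, which tidytext provides via bind_tf_idf(). This is only a suggested starting point, not a required part of the lab:

```r
library(dplyr)
library(tidytext)

mt_samples |>
  unnest_tokens(word, transcription) |>
  anti_join(stop_words, by = "word") |>
  count(medical_specialty, word) |>
  bind_tf_idf(word, medical_specialty, n) |>   # term, document, count
  group_by(medical_specialty) |>
  slice_max(tf_idf, n = 5)
```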