Assignment 03 - Text Mining
Due Date
This assignment is due by midnight Pacific Time, November 3rd, 2023.
Text Mining
A new dataset has been added to the data science data repository:
https://github.com/USCbiostats/data-science-data/tree/master/03_pubmed. The dataset contains 3,241 abstracts from articles collected via 5 PubMed searches. The search terms are listed in the second column, term, and these will serve as the “documents.” Your job is to analyze these abstracts to find interesting insights.
- Tokenize the abstracts and count how often each token occurs. Do you see anything interesting? Does removing stop words change which tokens appear as the most frequent? What are the 5 most common tokens for each search term after removing stop words?
- Tokenize the abstracts into bigrams. Find the 10 most common bigrams and visualize them with ggplot2.
- Calculate the TF-IDF value for each combination of word and search term (here you want the search term to be the “document”). What are the 5 tokens from each search term with the highest TF-IDF value? How are the results different from the answers you got in the first question?
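
To make the three steps above concrete, here is a minimal sketch on made-up data. It is written in plain Python only to spell out the arithmetic; it is not the expected solution, and the plotting step with ggplot2 is not covered. The toy rows, the tiny stop-word list, and all function names are hypothetical. The TF-IDF convention assumed here is tf = (count of the word in the document) / (total tokens in the document) and idf = ln(number of documents / number of documents containing the word); other tools may use a different variant.

```python
import math
import re
from collections import Counter

# Toy stand-in for the real dataset: (search term, abstract) pairs.
# These rows are invented for illustration only.
ROWS = [
    ("covid", "covid patients had fever and covid pneumonia"),
    ("covid", "the pandemic and covid vaccines"),
    ("asthma", "asthma patients and the inhaler"),
]

# A deliberately tiny, illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "and", "had", "of", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def counts_by_term(rows, drop_stop_words=False):
    """Count tokens per search term, treating each search term as one document."""
    counts = {}
    for term, abstract in rows:
        tokens = tokenize(abstract)
        if drop_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        counts.setdefault(term, Counter()).update(tokens)
    return counts


def tf_idf(counts):
    """Compute tf * idf for every (search term, word) pair."""
    n_docs = len(counts)
    # Number of documents (search terms) in which each word appears.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    scores = {}
    for term, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[term] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in word_counts.items()
        }
    return scores


counts = counts_by_term(ROWS)
scores = tf_idf(counts)
for term, word_scores in scores.items():
    top = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(term, top)
```

Under this convention, a word that occurs under every search term has idf = ln(1) = 0, so it scores zero no matter how frequent it is. That is one reason a TF-IDF ranking can look quite different from a ranking by raw counts.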