Assignment 03 - Text Mining

Due Date

This assignment is due by midnight Pacific Time, November 3rd, 2023.

Text Mining

A new dataset has been added to the data science data repository: https://github.com/USCbiostats/data-science-data/tree/master/03_pubmed. The dataset contains 3,241 abstracts from articles collected via 5 PubMed searches. The search terms are listed in the second column, term, and each search term will serve as a “document”: all abstracts returned by the same search are treated together. Your job is to analyse these abstracts to find interesting insights.

  1. Tokenize the abstracts and count the number of occurrences of each token. Do you see anything interesting? Does removing stop words change which tokens appear as the most frequent? What are the 5 most common tokens for each search term after removing stop words?
  2. Tokenize the abstracts into bigrams. Find the 10 most common bigrams and visualize them with ggplot2.
  3. Calculate the TF-IDF value for each combination of word and search term (here you want the search term to be the “document”). What are the 5 tokens from each search term with the highest TF-IDF value? How are the results different from the answers you got in question 1?
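
For reference, question 3 can be read against one common definition of TF-IDF, sketched below. The symbols t, d, n and N are notation introduced here and are not part of the assignment. The unsmoothed natural-log form is an assumption: it matches what tidytext's bind_tf_idf computes, but other tools use smoothed or rescaled variants, so check the documentation of whatever function you use.

```latex
% t = a token, d = a "document" (here, one search term)
% n_{t,d} = number of times token t occurs in document d
% N = total number of documents (here, N = 5 search terms)
\[
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}},
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\left|\{\, d : n_{t,d} > 0 \,\}\right|},
\qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\,\mathrm{idf}(t)
\]

% Two boundary cases with N = 5:
\[
\text{token in all 5 search terms: } \mathrm{idf} = \ln\tfrac{5}{5} = 0,
\qquad
\text{token in exactly 1 search term: } \mathrm{idf} = \ln\tfrac{5}{1} \approx 1.609
\]
```

Under this definition, a token that appears in the abstracts of all five search terms gets a TF-IDF of zero however frequent it is, which is worth keeping in mind when comparing your answers to questions 1 and 3.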