Assignment 03 - Text Mining
Due Date
This assignment is due by 11:59pm Pacific Time, November 8th, 2024.
Text Mining
A new dataset has been added to the data science data repository: https://github.com/USCbiostats/data-science-data/tree/master/03_pubmed. The dataset contains 3,241 abstracts from articles collected via 5 PubMed searches. The search terms are listed in the second column, term, and these will serve as the “documents.” Your job is to analyse these abstracts to find interesting insights.
- Tokenize the abstracts and count the occurrences of each token. Do you see anything interesting? Does removing stop words change which tokens appear most frequently? What are the 5 most common tokens for each search term after removing stop words?
- Tokenize the abstracts into bigrams. Find the 10 most common bigrams and visualize them with ggplot2.
- Calculate the TF-IDF value for each word-search term combination (here you want the search term to be the “document”). What are the 5 tokens from each search term with the highest TF-IDF value? How do these results differ from your answers to the first question?
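The three steps above (token counts, bigrams, TF-IDF) can be sketched on a toy example. This is only an illustration of the computations, not a solution: the assignment calls for ggplot2, so R is the expected tool, and everything below is an assumption made for the sketch. The three abstracts, the tiny stop-word list, and the helper names (tokenize, bigrams, counts_by_term, tf_idf) are all hypothetical. TF-IDF here uses one common definition: term frequency as a share of the document's tokens, multiplied by the natural log of (number of documents / number of documents containing the token).

```python
import math
import re
from collections import Counter

# Hypothetical toy data standing in for the real file: (search term, abstract).
ABSTRACTS = [
    ("covid", "The covid pandemic increased the number of hospital admissions."),
    ("covid", "Patients with covid were treated in the hospital."),
    ("meningitis", "Bacterial meningitis is treated with antibiotics in the hospital."),
]

# Tiny illustrative stop-word list (assumption); a real analysis needs a full one.
STOP_WORDS = {"the", "of", "with", "were", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def counts_by_term(abstracts, drop_stop_words=False):
    """Count tokens per search term, pooling all abstracts of that term."""
    counts = {}
    for term, text in abstracts:
        tokens = tokenize(text)
        if drop_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        counts.setdefault(term, Counter()).update(tokens)
    return counts


def tf_idf(counts):
    """Compute tf * ln(N / df), treating each search term as one document."""
    n_docs = len(counts)
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())  # each document counts once per token
    scores = {}
    for term, counter in counts.items():
        total = sum(counter.values())
        scores[term] = {
            tok: (n / total) * math.log(n_docs / doc_freq[tok])
            for tok, n in counter.items()
        }
    return scores


word_counts = counts_by_term(ABSTRACTS, drop_stop_words=True)
bigram_counts = Counter(bg for _, text in ABSTRACTS for bg in bigrams(tokenize(text)))
scores = tf_idf(counts_by_term(ABSTRACTS))

for term, term_scores in scores.items():
    top = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(term, top)
```

In this toy data, a token that occurs under every search term (such as "hospital") gets a TF-IDF of zero, which is why TF-IDF rankings can look quite different from raw frequency rankings.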
Sentiment Analysis
- Perform a sentiment analysis using the NRC lexicon. What is the most common sentiment for each search term? What if you remove "positive" and "negative" from the list?
- Now perform a sentiment analysis using the AFINN lexicon to get an average positivity score for each abstract (hint: you may want to create a variable that indexes, or counts, the abstracts). Create a visualization that shows these scores grouped by search term. Are any search terms noticeably different from the others?
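The two sentiment questions can be sketched as lexicon lookups on a toy example. This is an illustration under stated assumptions, not a solution: the abstracts are made up, and NRC_LIKE and AFINN_LIKE are hypothetical stand-ins that only mimic the shape of the real lexicons (NRC maps a word to one or more sentiment categories; AFINN maps a word to an integer score from -5 to 5). Their entries are not real lexicon values. The average for an abstract is taken over the words that appear in the lexicon, and abstracts with no matching words are skipped.

```python
import re
from collections import Counter

# Hypothetical toy data: (search term, abstract).
ABSTRACTS = [
    ("covid", "Severe infection caused death and fear, but the vaccine offers hope."),
    ("covid", "Recovery was good after treatment."),
    ("meningitis", "Severe pain and fear were reported."),
]

# Made-up entries in the style of NRC (word -> categories). NOT real lexicon values.
NRC_LIKE = {
    "severe": ["negative"],
    "death": ["negative", "fear", "sadness"],
    "fear": ["negative", "fear"],
    "pain": ["negative", "fear", "sadness"],
    "hope": ["positive", "joy"],
    "good": ["positive", "trust"],
}

# Made-up entries in the style of AFINN (word -> integer score). NOT real values.
AFINN_LIKE = {"severe": -2, "death": -2, "fear": -2, "pain": -2, "hope": 2, "good": 3}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def most_common_sentiment(abstracts, exclude=()):
    """Return the most frequent sentiment category for each search term."""
    counts = {}
    for term, text in abstracts:
        for token in tokenize(text):
            for sentiment in NRC_LIKE.get(token, []):
                if sentiment not in exclude:
                    counts.setdefault(term, Counter())[sentiment] += 1
    return {term: c.most_common(1)[0][0] for term, c in counts.items()}


def average_scores(abstracts):
    """Return (abstract index, search term, mean score) for each scored abstract."""
    rows = []
    for index, (term, text) in enumerate(abstracts):
        matched = [AFINN_LIKE[t] for t in tokenize(text) if t in AFINN_LIKE]
        if matched:  # skip abstracts with no lexicon words
            rows.append((index, term, sum(matched) / len(matched)))
    return rows


all_sentiments = most_common_sentiment(ABSTRACTS)
without_polarity = most_common_sentiment(ABSTRACTS, exclude={"positive", "negative"})
afinn_scores = average_scores(ABSTRACTS)

print(all_sentiments)
print(without_polarity)
print(afinn_scores)
```

The index produced by enumerate plays the role of the abstract counter mentioned in the hint: it keeps scores from different abstracts under the same search term separate, so the per-abstract averages can then be grouped by search term for plotting.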