HW10: Analyzing text data

Nov 16, 2022

Overview

Due by 11:59pm on November 22nd.

Accessing the `hw10` repository

Go here and find your copy of the `hw10` repository. It follows the naming convention `hw10-<USERNAME>`. Clone the repository to your computer.

Your mission

Perform text analysis.

Okay, I need more information

Perform sentiment analysis, classification, or topic modeling using text analysis methods as demonstrated in class and in the readings.

Okay, I need some data sources

Suggested text data sources include:

- gutenbergr
- Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts
- Data for Everyone - a bunch of open-source data sets. Some contain text data, such as New England Patriots Deflategate sentiment.
- Hate speech samples
- Last statements by Texas death row inmates
- Movie Review Data - good for sentiment analysis
- The musiXmatch Dataset
- Scrape tweets using rtweet (you know how to use the API now, right?)
- State of the Union speeches
  - sotu - R package with all State of the Union speeches through 2016. Easier starting point.
- Something from here (h/t Chris Bail)

How much do I really need to do?

Analyze the text for sentiment or topic, or build a statistical learning model that uses text features to predict some outcome of interest. You don’t have to do all of these things; just pick one. The lecture notes and Tidy Text Mining with R are good starting points for templates to perform this type of analysis, but feel free to expand beyond these examples.
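As a rough illustration of the smallest version of the sentiment option, the sketch below tokenizes text, drops stop words, and counts positive and negative words per document. It assumes the dplyr and tidytext packages are installed. The `docs` data frame and its contents are made up for illustration, and the Bing lexicon is only one possible choice. Treat it as a starting point, not the expected solution.

```r
library(dplyr)
library(tidytext)

# Toy corpus standing in for real data (hypothetical example text)
docs <- tibble(
  doc_id = c("a", "b"),
  text = c(
    "What a wonderful, happy day. I love this.",
    "This was a terrible and sad experience."
  )
)

# Pre-process: one lowercase word per row, then remove stop words
tokens <- docs %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")

# Match words against the Bing lexicon and count by document and sentiment
sentiment_counts <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(doc_id, sentiment)

sentiment_counts
```

Removing stop words here is a pre-processing decision you would explain in your write-up. If you keep numbers or stop words instead, say why.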

Submit the assignment

Your assignment should be submitted as a Quarto document using the gfm (GitHub Flavored Markdown) format. Include whatever is necessary to show your code and present your results. Follow the instructions on homework workflow.

Rubric

Needs improvement: Cannot get code to run or is poorly documented. Severe misinterpretations of the results. No effort is made to pre-process the text for analysis.¹

Satisfactory: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Excellent: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible (e.g., if analyzing tweets, you have stored a copy in a local file so I can exactly reproduce your results as well as run it on a new sample of tweets). Uses a sentiment analysis or topic model approach not directly covered in class.
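One way to meet the reproducibility criterion is to store the collected data in a local file the first time the code runs and read from that file afterward. The sketch below uses only base R. The path `data/raw_text.rds` and the function `fetch_text()` are hypothetical placeholders for your own file layout and collection step (an API call, a scrape, a download).

```r
# Hypothetical location for the cached copy
cache_path <- file.path("data", "raw_text.rds")

# Placeholder for the real collection step; returns a small made-up data frame
fetch_text <- function() {
  data.frame(
    id = 1:2,
    text = c("first example document", "second example document"),
    stringsAsFactors = FALSE
  )
}

if (file.exists(cache_path)) {
  # Reuse the stored copy so results are identical on every render
  raw_text <- readRDS(cache_path)
} else {
  # First run: collect the data and store it for later runs
  raw_text <- fetch_text()
  dir.create(dirname(cache_path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(raw_text, cache_path)
}
```

Deleting the cached file and rendering again would collect a fresh sample, which covers the "run it on a new sample" part of the criterion.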

¹ Or you provide no justification for keeping content such as numbers, stop words, etc.

Benjamin Soltoff

Lecturer in Information Science
