class: center, middle, inverse, title-slide

.title[
# Text analysis: classification and topic modeling
]
.author[
### INFO 5940 <br /> Cornell University
]

---
class: inverse, middle

# Supervised text classification

---

## Supervised learning

1. Hand-code a small set of documents `\(N = 1,000\)`
1. Train a statistical learning model on the hand-coded data
1. Evaluate the effectiveness of the statistical learning model
1. Apply the final model to the remaining set of documents `\(N = 1,000,000\)`

---

## `USCongress`

```
## Rows: 4,449
## Columns: 7
## $ ID       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ cong     <dbl> 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 1…
## $ billnum  <dbl> 4499, 4500, 4501, 4502, 4503, 4504, 4505, 4506, 4507, 4508, 4…
## $ h_or_sen <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "…
## $ major    <dbl> 18, 18, 18, 18, 5, 21, 15, 18, 18, 18, 18, 16, 18, 12, 2, 3, …
## $ text     <chr> "To suspend temporarily the duty on Fast Magenta 2 Stage.", "…
## $ label    <fct> "Foreign trade", "Foreign trade", "Foreign trade", "Foreign t…
```

```
## [1] "To suspend temporarily the duty on Fast Magenta 2 Stage."
## [2] "To suspend temporarily the duty on Fast Black 286 Stage."
## [3] "To suspend temporarily the duty on mixtures of Fluazinam."
## [4] "To reduce temporarily the duty on Prodiamine Technical."
## [5] "To amend the Immigration and Nationality Act in regard to Caribbean-born immigrants."
## [6] "To amend title 38, United States Code, to extend the eligibility for housing loans guaranteed by the Secretary of Veterans Affairs under the Native American Housing Loan Pilot Program to veterans who are married to Native Americans."
```

---

## Split the data set

```r
set.seed(123)

# convert response variable to factor
# levels and labels must be unique: one entry per major topic code
congress <- congress %>%
  mutate(major = factor(x = major, levels = unique(major), labels = unique(label)))

# split into training and testing sets
congress_split <- initial_split(data = congress, strata = major, prop = .8)
congress_split
## <Training/Testing/Total>
## <3558/891/4449>

congress_train <- training(congress_split)
congress_test <- testing(congress_split)

# generate cross-validation folds
congress_folds <- vfold_cv(data = congress_train, strata = major)
```

---

## Class imbalance

<img src="index_files/figure-html/major-topic-dist-1.png" width="80%" style="display: block; margin: auto;" />

---

## Preprocessing the data frame

```r
congress_rec <- recipe(major ~ text, data = congress_train)
```

```r
library(textrecipes)
library(themis) # provides step_downsample()

congress_rec <- congress_rec %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  step_downsample(major)
```

---

## Define the model

```r
tree_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("C5.0")

tree_spec
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
```

---

## Train the model

```r
tree_wf <- workflow() %>%
  add_recipe(congress_rec) %>%
  add_model(tree_spec)
```

```r
set.seed(123)

tree_cv <- fit_resamples(
  tree_wf,
  congress_folds,
  control = control_resamples(save_pred = TRUE)
)
```

```r
tree_cv_metrics <- collect_metrics(tree_cv)
tree_cv_predictions <- collect_predictions(tree_cv)

tree_cv_metrics
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.432    10 0.00689 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.766    10 0.00706 Preprocessor1_Model1
```

---

## Confusion matrix

<img src="index_files/figure-html/tree-confusion-1.png" width="80%" style="display: block; margin: auto;" />
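---

## Apply the final model

Cross-validation above only estimates performance; the fourth step of the supervised learning workflow still needs one fit on the full training set and one evaluation on the held-out test set. A minimal sketch of that step with `last_fit()`, assuming the `tree_wf` and `congress_split` objects defined earlier (illustrative; not part of the original analysis output):

```r
# fit once on the training set, evaluate once on the test set
tree_final <- last_fit(tree_wf, split = congress_split)

# test-set metrics and predictions
collect_metrics(tree_final)
collect_predictions(tree_final)
```

The fitted workflow can then be recovered with `extract_workflow(tree_final)` and passed to `predict()` to label the remaining un-coded documents.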
---

# Name That Tune!

.pull-left[

<img src="https://media.giphy.com/media/10JbbHzFsBpg40/giphy.gif" width="80%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="https://media.giphy.com/media/7SKWbnycqb2Pze62Zk/giphy.gif" width="80%" style="display: block; margin: auto;" />

]
---
class: inverse, middle

# Topic modeling

---

## Topic modeling

* Themes
* Probabilistic topic models
* Latent Dirichlet allocation

---

## Food and animals

1. I ate a banana and spinach smoothie for breakfast.
1. I like to eat broccoli and bananas.
1. Chinchillas and kittens are cute.
1. My sister adopted a kitten yesterday.
1. Look at this cute hamster munching on a piece of broccoli.

---

## LDA document structure

* Decide on the number of words `\(N\)` the document will have
* [Dirichlet probability distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution)
* Fixed set of `\(k\)` topics
* Generate each word in the document:
    * Pick a topic
    * Generate the word
* LDA backtracks from this assumption

---

## `appa`

<img src="../../../../../../../../img/appa-avatar.gif" width="65%" style="display: block; margin: auto;" />

---

## `appa`

```r
remotes::install_github("averyrobbins1/appa")
```

```r
library(appa)
data("appa")

glimpse(appa)
## Rows: 13,385
## Columns: 12
## $ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ book              <fct> Water, Water, Water, Water, Water, Water, Water, Wat…
## $ book_num          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ chapter           <fct> "The Boy in the Iceberg", "The Boy in the Iceberg", …
## $ chapter_num       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ character         <chr> "Katara", "Scene Description", "Sokka", "Scene Descr…
## $ full_text         <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
## $ character_words   <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
## $ scene_description <list> <>, <>, "[Close-up of the boy as he grins confident…
## $ writer            <chr> "Michael Dante DiMartino, Bryan Konietzko, Aaron Eha…
## $ director          <chr> "Dave Filoni", "Dave Filoni", "Dave Filoni", "Dave F…
## $ imdb_rating       <dbl> 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.…
```

---

## Create the recipe

```r
appa_rec <- recipe(~ id + character_words, data = appa) %>%
  step_tokenize(character_words) %>%
  step_stopwords(character_words, stopword_source = "smart") %>%
  step_ngram(character_words, num_tokens = 5, min_num_tokens = 1) %>%
  step_tokenfilter(character_words, max_tokens = 5000) %>%
  step_tf(character_words)
```

---

## Bake the recipe

```r
appa_prep <- prep(appa_rec)

appa_df <- bake(appa_prep, new_data = NULL)
appa_df %>%
  slice(1:5)
## # A tibble: 5 × 5,000
##      id tf_cha…¹ tf_ch…² tf_ch…³ tf_ch…⁴ tf_ch…⁵ tf_ch…⁶ tf_ch…⁷ tf_ch…⁸ tf_ch…⁹
##   <int>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     1        0       0       0       0       0       0       0       0       0
## 2     2        0       0       0       0       0       0       0       0       0
## 3     3        0       0       0       0       0       0       0       0       0
## 4     4        0       0       0       0       0       0       0       0       0
## 5     5        0       0       0       0       0       0       0       0       0
## # … with 4,990 more variables: tf_character_words_aang_aang_aang <dbl>,
## #   tf_character_words_aang_aang_aang_aang <dbl>,
## #   tf_character_words_aang_aang_aang_aang_aang <dbl>,
## #   tf_character_words_aang_airbending <dbl>,
## #   tf_character_words_aang_avatar <dbl>, tf_character_words_aang_back <dbl>,
## #   tf_character_words_aang_big <dbl>, tf_character_words_aang_coming <dbl>,
## #   tf_character_words_aang_dad <dbl>, …
```

---

## Convert to document-term matrix

.tiny[

```r
appa_dtm_prep <- appa_df %>%
  # convert to long format
  pivot_longer(
    cols = -id,
    names_to = "token",
    values_to = "n"
  ) %>%
  # remove tokens with 0
  filter(n != 0) %>%
  # clean the token column so it just includes the token
  mutate(
    token = str_remove(string = token, pattern = "tf_character_words_")
  )

# id must be consecutive with no gaps
appa_new_id <- appa_dtm_prep %>%
  distinct(id) %>%
  mutate(new_id = row_number())

# create document-term matrix
appa_dtm <- left_join(x = appa_dtm_prep, y = appa_new_id) %>%
  cast_dtm(document = new_id, term = token, value = n)
appa_dtm
## <<DocumentTermMatrix (documents: 8822, terms: 4999)>>
## Non-/sparse entries: 40408/44060770
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
```

]

---

## `\(k=4\)`

```r
appa_lda4 <- LDA(appa_dtm, k = 4, control = list(seed = 123))
```

<img src="index_files/figure-html/appa-4-topn-1.png" width="70%" style="display: block; margin: auto;" />
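---

## Extracting top terms per topic

Ranked-term plots like the one above can be built by tidying the fitted model. A minimal sketch using `tidytext::tidy()` on the `appa_lda4` object from the previous slide (illustrative; assumes the tidyverse and tidytext are loaded, and is not part of the original deck):

```r
library(tidytext)

# per-topic word probabilities (the beta matrix)
appa_topics <- tidy(appa_lda4, matrix = "beta")

# ten highest-probability terms within each topic
appa_top_terms <- appa_topics %>%
  group_by(topic) %>%
  slice_max(order_by = beta, n = 10) %>%
  ungroup()

# one panel per topic, with terms reordered within each panel
ggplot(data = appa_top_terms,
       mapping = aes(x = beta, y = reorder_within(term, beta, topic))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(facets = vars(topic), scales = "free")
```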
---

## `\(k=12\)`

<img src="index_files/figure-html/appa-12-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

## Perplexity

* A statistical measure of how well a probability model predicts a sample
* Compares the theoretical word distributions represented by the topics to the actual distribution of words in the documents
* Perplexity for the LDA model with 12 topics: 1373.95

---

## Perplexity

<img src="index_files/figure-html/appa_lda_compare_viz-1.png" width="80%" style="display: block; margin: auto;" />

---

## `\(k=100\)`

<img src="index_files/figure-html/appa-100-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

## LDAvis

* Interactive visualization of LDA model results

1. What is the meaning of each topic?
1. How prevalent is each topic?
1. How do the topics relate to each other?
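---

## LDAvis: preparing the inputs

One way to wire a `topicmodels` fit into LDAvis, sketched under the assumption of a 12-topic model like the one plotted earlier (`appa_lda12` is hypothetical; this chunk is not part of the original deck):

```r
library(topicmodels)
library(LDAvis)

# assumed 12-topic fit, comparable to the earlier plots
appa_lda12 <- LDA(appa_dtm, k = 12, control = list(seed = 123))

# estimated topic-term (phi) and document-topic (theta) probability matrices
post <- posterior(appa_lda12)

json <- createJSON(
  phi = post$terms,                           # topics x terms
  theta = post$topics,                        # documents x topics
  doc.length = slam::row_sums(appa_dtm),      # words per document
  vocab = colnames(appa_dtm),                 # term labels
  term.frequency = slam::col_sums(appa_dtm)   # corpus-wide term counts
)

# launch the interactive browser view
serVis(json)
```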