class: center, middle, inverse, title-slide

.title[
# Text analysis: classification and topic modeling
]
.author[
### INFO 5940 <br /> Cornell University
]

---
class: inverse, middle

# Supervised text classification

---

## Supervised learning

1. Hand-code a small set of documents `\(N = 1,000\)`
1. Train a statistical learning model on the hand-coded data
1. Evaluate the effectiveness of the statistical learning model
1. Apply the final model to the remaining set of documents `\(N = 1,000,000\)`

---

## `USCongress`

```
## Rows: 4,449
## Columns: 7
## $ ID       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ cong     <dbl> 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 1…
## $ billnum  <dbl> 4499, 4500, 4501, 4502, 4503, 4504, 4505, 4506, 4507, 4508, 4…
## $ h_or_sen <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "…
## $ major    <dbl> 18, 18, 18, 18, 5, 21, 15, 18, 18, 18, 18, 16, 18, 12, 2, 3, …
## $ text     <chr> "To suspend temporarily the duty on Fast Magenta 2 Stage.", "…
## $ label    <fct> "Foreign trade", "Foreign trade", "Foreign trade", "Foreign t…
```

```
## [1] "To suspend temporarily the duty on Fast Magenta 2 Stage."
## [2] "To suspend temporarily the duty on Fast Black 286 Stage."
## [3] "To suspend temporarily the duty on mixtures of Fluazinam."
## [4] "To reduce temporarily the duty on Prodiamine Technical."
## [5] "To amend the Immigration and Nationality Act in regard to Caribbean-born immigrants."
## [6] "To amend title 38, United States Code, to extend the eligibility for housing loans guaranteed by the Secretary of Veterans Affairs under the Native American Housing Loan Pilot Program to veterans who are married to Native Americans."
```

---

## Split the data set

```r
set.seed(123)

# convert response variable to factor
# levels and labels must be unique: one entry per major topic code
congress <- congress %>%
  mutate(major = factor(x = major, levels = unique(major), labels = unique(label)))

# split into training and testing sets
congress_split <- initial_split(data = congress, strata = major, prop = .8)
congress_split
## <Training/Testing/Total>
## <3558/891/4449>

congress_train <- training(congress_split)
congress_test <- testing(congress_split)

# generate cross-validation folds
congress_folds <- vfold_cv(data = congress_train, strata = major)
```

---

## Class imbalance

<img src="index_files/figure-html/major-topic-dist-1.png" width="80%" style="display: block; margin: auto;" />

---

## Preprocessing the data frame

```r
congress_rec <- recipe(major ~ text, data = congress_train)
```

```r
library(textrecipes)
library(themis) # provides step_downsample()

congress_rec <- congress_rec %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  step_downsample(major)
```

---

## Define the model

```r
tree_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("C5.0")

tree_spec
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
```

---

## Train the model

```r
tree_wf <- workflow() %>%
  add_recipe(congress_rec) %>%
  add_model(tree_spec)
```

```r
set.seed(123)

tree_cv <- fit_resamples(
  tree_wf,
  congress_folds,
  control = control_resamples(save_pred = TRUE)
)
```

```r
tree_cv_metrics <- collect_metrics(tree_cv)
tree_cv_predictions <- collect_predictions(tree_cv)

tree_cv_metrics
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.432    10 0.00689 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.766    10 0.00706 Preprocessor1_Model1
```

---

## Confusion matrix

<img src="index_files/figure-html/tree-confusion-1.png" width="80%" style="display: block; margin: auto;" />
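---

## Apply the final model

Cross-validation above only estimates performance; the fourth step of the supervised learning workflow still needs one fit on the full training set and one evaluation on the held-out test set. A minimal sketch of that step with `last_fit()`, assuming the `tree_wf` and `congress_split` objects defined earlier (illustrative; not part of the original analysis output):

```r
# fit once on the training set, evaluate once on the test set
tree_final <- last_fit(tree_wf, split = congress_split)

# test-set metrics and predictions
collect_metrics(tree_final)
collect_predictions(tree_final)
```

The fitted workflow can then be recovered with `extract_workflow(tree_final)` and passed to `predict()` to label the remaining un-coded documents.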
---

# Name That Tune!

.pull-left[

<img src="https://media.giphy.com/media/10JbbHzFsBpg40/giphy.gif" width="80%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="https://media.giphy.com/media/7SKWbnycqb2Pze62Zk/giphy.gif" width="80%" style="display: block; margin: auto;" />

]
---
class: inverse, middle

# Topic modeling

---

## Topic modeling

* Themes
* Probabilistic topic models
* Latent Dirichlet allocation

---

## Food and animals

1. I ate a banana and spinach smoothie for breakfast.
1. I like to eat broccoli and bananas.
1. Chinchillas and kittens are cute.
1. My sister adopted a kitten yesterday.
1. Look at this cute hamster munching on a piece of broccoli.

---

## LDA document structure

* Decide on the number of words `\(N\)` the document will have
* [Dirichlet probability distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution)
* Fixed set of `\(k\)` topics
* Generate each word in the document:
    * Pick a topic
    * Generate the word
* LDA backtracks from this assumption

---

## `appa`

<img src="../../../../../../../../img/appa-avatar.gif" width="65%" style="display: block; margin: auto;" />

---

## `appa`

```r
remotes::install_github("averyrobbins1/appa")
```

```r
library(appa)
data("appa")

glimpse(appa)
## Rows: 13,385
## Columns: 12
## $ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ book              <fct> Water, Water, Water, Water, Water, Water, Water, Wat…
## $ book_num          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ chapter           <fct> "The Boy in the Iceberg", "The Boy in the Iceberg", …
## $ chapter_num       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ character         <chr> "Katara", "Scene Description", "Sokka", "Scene Descr…
## $ full_text         <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
## $ character_words   <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
## $ scene_description <list> <>, <>, "[Close-up of the boy as he grins confident…
## $ writer            <chr> "Michael Dante DiMartino, Bryan Konietzko, Aaron Eha…
## $ director          <chr> "Dave Filoni", "Dave Filoni", "Dave Filoni", "Dave F…
## $ imdb_rating       <dbl> 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.…
```

---

## Create the recipe

```r
appa_rec <- recipe(~ id + character_words, data = appa) %>%
  step_tokenize(character_words) %>%
  step_stopwords(character_words, stopword_source = "smart") %>%
  step_ngram(character_words, num_tokens = 5, min_num_tokens = 1) %>%
  step_tokenfilter(character_words, max_tokens = 5000) %>%
  step_tf(character_words)
```

---

## Bake the recipe

```r
appa_prep <- prep(appa_rec)

appa_df <- bake(appa_prep, new_data = NULL)
appa_df %>%
  slice(1:5)
## # A tibble: 5 × 5,000
##      id tf_cha…¹ tf_ch…² tf_ch…³ tf_ch…⁴ tf_ch…⁵ tf_ch…⁶ tf_ch…⁷ tf_ch…⁸ tf_ch…⁹
##   <int>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     1        0       0       0       0       0       0       0       0       0
## 2     2        0       0       0       0       0       0       0       0       0
## 3     3        0       0       0       0       0       0       0       0       0
## 4     4        0       0       0       0       0       0       0       0       0
## 5     5        0       0       0       0       0       0       0       0       0
## # … with 4,990 more variables: tf_character_words_aang_aang_aang <dbl>,
## #   tf_character_words_aang_aang_aang_aang <dbl>,
## #   tf_character_words_aang_aang_aang_aang_aang <dbl>,
## #   tf_character_words_aang_airbending <dbl>,
## #   tf_character_words_aang_avatar <dbl>, tf_character_words_aang_back <dbl>,
## #   tf_character_words_aang_big <dbl>, tf_character_words_aang_coming <dbl>,
## #   tf_character_words_aang_dad <dbl>, …
```

---

## Convert to document-term matrix

.tiny[

```r
appa_dtm_prep <- appa_df %>%
  # convert to long format
  pivot_longer(
    cols = -id,
    names_to = "token",
    values_to = "n"
  ) %>%
  # remove tokens with 0
  filter(n != 0) %>%
  # clean the token column so it just includes the token
  mutate(
    token = str_remove(string = token, pattern = "tf_character_words_")
  )

# id must be consecutive with no gaps
appa_new_id <- appa_dtm_prep %>%
  distinct(id) %>%
  mutate(new_id = row_number())

# create document-term matrix
appa_dtm <- left_join(x = appa_dtm_prep, y = appa_new_id) %>%
  cast_dtm(document = new_id, term = token, value = n)
appa_dtm
## <<DocumentTermMatrix (documents: 8822, terms: 4999)>>
## Non-/sparse entries: 40408/44060770
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
```

]

---

## `\(k=4\)`

```r
appa_lda4 <- LDA(appa_dtm, k = 4, control = list(seed = 123))
```

<img src="index_files/figure-html/appa-4-topn-1.png" width="70%" style="display: block; margin: auto;" />
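---

## Extracting top terms per topic

Ranked-term plots like the one above can be built by tidying the fitted model. A minimal sketch using `tidytext::tidy()` on the `appa_lda4` object from the previous slide (illustrative; assumes the tidyverse and tidytext are loaded, and is not part of the original deck):

```r
library(tidytext)

# per-topic word probabilities (the beta matrix)
appa_topics <- tidy(appa_lda4, matrix = "beta")

# ten highest-probability terms within each topic
appa_top_terms <- appa_topics %>%
  group_by(topic) %>%
  slice_max(order_by = beta, n = 10) %>%
  ungroup()

# one panel per topic, with terms reordered within each panel
ggplot(data = appa_top_terms,
       mapping = aes(x = beta, y = reorder_within(term, beta, topic))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(facets = vars(topic), scales = "free")
```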
---

## `\(k=12\)`

<img src="index_files/figure-html/appa-12-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

## Perplexity

* A statistical measure of how well a probability model predicts a sample
* Compares the theoretical word distributions represented by the topics to the actual distribution of words in the documents
* Perplexity for the LDA model with 12 topics: 1373.95

---

## Perplexity

<img src="index_files/figure-html/appa_lda_compare_viz-1.png" width="80%" style="display: block; margin: auto;" />

---

## `\(k=100\)`

<img src="index_files/figure-html/appa-100-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

## LDAvis

* Interactive visualization of LDA model results

1. What is the meaning of each topic?
1. How prevalent is each topic?
1. How do the topics relate to each other?
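---

## LDAvis: preparing the inputs

One way to wire a `topicmodels` fit into LDAvis, sketched under the assumption of a 12-topic model like the one plotted earlier (`appa_lda12` is hypothetical; this chunk is not part of the original deck):

```r
library(topicmodels)
library(LDAvis)

# assumed 12-topic fit, comparable to the earlier plots
appa_lda12 <- LDA(appa_dtm, k = 12, control = list(seed = 123))

# estimated topic-term (phi) and document-topic (theta) probability matrices
post <- posterior(appa_lda12)

json <- createJSON(
  phi = post$terms,                           # topics x terms
  theta = post$topics,                        # documents x topics
  doc.length = slam::row_sums(appa_dtm),      # words per document
  vocab = colnames(appa_dtm),                 # term labels
  term.frequency = slam::col_sums(appa_dtm)   # corpus-wide term counts
)

# launch the interactive browser view
serVis(json)
```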