class: center, middle, inverse, title-slide .title[ # Text analysis: fundamentals and sentiment analysis ] .author[ ### INFO 5940
Cornell University ] --- class: inverse, middle # Core text data workflows --- ## Basic workflow for text analysis * Obtain your text sources * Extract documents and move into a corpus * Transformation * Extract features * Perform analysis --- ## Obtain your text sources * Web sites * Twitter * Databases * PDF documents * Digital scans of printed materials --- ## Extract documents and move into a corpus * Text corpus * Typically stores each text as a raw character string, with metadata and other details stored alongside it --- ## Transformation * Tag tokens for part-of-speech (nouns, verbs, adjectives, etc.) or named-entity recognition (person, place, company, etc.) * Standard text processing * Convert to lower case * Remove punctuation * Remove numbers * Remove stopwords * Remove domain-specific stopwords * Stemming --- ## Extract features * Convert the text string into quantifiable measures * Bag-of-words model * Term frequency vector * Term-document matrix * Ignores context * Word embeddings --- ## Word embeddings <img src="https://blogs.mathworks.com/images/loren/2017/vecs.png" width="80%" style="display: block; margin: auto;" /> --- ## Perform analysis * Basic * Word frequency * Collocation * Dictionary tagging * Advanced * Document classification * Corpora comparison * Topic modeling --- class: inverse, middle # Wrangling text data with `tidytext` --- ## [`tidytext`](https://github.com/juliasilge/tidytext) * Tidy text format * Defined as one-term-per-row * Differs from the term-document matrix * One-document-per-row and one-term-per-column --- ## Get text corpus ``` ## # A tibble: 73,422 × 4 ## text book linenumber chapter ## <chr> <fct> <int> <int> ## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0 ## 2 "" Sense & Sensibility 2 0 ## 3 "by Jane Austen" Sense & Sensibility 3 0 ## 4 "" Sense & Sensibility 4 0 ## 5 "(1811)" Sense & Sensibility 5 0 ## 6 "" Sense & Sensibility 6 0 ## 7 "" Sense & Sensibility 7 0 ## 8 "" Sense & Sensibility 8 0 ## 9 "" 
Sense & Sensibility 9 0 ## 10 "CHAPTER 1" Sense & Sensibility 10 1 ## # … with 73,412 more rows ``` --- ## Tokenize text ```r (tidy_books <- books %>% unnest_tokens(output = word, input = text)) ``` ``` ## # A tibble: 725,055 × 4 ## book linenumber chapter word ## <fct> <int> <int> <chr> ## 1 Sense & Sensibility 1 0 sense ## 2 Sense & Sensibility 1 0 and ## 3 Sense & Sensibility 1 0 sensibility ## 4 Sense & Sensibility 3 0 by ## 5 Sense & Sensibility 3 0 jane ## 6 Sense & Sensibility 3 0 austen ## 7 Sense & Sensibility 5 0 1811 ## 8 Sense & Sensibility 10 1 chapter ## 9 Sense & Sensibility 10 1 1 ## 10 Sense & Sensibility 13 1 the ## # … with 725,045 more rows ``` --- ## Practice using `tidytext` .task[How often is each U.S. state mentioned in a popular song?] * Billboard Year-End Hot 100 (1958-present) * Census Bureau ACS --- ## Song lyrics * [Reference](https://youtu.be/OPf0YbXqDm0?t=91) ``` ## this hit that ice cold michelle pfeiffer that white gold this one for them hood ## girls them good girls straight masterpieces stylin whilen livin it up in the ## city got chucks on with saint laurent got kiss myself im so prettyim too hot ## hot damn called a police and a fireman im too hot hot damn make a dragon wanna ## retire man im too hot hot damn say my name you know who i am im too hot hot damn ## am i bad bout that money break it downgirls hit your hallelujah whoo girls hit ## your hallelujah whoo girls hit your hallelujah whoo cause uptown funk gon give ## it to you cause uptown funk gon give it to you cause uptown funk gon give it ## to you saturday night and we in the spot dont believe me just watch come ondont ## believe me just watch uhdont believe me just watch dont believe me just watch ## dont believe me just watch dont believe me just watch hey hey hey oh meaning ## byamandah editor 70s girl group the sequence accused bruno mars and producer ## mark ronson of ripping their sound off in uptown funk their song in question is ## funk you see all stop wait a 
minute fill my cup put some liquor in it take a sip ## sign a check julio get the stretch ride to harlem hollywood jackson mississippi ## if we show up we gon show out smoother than a fresh jar of skippyim too hot ## hot damn called a police and a fireman im too hot hot damn make a dragon wanna ## retire man im too hot hot damn bitch say my name you know who i am im too hot ## hot damn am i bad bout that money break it downgirls hit your hallelujah whoo ## girls hit your hallelujah whoo girls hit your hallelujah whoo cause uptown funk ## gon give it to you cause uptown funk gon give it to you cause uptown funk gon ## give it to you saturday night and we in the spot dont believe me just watch ## come ondont believe me just watch uhdont believe me just watch uh dont believe ## me just watch uh dont believe me just watch dont believe me just watch hey hey ## hey ohbefore we leave lemmi tell yall a lil something uptown funk you up uptown ## funk you up uptown funk you up uptown funk you up uh i said uptown funk you up ## uptown funk you up uptown funk you up uptown funk you upcome on dance jump on ## it if you sexy then flaunt it if you freaky then own it dont brag about it come ## show mecome on dance jump on it if you sexy then flaunt it well its saturday ## night and we in the spot dont believe me just watch come ondont believe me just ## watch uhdont believe me just watch uh dont believe me just watch uh dont believe ## me just watch dont believe me just watch hey hey hey ohuptown funk you up uptown ## funk you up say what uptown funk you up uptown funk you up uptown funk you up ## uptown funk you up say what uptown funk you up uptown funk you up uptown funk ## you up uptown funk you up say what uptown funk you up uptown funk you up uptown ## funk you up uptown funk you up say what uptown funk you up ``` --- <img src="https://media.giphy.com/media/2aJM3TUEaY9Yz166e8/giphy.gif" width="80%" style="display: block; margin: auto;" />
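---

## Sketch: counting state mentions

One possible way to approach the practice task, shown on toy data rather than the Billboard lyrics. The `lyrics` tibble and its column names are hypothetical stand-ins. Single-word tokens catch states such as Mississippi, and bigrams catch two-word states such as New York. This is a rough sketch, not the course solution: it would, for example, also count "virginia" inside "west virginia".

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy lyrics; the real exercise uses the Billboard Hot 100 data
lyrics <- tibble(
  song = c("song_a", "song_b"),
  text = c(
    "driving from new york to mississippi",
    "mississippi nights and georgia days"
  )
)

# Lookup table of lower-cased state names (state.name ships with base R)
states <- tibble(state = tolower(state.name))

# One-word tokens match single-word state names
unigrams <- lyrics %>%
  unnest_tokens(output = state, input = text)

# Two-word tokens match names such as "new york"
bigrams <- lyrics %>%
  unnest_tokens(output = state, input = text, token = "ngrams", n = 2)

# Keep only tokens that are state names, then count songs per state
state_counts <- bind_rows(unigrams, bigrams) %>%
  inner_join(states, by = "state") %>%
  distinct(song, state) %>%
  count(state, sort = TRUE)

state_counts
```

Using `distinct()` before `count()` means each song contributes at most once per state, so the result is the number of songs mentioning a state rather than the number of raw mentions.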
--- class: inverse, middle # Sentiment analysis --- ## Sentiment analysis > I am happy --- ## Dictionaries ```r get_sentiments("bing") ``` ``` ## # A tibble: 6,786 × 2 ## word sentiment ## <chr> <chr> ## 1 2-faces negative ## 2 abnormal negative ## 3 abolish negative ## 4 abominable negative ## 5 abominably negative ## 6 abominate negative ## 7 abomination negative ## 8 abort negative ## 9 aborted negative ## 10 aborts negative ## # … with 6,776 more rows ``` --- ## Dictionaries ```r get_sentiments("afinn") ``` ``` ## # A tibble: 2,477 × 2 ## word value ## <chr> <dbl> ## 1 abandon -2 ## 2 abandoned -2 ## 3 abandons -2 ## 4 abducted -2 ## 5 abduction -2 ## 6 abductions -2 ## 7 abhor -3 ## 8 abhorred -3 ## 9 abhorrent -3 ## 10 abhors -3 ## # … with 2,467 more rows ``` --- ## `janeaustenr` .pull-left[ ##### Sense and Sensibility <img src="https://www.quotemaster.org/images/ff/ff37422dc1a2c38dfa293ed3a0d65aa7.gif" width="80%" style="display: block; margin: auto;" /><img src="https://smartbitchestrashybooks.com/WP/wp-content/uploads/2016/07/Hugh-Grant-waving-to-Emma-Thompson.gif" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ ##### Pride and Prejudice <img src="https://media.giphy.com/media/26FL4zFEQlJ2ffxXW/giphy.gif" width="80%" style="display: block; margin: auto;" /><img src="https://media.giphy.com/media/l4JyVmADBclbnDieY/giphy.gif" width="80%" style="display: block; margin: auto;" /> ] --- <img src="https://media.giphy.com/media/2wXrSikk2c8llaunlr/giphy.gif" width="80%" style="display: block; margin: auto;" /> --- ## Calculate sentiment <img src="index_files/figure-html/janeausten-sentiment-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Load Harry Potter text ``` ## # A tibble: 1,089,386 × 2 ## # Groups: book [7] ## book word ## <fct> <chr> ## 1 philosophers_stone the ## 2 philosophers_stone boy ## 3 philosophers_stone who ## 4 philosophers_stone lived ## 5 philosophers_stone mr ## 6 philosophers_stone and ## 7 
philosophers_stone mrs ## 8 philosophers_stone dursley ## 9 philosophers_stone of ## 10 philosophers_stone number ## # … with 1,089,376 more rows ``` --- ## Most frequent words, by book <img src="index_files/figure-html/word-freq-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Exercises <img src="https://media.giphy.com/media/pI2paNxecnUNW/giphy.gif" width="80%" style="display: block; margin: auto;" />
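---

## Sketch: dictionary-based sentiment

A minimal sketch of how dictionary tagging can turn tokens into a sentiment score. The `docs` tibble and `toy_lexicon` are hypothetical; in practice a dictionary such as `get_sentiments("bing")` from the earlier slides would take the place of the toy lexicon. The idea is to tokenize, join against the dictionary, and subtract negative from positive word counts.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Hypothetical toy documents
docs <- tibble(
  doc = c("a", "b"),
  text = c("I am happy", "A sad and terrible day, but a happy ending")
)

# Hypothetical three-word lexicon standing in for a real dictionary
toy_lexicon <- tibble(
  word = c("happy", "sad", "terrible"),
  sentiment = c("positive", "negative", "negative")
)

doc_sentiment <- docs %>%
  # One row per word, lower-cased with punctuation stripped
  unnest_tokens(output = word, input = text) %>%
  # Keep only words that appear in the lexicon
  inner_join(toy_lexicon, by = "word") %>%
  # Count positive and negative words per document
  count(doc, sentiment) %>%
  # One column per sentiment, filling missing combinations with 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment: positive minus negative word counts
  mutate(net = positive - negative)

doc_sentiment
```

Words missing from the dictionary are dropped by the `inner_join()`, so the score reflects only the terms the lexicon knows about.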