Practicing tidytext with song titles

library(tidyverse)
library(acs)
library(tidytext)
library(here)

set.seed(1234)
theme_set(theme_minimal())

Run the code below in your console to download this exercise as a set of R scripts.

usethis::use_course("cis-ds/text-analysis-fundamentals-and-sentiment-analysis")

Today let’s practice our tidytext skills with a basic analysis of song lyrics: how often is each U.S. state mentioned in a popular song? We’ll define popular songs as those in Billboard’s Year-End Hot 100; the data file used here, billboard_lyrics_1964-2015.csv, contains 5,100 songs.

Download population data for U.S. states

First let’s use the tidycensus package to access the U.S. Census Bureau API and obtain population numbers for each state in 2016. This will help us later to normalize state mentions based on relative population size.¹

To import the data in-class, run:

pop_df <- read_csv("http://info5940.infosci.cornell.edu/data/pop2016.csv")

The code below shows how the file was originally constructed.

# retrieve state populations in 2016 from Census Bureau ACS
library(tidycensus)
pop_df <- get_acs(
  geography = "state", year = 2016,
  variables = c(population = "B01003_001")
) %>%
  # remove moe and tidy the data frame
  select(-moe) %>%
  spread(variable, estimate) %>%
  # clean the data to match with the structure of the lyrics data
  rename(state_name = NAME) %>%
  mutate(state_name = str_to_lower(state_name)) %>%
  # names are already lower case here, so match the lower-case form
  filter(state_name != "puerto rico") %>%
  write_csv(here("static", "data", "pop2016.csv"))

## Getting data from the 2012-2016 5-year ACS

# do these results make sense?
pop_df %>%
  arrange(desc(population)) %>%
  top_n(10)

## Selecting by population

## # A tibble: 10 × 3
##    GEOID state_name     population
##    <chr> <chr>               <dbl>
##  1 06    california       38654206
##  2 48    texas            26956435
##  3 12    florida          19934451
##  4 36    new york         19697457
##  5 17    illinois         12851684
##  6 42    pennsylvania     12783977
##  7 39    ohio             11586941
##  8 13    georgia          10099320
##  9 37    north carolina    9940828
## 10 26    michigan          9909600

Retrieve song lyrics

Next we need to retrieve the song lyrics for all our songs. Kaylin Walker provides a GitHub repo with the necessary files.

To import the data in-class, run:

song_lyrics <- read_csv("http://info5940.infosci.cornell.edu/data/billboard_lyrics_1964-2015.csv")

## Rows: 5100 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Song, Artist, Lyrics
## dbl (3): Rank, Year, Source
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Rows: 5,100
## Columns: 6
## $ Rank   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ Song   <chr> "wooly bully", "i cant help myself sugar pie honey bunch", "i c…
## $ Artist <chr> "sam the sham and the pharaohs", "four tops", "the rolling ston…
## $ Year   <dbl> 1965, 1965, 1965, 1965, 1965, 1965, 1965, 1965, 1965, 1965, 196…
## $ Lyrics <chr> "sam the sham miscellaneous wooly bully wooly bully sam the sha…
## $ Source <dbl> 3, 1, 1, 1, 1, 1, 3, 5, 1, 3, 3, 1, 3, 1, 3, 3, 3, 3, 1, 1, 1, …

The lyrics are stored as character vectors, one string for each song. Consider the song Uptown Funk:

## this hit that ice cold michelle pfeiffer that white gold this one for them hood
## girls them good girls straight masterpieces stylin whilen livin it up in the
## city got chucks on with saint laurent got kiss myself im so prettyim too hot
## hot damn called a police and a fireman im too hot hot damn make a dragon wanna
## retire man im too hot hot damn say my name you know who i am im too hot hot damn
## am i bad bout that money break it downgirls hit your hallelujah whoo girls hit
## your hallelujah whoo girls hit your hallelujah whoo cause uptown funk gon give
## it to you cause uptown funk gon give it to you cause uptown funk gon give it
## to you saturday night and we in the spot dont believe me just watch come ondont
## believe me just watch uhdont believe me just watch dont believe me just watch
## dont believe me just watch dont believe me just watch hey hey hey oh meaning
## byamandah editor 70s girl group the sequence accused bruno mars and producer
## mark ronson of ripping their sound off in uptown funk their song in question is
## funk you see all stop wait a minute fill my cup put some liquor in it take a sip
## sign a check julio get the stretch ride to harlem hollywood jackson mississippi
## if we show up we gon show out smoother than a fresh jar of skippyim too hot
## hot damn called a police and a fireman im too hot hot damn make a dragon wanna
## retire man im too hot hot damn bitch say my name you know who i am im too hot
## hot damn am i bad bout that money break it downgirls hit your hallelujah whoo
## girls hit your hallelujah whoo girls hit your hallelujah whoo cause uptown funk
## gon give it to you cause uptown funk gon give it to you cause uptown funk gon
## give it to you saturday night and we in the spot dont believe me just watch
## come ondont believe me just watch uhdont believe me just watch uh dont believe
## me just watch uh dont believe me just watch dont believe me just watch hey hey
## hey ohbefore we leave lemmi tell yall a lil something uptown funk you up uptown
## funk you up uptown funk you up uptown funk you up uh i said uptown funk you up
## uptown funk you up uptown funk you up uptown funk you upcome on dance jump on
## it if you sexy then flaunt it if you freaky then own it dont brag about it come
## show mecome on dance jump on it if you sexy then flaunt it well its saturday
## night and we in the spot dont believe me just watch come ondont believe me just
## watch uhdont believe me just watch uh dont believe me just watch uh dont believe
## me just watch dont believe me just watch hey hey hey ohuptown funk you up uptown
## funk you up say what uptown funk you up uptown funk you up uptown funk you up
## uptown funk you up say what uptown funk you up uptown funk you up uptown funk
## you up uptown funk you up say what uptown funk you up uptown funk you up uptown
## funk you up uptown funk you up say what uptown funk you up
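To make the "one string per song" structure concrete, here is a minimal sketch of pulling a single song's lyrics out of a data frame and wrapping the text for display. The `toy_lyrics` tibble and its contents are made up for illustration; only the `Song` and `Lyrics` column names come from the real data.

```r
library(dplyr)
library(stringr)

# hypothetical stand-in for song_lyrics: one row, and one long string, per song
toy_lyrics <- tibble(
  Song = c("toy song one", "toy song two"),
  Lyrics = c(
    "la la la this is one long string with no punctuation or line breaks at all",
    "another song stored the same way as a single lowercase string"
  )
)

# keep one song and extract its lyrics as a character vector of length 1
one_song <- toy_lyrics %>%
  filter(Song == "toy song one") %>%
  pull(Lyrics)

length(one_song)

# wrap the single string so it is readable in the console
cat(str_wrap(one_song, width = 40), "\n")
```

The same pattern should work on `song_lyrics` by filtering on the song you want to inspect.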

Find and visualize the state names in the song lyrics

Now your work begins!

Use `tidytext` to create a data frame with one row for each token in each song

Hint: To search for matching state names, this data frame should include both unigrams and bigrams.

Solution:

# tokenize
lyrics_unigrams <- unnest_tokens(
  tbl = song_lyrics,
  output = word,
  input = Lyrics
)
lyrics_bigrams <- unnest_tokens(
  tbl = song_lyrics,
  output = word,
  input = Lyrics,
  token = "ngrams", n = 2
)

# combine together
tidy_lyrics <- bind_rows(lyrics_unigrams, lyrics_bigrams)
tidy_lyrics

## # A tibble: 3,201,465 × 6
##     Rank Song        Artist                         Year Source word         
##    <dbl> <chr>       <chr>                         <dbl>  <dbl> <chr>        
##  1     1 wooly bully sam the sham and the pharaohs  1965      3 sam          
##  2     1 wooly bully sam the sham and the pharaohs  1965      3 the          
##  3     1 wooly bully sam the sham and the pharaohs  1965      3 sham         
##  4     1 wooly bully sam the sham and the pharaohs  1965      3 miscellaneous
##  5     1 wooly bully sam the sham and the pharaohs  1965      3 wooly        
##  6     1 wooly bully sam the sham and the pharaohs  1965      3 bully        
##  7     1 wooly bully sam the sham and the pharaohs  1965      3 wooly        
##  8     1 wooly bully sam the sham and the pharaohs  1965      3 bully        
##  9     1 wooly bully sam the sham and the pharaohs  1965      3 sam          
## 10     1 wooly bully sam the sham and the pharaohs  1965      3 the          
## # … with 3,201,455 more rows
## # ℹ Use `print(n = ...)` to see more rows

The `word` column in this data frame now holds every unigram and bigram in the lyrics, so any one-word or two-word state name that occurs in a song will appear in it.
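A small sketch may help show why both token types are needed. The `toy_lyrics` tibble below is invented for illustration and is not taken from the real file; the tokenizing calls mirror the ones used above.

```r
library(dplyr)
library(tidytext)

# hypothetical one-song data frame with a two-word state name in the lyrics
toy_lyrics <- tibble(
  Song = "toy song",
  Lyrics = "start spreading the news new york new york"
)

# one row per word
toy_unigrams <- unnest_tokens(toy_lyrics, output = word, input = Lyrics)

# one row per pair of adjacent words
toy_bigrams <- unnest_tokens(
  toy_lyrics,
  output = word, input = Lyrics,
  token = "ngrams", n = 2
)

toy_tokens <- bind_rows(toy_unigrams, toy_bigrams)

# "new york" never appears as a single word, only as a bigram
"new york" %in% toy_unigrams$word
"new york" %in% toy_bigrams$word
```

With unigrams alone, the only candidates would be "new" and "york", neither of which is a state name.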

Find all the state names occurring in the song lyrics

First create a data frame containing every token that matches a state name, then save a new data frame that includes only one observation per state for each matching song. That is, if the song is “New York, New York”, there should be only one row in the resulting table for that song.

Solution:

inner_join(tidy_lyrics, pop_df, by = c("word" = "state_name"))

## # A tibble: 526 × 8
##     Rank Song               Artist          Year Source word       GEOID popul…¹
##    <dbl> <chr>              <chr>          <dbl>  <dbl> <chr>      <chr>   <dbl>
##  1    12 king of the road   roger miller    1965      1 maine      23     1.33e6
##  2    29 eve of destruction barry mcguire   1965      1 alabama    01     4.84e6
##  3    49 california girls   the beach boys  1965      3 california 06     3.87e7
##  4    49 california girls   the beach boys  1965      3 california 06     3.87e7
##  5    49 california girls   the beach boys  1965      3 california 06     3.87e7
##  6    49 california girls   the beach boys  1965      3 california 06     3.87e7
##  7    49 california girls   the beach boys  1965      3 california 06     3.87e7
##  8    49 california girls   the beach boys  1965      3 california 06     3.87e7
##  9    49 california girls   the beach boys  1965      3 california 06     3.87e7
## 10    49 california girls   the beach boys  1965      3 california 06     3.87e7
## # … with 516 more rows, and abbreviated variable name ¹population
## # ℹ Use `print(n = ...)` to see more rows

Let’s count each state only once per song in which it is mentioned.

tidy_lyrics <- inner_join(tidy_lyrics, pop_df, by = c("word" = "state_name")) %>%
  distinct(Rank, Song, Artist, Year, word, .keep_all = TRUE)
tidy_lyrics

## # A tibble: 253 × 8
##     Rank Song                          Artist    Year Source word  GEOID popul…¹
##    <dbl> <chr>                         <chr>    <dbl>  <dbl> <chr> <chr>   <dbl>
##  1    12 king of the road              roger m…  1965      1 maine 23     1.33e6
##  2    29 eve of destruction            barry m…  1965      1 alab… 01     4.84e6
##  3    49 california girls              the bea…  1965      3 cali… 06     3.87e7
##  4    10 california dreamin            the mam…  1966      3 cali… 06     3.87e7
##  5    77 message to michael            dionne …  1966      1 kent… 21     4.41e6
##  6    61 california nights             lesley …  1967      1 cali… 06     3.87e7
##  7     4 sittin on the dock of the bay otis re…  1968      1 geor… 13     1.01e7
##  8    10 tighten up                    archie …  1968      3 texas 48     2.70e7
##  9    25 get back                      the bea…  1969      3 ariz… 04     6.73e6
## 10    25 get back                      the bea…  1969      3 cali… 06     3.87e7
## # … with 243 more rows, and abbreviated variable name ¹population
## # ℹ Use `print(n = ...)` to see more rows
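To see what the join and `distinct()` steps are doing, here is a toy sketch. The token table is invented for illustration; the three populations are copied from the 2016 table earlier on this page.

```r
library(dplyr)

# hypothetical tokens: song a repeats "new york", and "love" is not a state
toy_tokens <- tibble(
  Song = c("song a", "song a", "song a", "song b", "song b"),
  word = c("new york", "new york", "texas", "texas", "love")
)

toy_states <- tibble(
  state_name = c("new york", "texas", "ohio"),
  population = c(19697457, 26956435, 11586941)
)

# inner_join() keeps only tokens that match a state name
toy_matches <- inner_join(toy_tokens, toy_states, by = c("word" = "state_name"))

# distinct() collapses repeated mentions to one row per song-state pair
toy_once <- distinct(toy_matches, Song, word, .keep_all = TRUE)

toy_matches
toy_once
```

In the real data a song is identified by `Rank`, `Song`, `Artist`, and `Year` together, which is why all four appear alongside `word` in the `distinct()` call above.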

Calculate the frequency for each state’s mention in a song and create a new column for the frequency adjusted by the state’s population

Solution:

(state_counts <- tidy_lyrics %>%
  count(word) %>%
  arrange(desc(n)))

## # A tibble: 33 × 2
##    word            n
##    <chr>       <int>
##  1 new york       64
##  2 california     34
##  3 georgia        22
##  4 tennessee      14
##  5 texas          14
##  6 alabama        12
##  7 mississippi    10
##  8 kentucky        7
##  9 hawaii          6
## 10 illinois        6
## # … with 23 more rows
## # ℹ Use `print(n = ...)` to see more rows

pop_df <- pop_df %>%
  left_join(state_counts, by = c("state_name" = "word")) %>%
  mutate(rate = n / population * 1e6)

# which are the top ten states by rate?
pop_df %>%
  arrange(desc(rate)) %>%
  top_n(10)

## Selecting by rate

## # A tibble: 10 × 5
##    GEOID state_name  population     n  rate
##    <chr> <chr>            <dbl> <int> <dbl>
##  1 15    hawaii         1413673     6  4.24
##  2 28    mississippi    2989192    10  3.35
##  3 36    new york      19697457    64  3.25
##  4 01    alabama        4841164    12  2.48
##  5 23    maine          1329923     3  2.26
##  6 13    georgia       10099320    22  2.18
##  7 47    tennessee      6548009    14  2.14
##  8 30    montana        1023391     2  1.95
##  9 31    nebraska       1881259     3  1.59
## 10 21    kentucky       4411989     7  1.59
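One detail worth checking is what happens to states that are never mentioned: after a `left_join()`, their `n` is `NA`, and so is their rate. The sketch below is one possible way to handle that, using `coalesce()` to treat a missing count as zero and `slice_max()` as an alternative to `arrange()` plus `top_n()`. The toy tables reuse three rows from the output above; leaving ohio out of `toy_counts` is purely an assumption to illustrate the missing case, not a claim about the real data.

```r
library(dplyr)

toy_pop <- tibble(
  state_name = c("hawaii", "new york", "ohio"),
  population = c(1413673, 19697457, 11586941)
)

# hypothetical counts table with no row for ohio
toy_counts <- tibble(
  word = c("hawaii", "new york"),
  n = c(6L, 64L)
)

toy_rates <- toy_pop %>%
  left_join(toy_counts, by = c("state_name" = "word")) %>%
  mutate(
    # unmatched states come back as NA; treat them as zero mentions
    n = coalesce(n, 0L),
    # mentions per million residents
    rate = n / population * 1e6
  )

# the two highest rates, already sorted in descending order
top_rates <- slice_max(toy_rates, rate, n = 2)
top_rates
```

Whether zero or `NA` is the better value depends on how you want unmentioned states to appear on the maps that follow.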

Make a choropleth map for both the raw frequency counts and relative frequency counts

The statebins package is a nifty shortcut for making basic U.S. cartogram maps.

library(statebins)

pop_df %>%
  mutate(
    state_name = stringr::str_to_title(state_name),
    state_name = if_else(state_name == "District Of Columbia",
      "District of Columbia", state_name
    )
  ) %>%
  statebins(
    state_col = "state_name", value_col = "n",
    name = "Number of mentions"
  ) +
  labs(title = "Frequency of states mentioned in song lyrics") +
  theme_statebins()

pop_df %>%
  mutate(
    state_name = stringr::str_to_title(state_name),
    state_name = if_else(state_name == "District Of Columbia",
      "District of Columbia", state_name
    )
  ) %>%
  statebins(
    state_col = "state_name", value_col = "rate",
    name = "Mentions per million residents"
  ) +
  labs(title = "Frequency of states mentioned in song lyrics") +
  theme_statebins()
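Both maps repeat the same name clean-up step, so it could be pulled into a small helper. This is only a sketch: `prep_state_names()` is a hypothetical function name, not part of statebins or of the original exercise.

```r
library(dplyr)
library(stringr)

# hypothetical helper: convert lower-case state names to the title-case form
# used in the maps above, fixing the one name that str_to_title() over-capitalizes
prep_state_names <- function(df) {
  df %>%
    mutate(
      state_name = str_to_title(state_name),
      state_name = if_else(
        state_name == "District Of Columbia",
        "District of Columbia",
        state_name
      )
    )
}

toy_names <- tibble(state_name = c("new york", "district of columbia"))
prep_state_names(toy_names)$state_name
```

With that in place, each map could start from `pop_df %>% prep_state_names()` and then call `statebins()` with the same arguments as above.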

Acknowledgments

This page is derived in part from “Song Lyrics Across the United States” and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Session Info

sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.1 (2022-06-23)
##  os       macOS Monterey 12.3
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/New_York
##  date     2022-08-22
##  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package       * version    date (UTC) lib source
##  acs           * 2.1.4      2019-02-19 [2] CRAN (R 4.2.0)
##  assertthat      0.2.1      2019-03-21 [2] CRAN (R 4.2.0)
##  backports       1.4.1      2021-12-13 [2] CRAN (R 4.2.0)
##  blogdown        1.10       2022-05-10 [2] CRAN (R 4.2.0)
##  bookdown        0.27       2022-06-14 [2] CRAN (R 4.2.0)
##  broom           1.0.0      2022-07-01 [2] CRAN (R 4.2.0)
##  bslib           0.4.0      2022-07-16 [2] CRAN (R 4.2.0)
##  cachem          1.0.6      2021-08-19 [2] CRAN (R 4.2.0)
##  cellranger      1.1.0      2016-07-27 [2] CRAN (R 4.2.0)
##  cli             3.3.0      2022-04-25 [2] CRAN (R 4.2.0)
##  colorspace      2.0-3      2022-02-21 [2] CRAN (R 4.2.0)
##  crayon          1.5.1      2022-03-26 [2] CRAN (R 4.2.0)
##  DBI             1.1.3      2022-06-18 [2] CRAN (R 4.2.0)
##  dbplyr          2.2.1      2022-06-27 [2] CRAN (R 4.2.0)
##  digest          0.6.29     2021-12-01 [2] CRAN (R 4.2.0)
##  dplyr         * 1.0.9      2022-04-28 [2] CRAN (R 4.2.0)
##  ellipsis        0.3.2      2021-04-29 [2] CRAN (R 4.2.0)
##  evaluate        0.16       2022-08-09 [1] CRAN (R 4.2.1)
##  fansi           1.0.3      2022-03-24 [2] CRAN (R 4.2.0)
##  fastmap         1.1.0      2021-01-25 [2] CRAN (R 4.2.0)
##  forcats       * 0.5.1      2021-01-27 [2] CRAN (R 4.2.0)
##  fs              1.5.2      2021-12-08 [2] CRAN (R 4.2.0)
##  gargle          1.2.0      2021-07-02 [2] CRAN (R 4.2.0)
##  generics        0.1.3      2022-07-05 [2] CRAN (R 4.2.0)
##  ggplot2       * 3.3.6      2022-05-03 [2] CRAN (R 4.2.0)
##  glue            1.6.2      2022-02-24 [2] CRAN (R 4.2.0)
##  googledrive     2.0.0      2021-07-08 [2] CRAN (R 4.2.0)
##  googlesheets4   1.0.0      2021-07-21 [2] CRAN (R 4.2.0)
##  gtable          0.3.0      2019-03-25 [2] CRAN (R 4.2.0)
##  haven           2.5.0      2022-04-15 [2] CRAN (R 4.2.0)
##  here          * 1.0.1      2020-12-13 [2] CRAN (R 4.2.0)
##  hms             1.1.1      2021-09-26 [2] CRAN (R 4.2.0)
##  htmltools       0.5.3      2022-07-18 [2] CRAN (R 4.2.0)
##  httr            1.4.3      2022-05-04 [2] CRAN (R 4.2.0)
##  janeaustenr     0.1.5      2017-06-10 [2] CRAN (R 4.2.0)
##  jquerylib       0.1.4      2021-04-26 [2] CRAN (R 4.2.0)
##  jsonlite        1.8.0      2022-02-22 [2] CRAN (R 4.2.0)
##  knitr           1.39       2022-04-26 [2] CRAN (R 4.2.0)
##  lattice         0.20-45    2021-09-22 [2] CRAN (R 4.2.1)
##  lifecycle       1.0.1      2021-09-24 [2] CRAN (R 4.2.0)
##  lubridate       1.8.0      2021-10-07 [2] CRAN (R 4.2.0)
##  magrittr        2.0.3      2022-03-30 [2] CRAN (R 4.2.0)
##  Matrix          1.4-1      2022-03-23 [2] CRAN (R 4.2.1)
##  modelr          0.1.8      2020-05-19 [2] CRAN (R 4.2.0)
##  munsell         0.5.0      2018-06-12 [2] CRAN (R 4.2.0)
##  pillar          1.8.0      2022-07-18 [2] CRAN (R 4.2.0)
##  pkgconfig       2.0.3      2019-09-22 [2] CRAN (R 4.2.0)
##  plyr            1.8.7      2022-03-24 [2] CRAN (R 4.2.0)
##  purrr         * 0.3.4      2020-04-17 [2] CRAN (R 4.2.0)
##  R6              2.5.1      2021-08-19 [2] CRAN (R 4.2.0)
##  Rcpp            1.0.9      2022-07-08 [2] CRAN (R 4.2.0)
##  readr         * 2.1.2      2022-01-30 [2] CRAN (R 4.2.0)
##  readxl          1.4.0      2022-03-28 [2] CRAN (R 4.2.0)
##  reprex          2.0.1.9000 2022-08-10 [1] Github (tidyverse/reprex@6d3ad07)
##  rlang           1.0.4      2022-07-12 [2] CRAN (R 4.2.0)
##  rmarkdown       2.14       2022-04-25 [2] CRAN (R 4.2.0)
##  rprojroot       2.0.3      2022-04-02 [2] CRAN (R 4.2.0)
##  rstudioapi      0.13       2020-11-12 [2] CRAN (R 4.2.0)
##  rvest           1.0.2      2021-10-16 [2] CRAN (R 4.2.0)
##  sass            0.4.2      2022-07-16 [2] CRAN (R 4.2.0)
##  scales          1.2.0      2022-04-13 [2] CRAN (R 4.2.0)
##  sessioninfo     1.2.2      2021-12-06 [2] CRAN (R 4.2.0)
##  SnowballC       0.7.0      2020-04-01 [2] CRAN (R 4.2.0)
##  stringi         1.7.8      2022-07-11 [2] CRAN (R 4.2.0)
##  stringr       * 1.4.0      2019-02-10 [2] CRAN (R 4.2.0)
##  tibble        * 3.1.8      2022-07-22 [2] CRAN (R 4.2.0)
##  tidyr         * 1.2.0      2022-02-01 [2] CRAN (R 4.2.0)
##  tidyselect      1.1.2      2022-02-21 [2] CRAN (R 4.2.0)
##  tidytext      * 0.3.3      2022-05-09 [2] CRAN (R 4.2.0)
##  tidyverse     * 1.3.2      2022-07-18 [2] CRAN (R 4.2.0)
##  tokenizers      0.2.1      2018-03-29 [2] CRAN (R 4.2.0)
##  tzdb            0.3.0      2022-03-28 [2] CRAN (R 4.2.0)
##  utf8            1.2.2      2021-07-24 [2] CRAN (R 4.2.0)
##  vctrs           0.4.1      2022-04-13 [2] CRAN (R 4.2.0)
##  withr           2.5.0      2022-03-03 [2] CRAN (R 4.2.0)
##  xfun            0.31       2022-05-10 [1] CRAN (R 4.2.0)
##  XML           * 3.99-0.10  2022-06-09 [2] CRAN (R 4.2.0)
##  xml2            1.3.3      2021-11-30 [2] CRAN (R 4.2.0)
##  yaml            2.3.5      2022-02-21 [2] CRAN (R 4.2.0)
## 
##  [1] /Users/soltoffbc/Library/R/arm64/4.2/library
##  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────

¹ For instance, California has far more people than Rhode Island, so it makes sense that California would be mentioned more often in popular songs. But per capita, are these mentions different?

Last updated on Mar 1, 2019
