class: center, middle, inverse, title-slide

.title[
# Data wrangling: relational data and factors
]
.author[
### INFO 5940
Cornell University
]

---
class: inverse, middle

# Relational data structures

---

## Introduction to relational data

* Multiple tables of data that, when combined, answer research questions
* The relations between tables are the important element, not just the individual tables
* Relations are defined between a pair of tables
* Relational verbs
    * Mutating joins
    * Filtering joins

---
class: middle

<img src="https://www.hindustantimes.com/rf/image_size_960x540/HT/p2/2018/02/08/Pictures/_33b2ca74-0cc1-11e8-ba67-a8387f729390.jpeg" width="80%" style="display: block; margin: auto;" />

---
class: middle

.pull-left[

<table>
<caption>Superheroes</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> alignment </th>
   <th style="text-align:left;"> gender </th>
   <th style="text-align:left;"> publisher </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Magneto </td>
   <td style="text-align:left;"> bad </td>
   <td style="text-align:left;"> male </td>
   <td style="text-align:left;"> Marvel </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Batman </td>
   <td style="text-align:left;"> good </td>
   <td style="text-align:left;"> male </td>
   <td style="text-align:left;"> DC </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sabrina </td>
   <td style="text-align:left;"> good </td>
   <td style="text-align:left;"> female </td>
   <td style="text-align:left;"> Archie Comics </td>
  </tr>
</tbody>
</table>

]

.pull-right[

<table>
<caption>Publishers</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> publisher </th>
   <th style="text-align:right;"> yr_founded </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> DC </td>
   <td style="text-align:right;"> 1934 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Marvel </td>
   <td style="text-align:right;"> 1939 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Image </td>
   <td style="text-align:right;"> 1992 </td>
  </tr>
</tbody>
</table>

]

---
class: inverse, middle

# Mutating joins

---

## `inner_join()`

<img src="index_files/figure-html/ijsp-anim-1.gif" width="80%" style="display: block; margin: auto;" />

---

## `inner_join()`

```r
inner_join(x = superheroes, y = publishers, by = "publisher")
```

```
## # A tibble: 2 × 5
##   name    alignment gender publisher yr_founded
##   <chr>   <chr>     <chr>  <chr>          <dbl>
## 1 Magneto bad       male   Marvel          1939
## 2 Batman  good      male   DC              1934
```

---

## `left_join()`

<img src="index_files/figure-html/ljsp-anim-1.gif" width="80%" style="display: block; margin: auto;" />

---

## `left_join()`

```r
left_join(x = superheroes, y = publishers, by = "publisher")
```

```
## # A tibble: 3 × 5
##   name    alignment gender publisher     yr_founded
##   <chr>   <chr>     <chr>  <chr>              <dbl>
## 1 Magneto bad       male   Marvel              1939
## 2 Batman  good      male   DC                  1934
## 3 Sabrina good      female Archie Comics         NA
```

---

## `right_join()`

<img src="index_files/figure-html/rjsp-anim-1.gif" width="80%" style="display: block; margin: auto;" />

---

## `right_join()`

```r
right_join(x = superheroes, y = publishers, by = "publisher")
```

```
## # A tibble: 3 × 5
##   name    alignment gender publisher yr_founded
##   <chr>   <chr>     <chr>  <chr>          <dbl>
## 1 Magneto bad       male   Marvel          1939
## 2 Batman  good      male   DC              1934
## 3 <NA>    <NA>      <NA>   Image           1992
```

---

## `right_join()` reversed

<img src="index_files/figure-html/rjsp-alt-anim-1.gif" width="80%" style="display: block; margin: auto;" />

---

## `full_join()`

<img src="index_files/figure-html/fjsp-anim-1.gif" width="80%" style="display: block; margin: auto;" />

---

## `full_join()`

```r
full_join(x = superheroes, y = publishers, by = "publisher")
```

```
## # A tibble: 4 × 5
##   name    alignment gender publisher     yr_founded
##   <chr>   <chr>     <chr>  <chr>              <dbl>
## 1 Magneto bad       male   Marvel              1939
## 2 Batman  good      male   DC                  1934
## 3 Sabrina good      female Archie Comics         NA
## 4 <NA>    <NA>      <NA>   Image               1992
```

---
class: inverse, middle

# Filtering joins

---

## `semi_join()`

<img src="index_files/figure-html/sjsp-anim-1.gif" width="80%" style="display: block; margin: auto;" />

---

## `semi_join()`

```r
semi_join(x = superheroes, y = publishers, by = "publisher")
```

```
## # A tibble: 2 × 4
##   name    alignment gender publisher
##   <chr>   <chr>     <chr>  <chr>    
## 1 Magneto bad       male   Marvel   
## 2 Batman  good      male   DC       
```

---

## `anti_join()`

<img src="index_files/figure-html/ajsp-anim-1.gif" width="80%" style="display: block; margin: auto;" />

---

## `anti_join()`

```r
anti_join(x = superheroes, y = publishers, by = "publisher")
```

```
## # A tibble: 1 × 4
##   name    alignment gender publisher    
##   <chr>   <chr>     <chr>  <chr>        
## 1 Sabrina good      female Archie Comics
```

---

## Gonna take pollution down to zero

<img src="https://media.giphy.com/media/kQYNaEa35hQ6pCYywH/giphy-downsized-large.gif" width="60%" style="display: block; margin: auto;" />
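---

## Reproducing the join examples

The join slides use `superheroes` and `publishers` without showing how they were created. The sketch below is one way to rebuild them, assuming they are plain tibbles typed in by hand with `tibble::tribble()`; the `row_counts` name is only for illustration. It then counts the rows each join returns, which should line up with the tibble dimensions printed on the earlier slides.

```r
library(dplyr)

# Assumed construction of the two tables; the values are copied from the slides
superheroes <- tibble::tribble(
  ~name,     ~alignment, ~gender,  ~publisher,
  "Magneto", "bad",      "male",   "Marvel",
  "Batman",  "good",     "male",   "DC",
  "Sabrina", "good",     "female", "Archie Comics"
)

publishers <- tibble::tribble(
  ~publisher, ~yr_founded,
  "DC",       1934,
  "Marvel",   1939,
  "Image",    1992
)

# Mutating joins add columns from y; they differ in which rows are kept.
# Filtering joins (semi, anti) only keep or drop rows of x.
row_counts <- c(
  inner = nrow(inner_join(superheroes, publishers, by = "publisher")),
  left  = nrow(left_join(superheroes, publishers, by = "publisher")),
  right = nrow(right_join(superheroes, publishers, by = "publisher")),
  full  = nrow(full_join(superheroes, publishers, by = "publisher")),
  semi  = nrow(semi_join(superheroes, publishers, by = "publisher")),
  anti  = nrow(anti_join(superheroes, publishers, by = "publisher"))
)
row_counts
```

Note that `semi_join()` and `anti_join()` together account for every row of `superheroes`: each superhero either has a matching publisher or does not.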
---
class: inverse, middle

# Factors

---

## Factors

* Used for categorical (discrete) variables
* Historically used for purposes of efficiency
* Not really necessary in modern R
* Best used to sort categorical variables other than alphabetically
* `forcats`

---

## Character vector

```r
(x1 <- c("Dec", "Apr", "Jan", "Mar"))
```

```
## [1] "Dec" "Apr" "Jan" "Mar"
```

```r
sort(x1)
```

```
## [1] "Apr" "Dec" "Jan" "Mar"
```

---

## Levels

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```

--

## Factor

```r
(y1 <- factor(x1, levels = month_levels))
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

```r
parse_factor(x1, levels = month_levels)
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

## Different levels/labels

```r
(x2 <- c(12, 4, 1, 3))
```

```
## [1] 12  4  1  3
```

```r
y2 <- factor(x2,
  levels = seq(from = 1, to = 12),
  labels = month_levels
)
y2
```

```
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

---

<img src="https://media.giphy.com/media/z19k6UnH8cXzQsrWWw/giphy.gif" width="80%" style="display: block; margin: auto;" />
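---

## Sorting by level order

The factor slides build `y1` but stop short of sorting it. As a small sketch using only base R, the code below contrasts sorting the character vector with sorting the factor, and shows the integer codes that drive the factor's order. The `x_typo` vector is a made-up example, not part of the original slides, included to show that base `factor()` turns values outside `levels` into `NA` without complaint.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month_levels)

# Character vectors sort alphabetically
sort(x1)

# Factors sort by the position of each value in `levels`
sort(y1)

# The underlying integer codes are the positions in `month_levels`
as.integer(y1)

# Hypothetical input with a typo: "Jam" is not a level, so it becomes NA
x_typo <- c("Dec", "Jam")
y_typo <- factor(x_typo, levels = month_levels)
is.na(y_typo)
```

Checking for unexpected `NA` values after converting is a cheap way to catch misspelled categories early.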