class: center, middle, inverse, title-slide .title[ # Vectors and iteration ] .author[ ### INFO 5940
Cornell University ] --- <img src="https://r4ds.had.co.nz/diagrams/data-structures-overview.png" width="60%" style="display: block; margin: auto;" /> --- class: inverse, middle # Atomic vectors --- ## Logical vectors ```r parse_logical(c("TRUE", "TRUE", "FALSE", "TRUE", "NA")) ## [1] TRUE TRUE FALSE TRUE NA ``` -- ## Numeric vectors ```r parse_integer(c("1", "5", "3", "4", "12423")) ## [1] 1 5 3 4 12423 parse_double(c("4.2", "4", "6", "53.2")) ## [1] 4.2 4.0 6.0 53.2 ``` -- ## Character vectors ```r parse_character(c("Goodnight Moon", "Runaway Bunny", "Big Red Barn")) ## [1] "Goodnight Moon" "Runaway Bunny" "Big Red Barn" ``` --- ## Scalars ```r (x <- sample(10)) ``` ``` ## [1] 10 6 5 4 1 8 2 7 9 3 ``` ```r x + c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100) ``` ``` ## [1] 110 106 105 104 101 108 102 107 109 103 ``` ```r x + 100 ``` ``` ## [1] 110 106 105 104 101 108 102 107 109 103 ``` --- ## Vector recycling ```r # create two sequences of different lengths (x1 <- seq(from = 1, to = 2)) ``` ``` ## [1] 1 2 ``` ```r (x2 <- seq(from = 1, to = 10)) ``` ``` ## [1] 1 2 3 4 5 6 7 8 9 10 ``` ```r # add together two sequences of numbers x1 + x2 ``` ``` ## [1] 2 4 4 6 6 8 8 10 10 12 ``` --- ## Subsetting vectors ```r x <- c("one", "two", "three", "four", "five") ``` * With positive integers ```r x[c(3, 2, 5)] ## [1] "three" "two" "five" ``` * With negative integers ```r x[c(-1, -3, -5)] ## [1] "two" "four" ``` * Don't mix positive and negative ```r x[c(-1, 1)] ## Error in x[c(-1, 1)]: only 0's may be mixed with negative subscripts ``` --- ## Subset with a logical vector ```r (x <- c(10, 3, NA, 5, 8, 1, NA)) ``` ``` ## [1] 10 3 NA 5 8 1 NA ``` ```r # All non-missing values of x !is.na(x) ``` ``` ## [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE ``` ```r x[!is.na(x)] ``` ``` ## [1] 10 3 5 8 1 ``` ```r # All even (or missing!) 
values of x x[x %% 2 == 0] ``` ``` ## [1] 10 NA 8 NA ``` --- class: inverse, middle # Lists --- ## Lists ```r x <- list(1, 2, 3) x ``` ``` ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 2 ## ## [[3]] ## [1] 3 ``` --- ## Lists: `str()` ```r str(x) ``` ``` ## List of 3 ## $ : num 1 ## $ : num 2 ## $ : num 3 ``` ```r x_named <- list(a = 1, b = 2, c = 3) str(x_named) ``` ``` ## List of 3 ## $ a: num 1 ## $ b: num 2 ## $ c: num 3 ``` --- ## Store a mix of objects ```r y <- list("a", 1L, 1.5, TRUE) str(y) ``` ``` ## List of 4 ## $ : chr "a" ## $ : int 1 ## $ : num 1.5 ## $ : logi TRUE ``` --- <img src="../../../../../../../../img/xzibit-lists.jpg" width="80%" style="display: block; margin: auto;" /> --- ## Nested lists ```r z <- list(list(1, 2), list(3, 4)) str(z) ``` ``` ## List of 2 ## $ :List of 2 ## ..$ : num 1 ## ..$ : num 2 ## $ :List of 2 ## ..$ : num 3 ## ..$ : num 4 ``` --- ## Secret lists ```r str(gun_deaths) ``` ``` ## spec_tbl_df [100,798 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## $ id : num [1:100798] 1 2 3 4 5 6 7 8 9 10 ... ## $ year : num [1:100798] 2012 2012 2012 2012 2012 ... ## $ month : chr [1:100798] "Jan" "Jan" "Jan" "Feb" ... ## $ intent : chr [1:100798] "Suicide" "Suicide" "Suicide" "Suicide" ... ## $ police : num [1:100798] 0 0 0 0 0 0 0 0 0 0 ... ## $ sex : chr [1:100798] "M" "F" "M" "M" ... ## $ age : num [1:100798] 34 21 60 64 31 17 48 41 50 NA ... ## $ race : chr [1:100798] "Asian/Pacific Islander" "White" "White" "White" ... ## $ place : chr [1:100798] "Home" "Street" "Other specified" "Home" ... ## $ education: Factor w/ 4 levels "Less than HS",..: 4 3 4 4 2 1 2 2 3 NA ... ``` --- <img src="https://r4ds.had.co.nz/diagrams/lists-subsetting.png" width="60%" style="display: block; margin: auto;" /> --- ## Exercise on subsetting vectors <img src="https://media.giphy.com/media/uLUgjrzvQPXV5sTZeY/giphy.gif" width="50%" style="display: block; margin: auto;" />
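---

## Subsetting lists: a sketch

Lists can be subset with `[`, `[[`, and `$`. The following is a minimal sketch of how the three differ, assuming a small hypothetical list `a` that is not part of the course data.

```r
# hypothetical list for illustration; element names are arbitrary
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# `[` keeps the result as a list (here, a list of 2)
str(a[c("a", "b")])

# `[[` extracts one element and drops a level of hierarchy
a[["a"]]

# `$` is shorthand for `[[` when the element is named
a$b

# extractions can be chained to reach inside a nested list
a$d[[2]]
```

Applied to a list, `[` always returns a list, so use `[[` or `$` when the element itself is needed.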
--- class: inverse, middle # Iteration --- ## Iteration ```r df <- tibble( a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10) ) ``` ```r median(df$a) ## [1] 0.1642894 median(df$b) ## [1] 0.01641118 median(df$c) ## [1] 0.2734794 median(df$d) ## [1] -0.639297 ``` --- ## Iteration three ways 1. `for` loops 1. `map_*()` functions 1. `across()` --- class: inverse, middle # Iteration with `for` loops --- ## Iteration with `for` loop ```r output <- vector(mode = "double", length = ncol(df)) for (i in seq_along(df)) { output[[i]] <- median(df[[i]]) } output ``` ``` ## [1] 0.16428940 0.01641118 0.27347942 -0.63929695 ``` --- ## Output ```r output <- vector(mode = "double", length = ncol(df)) ``` ```r vector(mode = "double", length = ncol(df)) ## [1] 0 0 0 0 vector(mode = "logical", length = ncol(df)) ## [1] FALSE FALSE FALSE FALSE vector(mode = "character", length = ncol(df)) ## [1] "" "" "" "" vector(mode = "list", length = ncol(df)) ## [[1]] ## NULL ## ## [[2]] ## NULL ## ## [[3]] ## NULL ## ## [[4]] ## NULL ``` --- ## Sequence ```r i in seq_along(df) ``` ```r seq_along(df) ``` ``` ## [1] 1 2 3 4 ``` --- ## Body ```r output[[i]] <- median(df[[i]]) ``` --- ## Preallocation .panelset[ .panel[.panel-name[Code] ```r # no preallocation mpg_no_preall <- tibble() for(i in 1:100){ mpg_no_preall <- bind_rows(mpg_no_preall, mpg) } # with preallocation using a list mpg_preall <- vector(mode = "list", length = 100) for(i in 1:100){ mpg_preall[[i]] <- mpg } mpg_preall <- bind_rows(mpg_preall) ``` ] .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-28-1.png" width="70%" style="display: block; margin: auto;" /> ] ] --- ## Exercise on `for()` loops <img src="https://media.giphy.com/media/DC2YXS4efT0R4wwXoY/giphy.gif" width="80%" style="display: block; margin: auto;" />
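---

## Why `seq_along()`: a sketch

The `for` loop in this section iterates over `seq_along(df)` rather than `1:ncol(df)`. The sketch below illustrates one reason that choice is safer, assuming a hypothetical zero-length input `empty` and a small data frame `df_demo` (neither comes from the slides).

```r
# hypothetical zero-length input
empty <- vector(mode = "double", length = 0)

seq_along(empty)    # integer(0), so a loop over it never runs
1:length(empty)     # counts down from 1 to 0, so a loop would run twice

# the same three components (output, sequence, body) on a small data frame
set.seed(123)       # arbitrary seed, only for reproducibility
df_demo <- data.frame(a = rnorm(10), b = rnorm(10))

output <- vector(mode = "double", length = ncol(df_demo))  # output
for (i in seq_along(df_demo)) {                            # sequence
  output[[i]] <- median(df_demo[[i]])                      # body
}
output
```

With an empty input, `1:length(empty)` would index positions 1 and 0, which is rarely what the loop author intends.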
--- class: inverse, middle # Iteration with `map_*()` functions --- ## Map functions * Why `for` loops are good * Why `map()` functions may be better * Types of `map()` functions * `map()` makes a list * `map_lgl()` makes a logical vector * `map_int()` makes an integer vector * `map_dbl()` makes a double vector * `map_chr()` makes a character vector --- ## Map functions ```r map_dbl(df, mean) ``` ``` ## a b c d ## 0.1694536 -0.1974360 0.3113976 -0.5095255 ``` ```r map_dbl(df, median) ``` ``` ## a b c d ## 0.16428940 0.01641118 0.27347942 -0.63929695 ``` ```r map_dbl(df, sd) ``` ``` ## a b c d ## 0.5311992 1.0300788 0.8834578 1.0414939 ``` --- ## Map functions ```r map_dbl(df, mean, na.rm = TRUE) ``` ``` ## a b c d ## 0.1694536 -0.1974360 0.3113976 -0.5095255 ``` -- ```r df %>% map_dbl(mean, na.rm = TRUE) ``` ``` ## a b c d ## 0.1694536 -0.1974360 0.3113976 -0.5095255 ``` --- ## Exercise on writing `map_*()` functions <img src="https://media.giphy.com/media/cjbfyJrICOaKIXBWyG/giphy.gif" width="80%" style="display: block; margin: auto;" />
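---

## Choosing a `map_*()` variant: a sketch

Each `map_*()` variant listed earlier returns a different type of vector. The following is a minimal sketch with one call per variant, assuming a hypothetical data frame `df_types` that is not part of the course data.

```r
library(purrr)

# hypothetical data frame for illustration only
df_types <- data.frame(
  x = c(1, 2, NA),
  y = c(4L, 5L, 6L),
  z = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

map_lgl(df_types, is.numeric)                        # logical vector
map_int(df_types, length)                            # integer vector
map_chr(df_types, class)                             # character vector
map_dbl(df_types[c("x", "y")], mean, na.rm = TRUE)   # double vector
map(df_types, unique)                                # list
```

The typed variants raise an error if the function does not return a single value of the expected type for every element, which can catch mistakes earlier than `map()` would.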
--- class: inverse, middle # Iteration in data frames with `across()` --- # Single column ```r car_prices %>% summarize(Price = mean(Price)) ``` ``` ## # A tibble: 1 × 1 ## Price ## <dbl> ## 1 21343. ``` --- # Multiple columns ```r car_prices %>% summarize( Price = mean(Price), Mileage = mean(Mileage), Cylinder = mean(Cylinder), Doors = mean(Doors), Cruise = mean(Cruise), Sound = mean(Sound), Leather = mean(Leather), Buick = mean(Buick), Cadillac = mean(Cadillac), Chevy = mean(Chevy), Pontiac = mean(Pontiac), Saab = mean(Saab), Saturn = mean(Saturn), convertible = mean(convertible), coupe = mean(coupe), hatchback = mean(hatchback), sedan = mean(sedan), wagon = mean(wagon) ) ``` ``` ## # A tibble: 1 × 18 ## Price Mileage Cylin…¹ Doors Cruise Sound Leather Buick Cadil…² Chevy Pontiac ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21343. 19832. 5.27 3.53 0.752 0.679 0.724 0.0995 0.0995 0.398 0.187 ## # … with 7 more variables: Saab <dbl>, Saturn <dbl>, convertible <dbl>, ## # coupe <dbl>, hatchback <dbl>, sedan <dbl>, wagon <dbl>, and abbreviated ## # variable names ¹Cylinder, ²Cadillac ``` --- <img src="../../../../../../../../img/dplyr_across.png" width="80%" style="display: block; margin: auto;" /> --- ## `dplyr::across()` `across()` has two primary arguments: * `.cols`, selects the columns you want to operate on * `.fns`, is a function or list of functions to apply to each column --- ## `summarize()`, `across()`, and `everything()` .panelset[ .panel[.panel-name[Single function] ```r car_prices %>% summarize(across(.cols = everything(), .fns = mean)) ``` ``` ## # A tibble: 1 × 18 ## Price Mileage Cylin…¹ Doors Cruise Sound Leather Buick Cadil…² Chevy Pontiac ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21343. 19832. 
5.27 3.53 0.752 0.679 0.724 0.0995 0.0995 0.398 0.187 ## # … with 7 more variables: Saab <dbl>, Saturn <dbl>, convertible <dbl>, ## # coupe <dbl>, hatchback <dbl>, sedan <dbl>, wagon <dbl>, and abbreviated ## # variable names ¹Cylinder, ²Cadillac ``` ] .panel[.panel-name[Multiple functions] ```r car_prices %>% summarize(across(everything(), .fns = list(min, max))) ``` ``` ## # A tibble: 1 × 36 ## Price_1 Price_2 Mileage_1 Mileage_2 Cylinder_1 Cylin…¹ Doors_1 Doors_2 Cruis…² ## <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> ## 1 8639. 70755. 266 50387 4 8 2 4 0 ## # … with 27 more variables: Cruise_2 <int>, Sound_1 <int>, Sound_2 <int>, ## # Leather_1 <int>, Leather_2 <int>, Buick_1 <int>, Buick_2 <int>, ## # Cadillac_1 <int>, Cadillac_2 <int>, Chevy_1 <int>, Chevy_2 <int>, ## # Pontiac_1 <int>, Pontiac_2 <int>, Saab_1 <int>, Saab_2 <int>, ## # Saturn_1 <int>, Saturn_2 <int>, convertible_1 <int>, convertible_2 <int>, ## # coupe_1 <int>, coupe_2 <int>, hatchback_1 <int>, hatchback_2 <int>, ## # sedan_1 <int>, sedan_2 <int>, wagon_1 <int>, wagon_2 <int>, and … ``` ] .panel[.panel-name[Multiple named functions] ```r car_prices %>% summarize(across(everything(), .fns = list(min = min, max = max))) ``` ``` ## # A tibble: 1 × 36 ## Price_min Price_max Mileage_…¹ Milea…² Cylin…³ Cylin…⁴ Doors…⁵ Doors…⁶ Cruis…⁷ ## <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> ## 1 8639. 70755. 
266 50387 4 8 2 4 0 ## # … with 27 more variables: Cruise_max <int>, Sound_min <int>, Sound_max <int>, ## # Leather_min <int>, Leather_max <int>, Buick_min <int>, Buick_max <int>, ## # Cadillac_min <int>, Cadillac_max <int>, Chevy_min <int>, Chevy_max <int>, ## # Pontiac_min <int>, Pontiac_max <int>, Saab_min <int>, Saab_max <int>, ## # Saturn_min <int>, Saturn_max <int>, convertible_min <int>, ## # convertible_max <int>, coupe_min <int>, coupe_max <int>, ## # hatchback_min <int>, hatchback_max <int>, sedan_min <int>, … ``` ] .panel[.panel-name[Grouped by] ```r car_prices %>% group_by(Cylinder) %>% summarize(across(everything(), .fns = mean)) ``` ``` ## # A tibble: 3 × 18 ## Cylinder Price Mileage Doors Cruise Sound Leather Buick Cadil…¹ Chevy Pontiac ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 4 17863. 20108. 3.44 0.599 0.698 0.746 0 0 0.457 0.127 ## 2 6 20081. 19564. 3.74 0.868 0.706 0.606 0.258 0.0645 0.387 0.258 ## 3 8 38968. 19575. 3.2 1 0.52 1 0 0.6 0.2 0.2 ## # … with 7 more variables: Saab <dbl>, Saturn <dbl>, convertible <dbl>, ## # coupe <dbl>, hatchback <dbl>, sedan <dbl>, wagon <dbl>, and abbreviated ## # variable name ¹Cadillac ``` ] ] --- ## `worldbank` ```r data("worldbank", package = "rcis") worldbank ``` ``` ## # A tibble: 78 × 14 ## iso3c date iso2c country perc_en…¹ rnd_g…² percg…³ real_…⁴ gdp_c…⁵ top10…⁶ ## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ARG 2005 AR Argentina 89.1 0.379 15.5 6198. 5110. 35 ## 2 ARG 2006 AR Argentina 88.7 0.400 22.1 7388. 5919. 33.9 ## 3 ARG 2007 AR Argentina 89.2 0.402 22.8 8182. 7245. 33.8 ## 4 ARG 2008 AR Argentina 90.7 0.421 21.6 8576. 9021. 32.5 ## 5 ARG 2009 AR Argentina 89.6 0.519 18.9 7904. 8225. 31.4 ## 6 ARG 2010 AR Argentina 89.5 0.518 17.9 8803. 10386. 32 ## 7 ARG 2011 AR Argentina 88.9 0.537 17.9 9528. 12849. 31 ## 8 ARG 2012 AR Argentina 89.0 0.609 16.5 9301. 13083. 29.7 ## 9 ARG 2013 AR Argentina 89.0 0.612 15.3 9367. 13080. 
29.4 ## 10 ARG 2014 AR Argentina 87.7 0.613 16.1 8903. 12335. 29.9 ## # … with 68 more rows, 4 more variables: employment_ratio <dbl>, ## # life_exp <dbl>, pop_growth <dbl>, pop <dbl>, and abbreviated variable names ## # ¹perc_energy_fosfuel, ²rnd_gdpshare, ³percgni_adj_gross_savings, ## # ⁴real_netinc_percap, ⁵gdp_capita, ⁶top10perc_incshare ``` --- ## `summarize()`, `across()`, and `where()` .panelset[ .panel[.panel-name[Single condition] ```r worldbank %>% group_by(country) %>% summarize(across(.cols = where(is.numeric), .fns = mean, na.rm = TRUE)) ``` ``` ## # A tibble: 6 × 11 ## country perc_…¹ rnd_g…² percg…³ real_…⁴ gdp_c…⁵ top10…⁶ emplo…⁷ life_…⁸ ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Argentina 89.1 0.501 17.5 8560. 10648. 31.6 55.4 75.4 ## 2 China 87.6 1.67 48.3 3661. 5397. 30.8 69.8 74.7 ## 3 Indonesia 65.3 0.0841 30.5 2041. 2881. 31.2 62.5 69.5 ## 4 Norway 58.9 1.60 37.2 70775. 85622. 21.9 67.3 81.3 ## 5 United Kingdom 86.3 1.68 13.5 34542. 43416. 26.2 58.7 80.4 ## 6 United States 84.2 2.69 17.6 42824. 51285. 
30.1 60.2 78.4 ## # … with 2 more variables: pop_growth <dbl>, pop <dbl>, and abbreviated ## # variable names ¹perc_energy_fosfuel, ²rnd_gdpshare, ## # ³percgni_adj_gross_savings, ⁴real_netinc_percap, ⁵gdp_capita, ## # ⁶top10perc_incshare, ⁷employment_ratio, ⁸life_exp ``` ] .panel[.panel-name[Compound condition] ```r worldbank %>% group_by(country) %>% summarize(across( .cols = where(is.numeric) & starts_with("perc"), .fns = mean, na.rm = TRUE )) ``` ``` ## # A tibble: 6 × 3 ## country perc_energy_fosfuel percgni_adj_gross_savings ## <chr> <dbl> <dbl> ## 1 Argentina 89.1 17.5 ## 2 China 87.6 48.3 ## 3 Indonesia 65.3 30.5 ## 4 Norway 58.9 37.2 ## 5 United Kingdom 86.3 13.5 ## 6 United States 84.2 17.6 ``` ] ] --- ## `across()` and `mutate()` ```r car_prices %>% mutate(across(.cols = Price:Doors, .fns = log10)) ``` ``` ## # A tibble: 804 × 18 ## Price Mileage Cylinder Doors Cruise Sound Leather Buick Cadil…¹ Chevy Pontiac ## <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> ## 1 4.36 4.30 0.778 0.602 1 0 0 1 0 0 0 ## 2 4.34 4.13 0.778 0.301 1 1 0 0 0 1 0 ## 3 4.46 4.50 0.602 0.301 1 1 1 0 0 0 0 ## 4 4.49 4.35 0.602 0.301 1 0 0 0 0 0 0 ## 5 4.52 4.25 0.602 0.301 1 1 1 0 0 0 0 ## 6 4.48 4.37 0.602 0.301 1 0 0 0 0 0 0 ## 7 4.52 4.24 0.602 0.301 1 1 1 0 0 0 0 ## 8 4.48 4.44 0.602 0.301 1 0 1 0 0 0 0 ## 9 4.48 4.40 0.602 0.301 1 0 0 0 0 0 0 ## 10 4.43 4.24 0.602 0.602 1 0 1 0 0 0 0 ## # … with 794 more rows, 7 more variables: Saab <int>, Saturn <int>, ## # convertible <int>, coupe <int>, hatchback <int>, sedan <int>, wagon <int>, ## # and abbreviated variable name ¹Cadillac ``` --- ## ~~`across()`~~ and `filter()` .panelset[ .panel[.panel-name[`if_any()`] ```r worldbank %>% filter(if_any(everything(), ~ !is.na(.x))) ``` ``` ## # A tibble: 78 × 14 ## iso3c date iso2c country perc_en…¹ rnd_g…² percg…³ real_…⁴ gdp_c…⁵ top10…⁶ ## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ARG 2005 AR Argentina 89.1 0.379 15.5 6198. 5110. 
35 ## 2 ARG 2006 AR Argentina 88.7 0.400 22.1 7388. 5919. 33.9 ## 3 ARG 2007 AR Argentina 89.2 0.402 22.8 8182. 7245. 33.8 ## 4 ARG 2008 AR Argentina 90.7 0.421 21.6 8576. 9021. 32.5 ## 5 ARG 2009 AR Argentina 89.6 0.519 18.9 7904. 8225. 31.4 ## 6 ARG 2010 AR Argentina 89.5 0.518 17.9 8803. 10386. 32 ## 7 ARG 2011 AR Argentina 88.9 0.537 17.9 9528. 12849. 31 ## 8 ARG 2012 AR Argentina 89.0 0.609 16.5 9301. 13083. 29.7 ## 9 ARG 2013 AR Argentina 89.0 0.612 15.3 9367. 13080. 29.4 ## 10 ARG 2014 AR Argentina 87.7 0.613 16.1 8903. 12335. 29.9 ## # … with 68 more rows, 4 more variables: employment_ratio <dbl>, ## # life_exp <dbl>, pop_growth <dbl>, pop <dbl>, and abbreviated variable names ## # ¹perc_energy_fosfuel, ²rnd_gdpshare, ³percgni_adj_gross_savings, ## # ⁴real_netinc_percap, ⁵gdp_capita, ⁶top10perc_incshare ``` ] .panel[.panel-name[`if_all()`] ```r worldbank %>% filter(if_all(everything(), ~ !is.na(.x))) ``` ``` ## # A tibble: 42 × 14 ## iso3c date iso2c country perc_en…¹ rnd_g…² percg…³ real_…⁴ gdp_c…⁵ top10…⁶ ## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ARG 2005 AR Argentina 89.1 0.379 15.5 6198. 5110. 35 ## 2 ARG 2006 AR Argentina 88.7 0.400 22.1 7388. 5919. 33.9 ## 3 ARG 2007 AR Argentina 89.2 0.402 22.8 8182. 7245. 33.8 ## 4 ARG 2008 AR Argentina 90.7 0.421 21.6 8576. 9021. 32.5 ## 5 ARG 2009 AR Argentina 89.6 0.519 18.9 7904. 8225. 31.4 ## 6 ARG 2010 AR Argentina 89.5 0.518 17.9 8803. 10386. 32 ## 7 ARG 2011 AR Argentina 88.9 0.537 17.9 9528. 12849. 31 ## 8 ARG 2012 AR Argentina 89.0 0.609 16.5 9301. 13083. 29.7 ## 9 ARG 2013 AR Argentina 89.0 0.612 15.3 9367. 13080. 29.4 ## 10 ARG 2014 AR Argentina 87.7 0.613 16.1 8903. 12335. 
29.9 ## # … with 32 more rows, 4 more variables: employment_ratio <dbl>, ## # life_exp <dbl>, pop_growth <dbl>, pop <dbl>, and abbreviated variable names ## # ¹perc_energy_fosfuel, ²rnd_gdpshare, ³percgni_adj_gross_savings, ## # ⁴real_netinc_percap, ⁵gdp_capita, ⁶top10perc_incshare ``` ] ] --- ## Exercise on `across()` iteration <img src="https://c.tenor.com/W-k3aSz4_r4AAAAC/heart-eyes-hearties.gif" width="80%" style="display: block; margin: auto;" />
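---

## Extra arguments in `across()`: a sketch

The `where()` examples in this section pass `na.rm = TRUE` to `mean()` through `...`. As of dplyr 1.1.0 that form is deprecated in favor of an anonymous function, the same `~ .x` style used with `if_any()` and `if_all()`. The following is a minimal sketch, assuming a hypothetical tibble `scores` that is not part of the course data.

```r
library(dplyr)

# hypothetical data for illustration only
scores <- tibble(
  group = c("a", "a", "b", "b"),
  x = c(1, NA, 3, 5),
  y = c(10, 20, NA, 40)
)

# supply na.rm inside the anonymous function instead of through `...`
scores_means <- scores %>%
  group_by(group) %>%
  summarize(across(.cols = where(is.numeric), .fns = ~ mean(.x, na.rm = TRUE)))

scores_means
```

The grouping column is excluded from `across()` automatically, so only `x` and `y` are summarized.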