class: center, middle, inverse, title-slide

.title[
# Reproducible workflow
]

.author[
### INFO 5940
Cornell University
]

---

class: inverse, middle

# Think of your R processes as livestock, not pets

---

## Pets or cattle?

<img src="../../../../../../../../img/pets-cattle.jpeg" width="80%" style="display: block; margin: auto;" />

---

## R Session

* R process (i.e. "session")
* Treat individual R processes and workspaces as disposable

--

### Workspace

* Packages loaded with `library()`
* User-created objects

--

.task[Treat your source code as precious, not the workspace]

---

class: middle

<img src="../../../../../../../../img/if-you-liked-it-you-should-have-saved-the-source-for-it.jpg" width="80%" style="display: block; margin: auto;" />

---

## Save code, not workspace

* Enforces reproducibility
* Easy to regenerate on-demand
* Always save commands
* Always start R with a blank slate
* Restart R often

---

## Always start R with a blank slate

<img src="../../../../../../../../img/rstudio-workspace.png" width="45%" style="display: block; margin: auto;" />

.footnote[Source: [R for Data Science](https://r4ds.had.co.nz/workflow-projects.html#what-is-real)]

---

## Bad approaches

```r
rm(list = ls())
```

* Good intent, but poor execution
* Only deletes user-created objects
* Leaves hidden dependencies on things you ran before `rm(list = ls())`, such as loaded packages, changed options, and the working directory

---

## Avoid unknown unknowns

Write every script as if it's running in a fresh process

--

Best way to ensure this: **write every script in a fresh process**

--

### Storing computationally demanding output

* `write_rds()` & `read_rds()`
* `cache: true`

---

class: inverse, middle

# Working directories and filepaths

---

## How to store work

* Split work into projects
* **We already do this**
* But why?

---

## Working directory

- Directory in a hierarchical file system dynamically associated with a process
- `getwd()` and `setwd()`

--

## `setwd()`

```r
library(tidyverse)
setwd("/Users/bensoltoff/cuddly_broccoli/verbose_funicular/foofy/data")
foofy <- read_csv("raw_foofy_data.csv")
p <- ggplot(foofy, aes(x, y)) + geom_point()
ggsave("../figs/foofy_scatterplot.png")
```

---

## Relative and absolute paths

--

#### Relative path

```
data_world_bank/API_ABW_DS2_en_csv_v2_4346306.csv
```

--

#### Absolute path

```
/Users/soltoffbc/Projects/Computing for Information Sciences/homework-seeds/hw04/data_world_bank/API_ABW_DS2_en_csv_v2_4346306.csv
```

--

Absolute paths will not work for anyone besides the original author, and even for the author they will eventually break

--

**Use relative filepaths**

---

class: inverse, middle

# Project-based workflows

---

### File system discipline

Put all files related to a single project in a designated folder

--

### Working directory intentionality

When working on project A, make sure the working directory is set to project A's folder

--

### File path discipline

All paths are relative to the project's folder

--

### Rationale for workflow

* Ensures portability
* Reliable, polite behavior

--

### RStudio Projects

* `.Rproj`

---

## Use safe filepaths

* Avoid `setwd()`
* Split work into projects
* Declare each folder as a project
* Use `here::here()`

---

class: small

## `here::here()`

```r
library(here)
here()
```

```
## [1] "/Users/soltoffbc/Projects/Computing for Social Sciences/course-site"
```

--

#### Build a file path

```r
here("static", "extras", "awesome.txt")
## [1] "/Users/soltoffbc/Projects/Computing for Social Sciences/course-site/static/extras/awesome.txt"
cat(readLines(here("static", "extras", "awesome.txt")))
## OMG this is so awesome!
```

--

#### What if we change the working directory?

```r
setwd(here("static"))
getwd()
## [1] "/Users/soltoffbc/Projects/Computing for Social Sciences/course-site/static"
cat(readLines(here("static", "extras", "awesome.txt")))
## OMG this is so awesome!
```

---

## Filepaths and Quarto documents

```
data/
  scotus.csv
analysis/
  exploratory-analysis.qmd
final-report.qmd
scotus.Rproj
```

--

* A `.qmd` file assumes its own folder is the working directory when rendered
* Run `read_csv("data/scotus.csv")`
* Run `read_csv(here("data", "scotus.csv"))`

---

## Here's a GIF of Nicolas Cage

<img src="https://media.giphy.com/media/l2Je5sSem0BybIKJi/giphy.gif" width="80%" style="display: block; margin: auto;" />

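---

## Sketch: why `"data/scotus.csv"` fails from `analysis/`

A minimal base-R sketch of the `scotus` layout from the Quarto slide, assuming the document is rendered with its own folder (`analysis/`) as the working directory. The temporary folder, the placeholder data, and the `project_root` variable are hypothetical stand-ins; in a real project `here()` plays the role of `project_root`.

```r
# Build a throwaway copy of the project layout in a temporary folder
project_root <- file.path(tempdir(), "scotus")  # hypothetical project folder
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "analysis"), showWarnings = FALSE)

# Placeholder contents; only the file's location matters here
write.csv(data.frame(x = 1:3),
          file.path(project_root, "data", "scotus.csv"),
          row.names = FALSE)

# Simulate rendering analysis/exploratory-analysis.qmd:
# the working directory becomes the document's folder
old_wd <- setwd(file.path(project_root, "analysis"))

# A path relative to the working directory looks for analysis/data/scotus.csv
relative_found <- file.exists("data/scotus.csv")

# A path built from the project root does not depend on the working directory
rooted_found <- file.exists(file.path(project_root, "data", "scotus.csv"))

# Restore the original working directory
setwd(old_wd)

relative_found  # FALSE
rooted_found    # TRUE
```

The same reasoning is why `read_csv(here("data", "scotus.csv"))` keeps working no matter which folder the `.qmd` file lives in.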
---

class: inverse, middle

# Personal R admin

---

## R startup procedures

* Customized startup
* `.Renviron`
* `.Rprofile`

---

## `.Renviron`

* Define sensitive information
* Set R-specific environment variables
* Does not use R code syntax
* `usethis::edit_r_environ()`

--

## Example `.Renviron`

```shell
R_HISTSIZE=100000
GITHUB_PAT=abc123
R_LIBS_USER=~/R/%p/%v
```

---

## `.Rprofile`

* R code to run when R starts up
* Runs after `.Renviron`
* Multiple `.Rprofile` files
    * Home directory (`~/.Rprofile`)
    * Each R Project folder
* `usethis::edit_r_profile()`

---

## Common items in `.Rprofile`

1. Set a default CRAN mirror
1. Write a welcome message
1. Customize the R prompt
1. Change options, screen width, numeric display
1. Store API keys/tokens that are necessary for only a single project

---

## Git tracking of `.Rprofile`

<img src="https://media.giphy.com/media/13e1PQJrKtqYKyO0FY/giphy.gif" width="80%" style="display: block; margin: auto;" />

---

## A couple of things America got right: [cars and freedom](https://www.youtube.com/watch?v=OnQXRxW9VcQ)

<img src="https://media.giphy.com/media/Sd8uqMJqpGpP2/giphy.gif" width="80%" style="display: block; margin: auto;" />

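---

## Sketch: a small `.Rprofile`

A minimal sketch of the first four common items listed earlier, assuming a user-level `~/.Rprofile`. The mirror URL, prompt string, and option values are illustrative choices, not course recommendations.

```r
# Hypothetical ~/.Rprofile; every value below is an illustrative choice

# Set a default CRAN mirror without overwriting other configured repositories
local({
  repos <- getOption("repos")
  repos["CRAN"] <- "https://cloud.r-project.org"
  options(repos = repos)
})

# Customize the prompt, console width, and numeric display
options(prompt = "R> ", width = 100, digits = 4)

# Print a welcome message, but only in interactive sessions,
# so that scripts run with Rscript stay quiet
if (interactive()) {
  message("Welcome back! R started at ", format(Sys.time(), "%H:%M"))
}
```

Settings like these change how R behaves on one machine only, so keep anything a script's results depend on in the script itself.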