class: center, middle, inverse, title-slide

.title[
# Introduction to Computing for Information Science
]
.author[
### INFO 5940
Cornell University
]

---
class: inverse, middle

# Intro to the course

---

## Me

<img src="../../../../../../../../img/ben-soltoff.jpg" width="50%" height="50%" style="display: block; margin: auto;" />

---

## TAs

- Catherine Yu
- Andrew Liu

---

## Course site

> https://info5940.infosci.cornell.edu/

---

<img src="../../../../../../../../img/bruce_computer.gif" width="80%" style="display: block; margin: auto;" />

---

## Major topics

* Elementary programming techniques (e.g. loops, conditional statements, functions)
* Writing reusable, interpretable code
* Problem-solving - debugging programs for errors
* Obtaining, importing, and munging data from a variety of sources
* Performing statistical analysis
* Visualizing information
* Creating interactive reports
* Generating reproducible research

---

```r
print("Hello world!")
```

```
## [1] "Hello world!"
```

---

```r
# load packages
library(tidyverse)
library(palmerpenguins)
library(broom)
library(knitr) # provides kable()

# estimate and print the linear model
lm(body_mass_g ~ flipper_length_mm, data = penguins) %>%
  tidy() %>%
  mutate(term = c("Intercept", "Flipper length (millimeters)")) %>%
  kable(digits = 2,
        col.names = c("Variable", "Estimate", "Standard Error",
                      "T-statistic", "P-Value"))

# visualize the relationship
ggplot(data = penguins,
       aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", alpha = .25) +
  labs(x = "Flipper length (in millimeters)",
       y = "Body mass (in grams)",
       color = "Species")
```

---

|Variable                     | Estimate| Standard Error| T-statistic| P-Value|
|:----------------------------|--------:|--------------:|-----------:|-------:|
|Intercept                    | -5780.83|         305.81|      -18.90|       0|
|Flipper length (millimeters) |    49.69|           1.52|       32.72|       0|

<img src="index_files/figure-html/penguins-example-1.png" width="75%" style="display: block; margin: auto;" />

---
class: inverse, middle

# Who is this class for?

---

## Who is this class for?

### Jeri

.pull-left[

<img src="../../../../../../../../img/stock-photos/000022.jpg" width="80%" style="display: block; margin: auto;" />

]

.pull-right[

* PhD student in sociology
* Feels comfortable with regression and econometric methods in Stata
* Will be analyzing a large-scale dataset for her dissertation
* Seeks a reproducible workflow to manage all her exploratory and confirmatory analysis

]

---

## Who is this class for?

### Ryan

.pull-left[

<img src="../../../../../../../../img/stock-photos/000284.jpg" width="80%" style="display: block; margin: auto;" />

]

.pull-right[

* Entering the [MPS program](https://infosci.cornell.edu/masters/mps)
* Hasn't taken a statistics class in five years
* Expects to analyze a collection of published news articles
* Wants to understand code samples he finds online so he can repurpose them for his own work

]

---

## Who is this class for?

### Fernando

.pull-left[

<img src="../../../../../../../../img/stock-photos/000232.jpg" width="80%" style="display: block; margin: auto;" />

]

.pull-right[

* Third-year undergraduate student
* Has taken general education math courses, plus the departmental methods course
* Wants to work as a research assistant on a project exploring the onset of civil conflict
* Needs to start contributing to a new research paper next quarter

]

---

## Who is this class for?

### Fang

.pull-left[

<img src="../../../../../../../../img/stock-photos/000251.jpg" width="80%" style="display: block; margin: auto;" />

]

.pull-right[

* Born and raised in Shenzhen, China
* Background in psychology, plans to apply for doctoral programs in marketing
* Is going to run 300 experiments on Amazon MTurk in the next six months
* Expects to take courses in machine learning and Bayesian statistics which require a background in R

]

---
class: inverse, middle

# Succeeding in the class

---

## Asking for help

.pull-left[

<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ZS8QHRtzcPg?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</center>

]

--

.pull-right[

<center>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">15 min rule: when stuck, you HAVE to try on your own for 15 min; after 15 min, you HAVE to ask for help.- Brain AMA <a href="https://t.co/MS7FnjXoGH">pic.twitter.com/MS7FnjXoGH</a></p>— Rachel Thomas (@math_rachel) <a href="https://twitter.com/math_rachel/status/764931533383749632">August 14, 2016</a></blockquote>
<script async src="http://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>

]

---

## Other resources

* [Google](https://www.google.com)
* [StackOverflow](http://stackoverflow.com/)
* Me
* TAs
* Fellow students
* [Class discussion page](https://github.com/cis-ds/Discussion)
* [How to properly ask for help](https://info5940.infosci.cornell.edu/faq/asking-questions/)

---
class: middle

<img src="../../../../../../../../img/plagiarism.jpg" width="70%" style="display: block; margin: auto;" />

---

## Plagiarism

* Collaboration is good - *to a point*
* Learning from others/the internet

--

.task[If you don't understand what the program is doing and are not prepared to explain it in detail, **you should not submit it**.]

---

## Evaluations

* Weekly programming assignments
* Peer review

---
class: inverse, middle

# Programming and reproducible workflows

---
class: middle

<img src="../../../../../../../../img/data-science/base.png" width="80%" style="display: block; margin: auto;" />

---

## Program

> A series of instructions that specifies how to perform a computation

* Input
* Output
* Math
* Conditional execution
* Repetition

---
class: middle

<img src="../../../../../../../../img/windows_3.1.png" width="80%" style="display: block; margin: auto;" />

---
class: middle

<img src="../../../../../../../../img/mac_os_x.png" width="80%" style="display: block; margin: auto;" />

---
class: middle

<img src="../../../../../../../../img/android_phones.jpg" width="80%" style="display: block; margin: auto;" />

---
class: middle

<img src="../../../../../../../../img/stata14.png" width="80%" style="display: block; margin: auto;" />

---
class: middle

<img src="../../../../../../../../img/shell.png" width="80%" style="display: block; margin: auto;" />

---

## Two different approaches

> Write a report analyzing the relationship between ice cream consumption and crime rates in New York City.

--

.pull-left[

### Jane: a GUI workflow

1. Searches for data files online
1. Cleans the files in Excel
1. Analyzes the data in Stata
1. Writes her report in Google Docs

]

--

.pull-right[

### Sally: a programmatic workflow

1. Creates a folder specifically for this project
    * `data`
    * `graphics`
    * `output`
1. Searches for data files online
1. Cleans the files in R
1. Analyzes the files in R
1. Writes her report in R Markdown

]

---
class: middle

<img src="https://i.imgflip.com/1szkun.jpg" width="70%" height="70%" style="display: block; margin: auto;" />

---

## Automation

* Jane forgets how she transformed and analyzed the data
    * Extension of analysis will fall flat
* Sally uses *automation*
    * Re-run programs
    * No mistakes
    * Much easier to implement *in the long run*

---

## Reproducibility

* Are my results valid?
  Can they be *replicated*?
* The idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them
* Also allows the researcher to precisely replicate their own analysis

---

## Version control

* Revisions in research
* Tracking revisions
* Multiple copies
    * `analysis-1.r`
    * `analysis-2.r`
    * `analysis-3.r`
* Cloud storage (e.g. Dropbox, Google Drive, Box)
* Version control software
    * Repository

---
class: middle

<img src="../../../../../../../../img/vcs-local.png" width="50%" height="50%" style="display: block; margin: auto;" />

---
class: middle

<img src="../../../../../../../../img/vcs-distributed.png" width="50%" height="50%" style="display: block; margin: auto;" />

---

## Documentation

* *Comments* are the what
* *Code* is the how
* Computer code should also be *self-documenting*
* Future-proofing

---

## Badly documented code

```r
library(tidyverse)
library(rtweet)
tml1 <- get_timeline("MeCookieMonster", 3000)
tml2 <- get_timeline("Grover", 3000)
tml3 <- get_timeline("elmo", 3000)
tml4 <- get_timeline("CountVonCount", 3000)
ts_plot(group_by(bind_rows(select(tml1, created_at), select(tml2, created_at), select(tml3, created_at), select(tml4, created_at), .id = "screen_name"), screen_name), by = "months")
```

---

## Good code

.tiny[
```r
# get_to_sesame_street.R
# Program to retrieve recent tweets from Sesame Street characters

# load packages for data management and Twitter API
library(tidyverse)
library(rtweet)

# retrieve most recent 3000 tweets of best Sesame Street residents
cookie <- get_timeline(
  user = "MeCookieMonster",
  n = 3000
)

grover <- get_timeline(
  user = "Grover",
  n = 3000
)

elmo <- get_timeline(
  user = "elmo",
  n = 3000
)

count_von_count <- get_timeline(
  user = "CountVonCount",
  n = 3000
)

# combine, group by character, and plot monthly tweet frequency
bind_rows(
  `Cookie Monster` = cookie %>% select(created_at),
  Grover = grover %>% select(created_at),
  Elmo = elmo %>% select(created_at),
  `Count von Count` = count_von_count %>% select(created_at),
  .id = "screen_name"
) %>%
  group_by(screen_name) %>%
  ts_plot(by = "months")
```
]

---

## Good code

<img src="index_files/figure-html/unnamed-chunk-25-1.png" width="80%" style="display: block; margin: auto;" />

---

## Software setup instructions

* [Installing software](/setup/)
* [RStudio Workbench](https://rstudio-workbench.infosci.cornell.edu/)
* Local installation
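
---

## Program: a minimal sketch

As a rough illustration of the five building blocks from the Program slide (input, output, math, conditional execution, repetition), here is one way they might fit together in base R. The temperature values, cutoffs, and labels are made up for this example and are not course data.

```r
# Input: data supplied to the program (hypothetical values)
temperatures_f <- c(31, 50, 72, 95)

# Math: convert Fahrenheit to Celsius
temperatures_c <- (temperatures_f - 32) * 5 / 9

# Repetition: visit each value in turn
labels <- character(length(temperatures_c))
for (i in seq_along(temperatures_c)) {
  # Conditional execution: choose a label based on the value
  if (temperatures_c[i] < 0) {
    labels[i] <- "freezing"
  } else if (temperatures_c[i] < 25) {
    labels[i] <- "mild"
  } else {
    labels[i] <- "hot"
  }
}

# Output: report the results
print(labels)
```

Because every step lives in a script, re-running the analysis with new inputs only requires changing the first line.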