Introduction to Computing for Information Science

INFO 5940
Cornell University

1 / 41

Intro to the course

2 / 41

Me

3 / 41

TAs

  • Catherine Yu
  • Andrew Liu
4 / 41

Major topics

  • Elementary programming techniques (e.g. loops, conditional statements, functions)
  • Writing reusable, interpretable code
  • Problem-solving - debugging programs for errors
  • Obtaining, importing, and munging data from a variety of sources
  • Performing statistical analysis
  • Visualizing information
  • Creating interactive reports
  • Generating reproducible research
7 / 41
print("Hello world!")
## [1] "Hello world!"
8 / 41
# load packages
library(tidyverse)
library(palmerpenguins)
library(broom)
library(knitr)  # provides kable()

# estimate and print the linear model
lm(body_mass_g ~ flipper_length_mm, data = penguins) %>%
  tidy() %>%
  mutate(term = c("Intercept", "Flipper length (millimeters)")) %>%
  kable(digits = 2, col.names = c("Variable", "Estimate", "Standard Error",
                                  "T-statistic", "P-Value"))

# visualize the relationship
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", alpha = .25) +
  labs(x = "Flipper length (in millimeters)",
       y = "Body mass (in grams)",
       color = "Species")
9 / 41
Variable                       Estimate   Standard Error   T-statistic   P-Value
Intercept                      -5780.83           305.81        -18.90         0
Flipper length (millimeters)      49.69             1.52         32.72         0
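
As a rough illustration of how to read the table above, the two estimates define a line: predicted body mass = intercept + slope × flipper length. The sketch below plugs in the rounded estimates from the table. The helper name predict_body_mass and the 200 mm flipper length are assumptions chosen for illustration; they do not come from the slides.

```r
# Rounded coefficient estimates copied from the table above
intercept <- -5780.83  # grams
slope <- 49.69         # grams per millimeter of flipper length

# Hypothetical helper: predicted body mass for a given flipper length
predict_body_mass <- function(flipper_length_mm) {
  intercept + slope * flipper_length_mm
}

# Example flipper length chosen for illustration only
predicted <- predict_body_mass(200)
print(predicted)
```

Under these rounded estimates, each additional millimeter of flipper length is associated with roughly 50 more grams of predicted body mass.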

10 / 41

Who is this class for?

11 / 41

Who is this class for?

Jeri

  • PhD student in sociology
  • Feels comfortable with regression and econometric methods in Stata
  • Will be analyzing a large-scale dataset for her dissertation
  • Seeks a reproducible workflow to manage all her exploratory and confirmatory analysis
12 / 41

Who is this class for?

Ryan

  • Entering the MPS program
  • Hasn't taken a statistics class in five years
  • Expects to analyze a collection of published news articles
  • Wants to understand code samples he finds online so he can repurpose them for his own work
13 / 41

Who is this class for?

Fernando

  • Third-year undergraduate student
  • Has taken general education math courses, plus the departmental methods course
  • Wants to work as a research assistant on a project exploring the onset of civil conflict
  • Needs to start contributing to a new research paper next quarter
14 / 41

Who is this class for?

Fang

  • Born and raised in Shenzhen, China
  • Background in psychology, plans to apply for doctoral programs in marketing
  • Is going to run 300 experiments on Amazon MTurk in the next six months
  • Expects to take courses in machine learning and Bayesian statistics which require a background in R
15 / 41

Succeeding in the class

16 / 41

Asking for help

17 / 41

Other resources

18 / 41

Plagiarism

  • Collaboration is good - to a point
  • Learning from others/the internet

If you don't understand what the program is doing and are not prepared to explain it in detail,

you should not submit it.

20 / 41

Evaluations

  • Weekly programming assignments
  • Peer review
21 / 41

Programming and reproducible workflows

22 / 41

Program

A series of instructions that specifies how to perform a computation

  • Input
  • Output
  • Math
  • Conditional execution
  • Repetition
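
The five building blocks listed above can be seen together in one very small program. This is a minimal sketch in base R; the temperature values and the 25 degree cutoff are made up for illustration.

```r
# Input: data the program receives (hard-coded here; it could come from a file)
temperatures_f <- c(32, 50, 68, 86, 104)

# Math: convert Fahrenheit to Celsius
temperatures_c <- (temperatures_f - 32) * 5 / 9

# Repetition: visit each value in turn
labels <- character(length(temperatures_c))
for (i in seq_along(temperatures_c)) {
  # Conditional execution: choose a label based on the value
  if (temperatures_c[i] >= 25) {
    labels[i] <- "warm"
  } else {
    labels[i] <- "cool"
  }
}

# Output: report the results
print(data.frame(celsius = temperatures_c, label = labels))
```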
24 / 41

Two different approaches

Write a report analyzing the relationship between ice cream consumption and crime rates in New York City.

Jane: a GUI workflow

  1. Searches for data files online
  2. Cleans the files in Excel
  3. Analyzes the data in Stata
  4. Writes her report in Google Docs

Sally: a programmatic workflow

  1. Creates a folder specifically for this project
    • data
    • graphics
    • output
  2. Searches for data files online
  3. Cleans the files in R
  4. Analyzes the files in R
  5. Writes her report in R Markdown
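
Step 1 of Sally's workflow can itself be scripted. The sketch below is one possible way to do it in base R; the project name "ice-cream-crime" is hypothetical, and tempdir() is used only so the example does not touch real files.

```r
# Hypothetical project location; tempdir() keeps this sketch away from real files
project_dir <- file.path(tempdir(), "ice-cream-crime")

# Subfolders named on the slide
subfolders <- c("data", "graphics", "output")

# Create the project folder and each subfolder
for (folder in subfolders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created
print(list.dirs(project_dir, full.names = FALSE, recursive = FALSE))
```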
30 / 41

Automation

  • Jane forgets how she transformed and analyzed the data
    • Extending or repeating the analysis later is difficult
  • Sally uses automation
    • Can re-run her programs whenever the data or analysis change
    • Fewer mistakes from repeating steps by hand
    • Much easier to implement in the long run
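
To make the contrast concrete, here is a minimal sketch of an analysis step written as a function so it can be re-run unchanged when new data arrive. The function name summarize_relationship, the column names, and all numbers are made up for illustration; they are not real ice cream or crime data.

```r
# Hypothetical analysis step wrapped in a function so it can be re-run
summarize_relationship <- function(data) {
  # Drop incomplete rows, then compute the correlation
  complete <- data[complete.cases(data), ]
  cor(complete$ice_cream, complete$crime)
}

# Made-up numbers for illustration only (not real data)
first_batch <- data.frame(
  ice_cream = c(10, 20, 30, NA, 50),
  crime = c(12, 18, 33, 40, 47)
)
updated_batch <- rbind(first_batch, data.frame(ice_cream = 60, crime = 61))

# The same steps run identically each time; nothing has to be remembered
print(summarize_relationship(first_batch))
print(summarize_relationship(updated_batch))
```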
32 / 41

Reproducibility

  • Are my results valid? Can they be replicated?
  • The idea that data analyses, and more generally scientific claims, are published with their data and software code so that others may verify the findings and build upon them
  • Also allows researchers to precisely replicate their own analyses
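
One small, concrete piece of reproducibility is making sure that code involving randomness gives the same answer every time it is run. The sketch below shows this with set.seed() in base R; the function run_analysis and the seed value are hypothetical. It illustrates only this one aspect and does not cover publishing data or code.

```r
# Hypothetical analysis that involves random sampling
run_analysis <- function(seed) {
  set.seed(seed)  # fix the state of the random number generator
  sampled <- sample(1:100, size = 10)
  mean(sampled)
}

first_run <- run_analysis(seed = 123)
second_run <- run_analysis(seed = 123)

# With the same seed and the same code, the two runs should match exactly
print(identical(first_run, second_run))
```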
33 / 41

Version control

  • Revisions in research
  • Tracking revisions
  • Multiple copies
    • analysis-1.r
    • analysis-2.r
    • analysis-3.r
  • Cloud storage (e.g. Dropbox, Google Drive, Box)
  • Version control software
    • Repository
34 / 41

Documentation

  • Comments are the what
  • Code is the how
  • Computer code should also be self-documenting
  • Future-proofing
37 / 41

Badly documented code

library(tidyverse)
library(rtweet)
tml1 <- get_timeline("MeCookieMonster", 3000)
tml2 <- get_timeline("Grover", 3000)
tml3 <- get_timeline("elmo", 3000)
tml4 <- get_timeline("CountVonCount", 3000)
ts_plot(group_by(bind_rows(select(tml1, created_at), select(tml2, created_at), select(tml3, created_at), select(tml4, created_at), .id = "screen_name"), screen_name), by = "months")
38 / 41

Good code

# get_to_sesame_street.R
# Program to retrieve recent tweets from Sesame Street characters

# load packages for data management and Twitter API
library(tidyverse)
library(rtweet)

# retrieve most recent 3000 tweets of best Sesame Street residents
cookie <- get_timeline(
  user = "MeCookieMonster",
  n = 3000
)
grover <- get_timeline(
  user = "Grover",
  n = 3000
)
elmo <- get_timeline(
  user = "elmo",
  n = 3000
)
count_von_count <- get_timeline(
  user = "CountVonCount",
  n = 3000
)

# combine, group by character, and plot monthly tweet frequency
bind_rows(
  `Cookie Monster` = cookie %>% select(created_at),
  Grover = grover %>% select(created_at),
  Elmo = elmo %>% select(created_at),
  `Count von Count` = count_von_count %>% select(created_at),
  .id = "screen_name"
) %>%
  group_by(screen_name) %>%
  ts_plot(by = "months")
39 / 41

Software setup instructions

41 / 41
