Introduction to Computing for Information Science

INFO 5940
Cornell University

Intro to the course

  • Catherine Yu
  • Andrew Liu
Major topics

  • Elementary programming techniques (e.g. loops, conditional statements, functions)
  • Writing reusable, interpretable code
  • Problem-solving - debugging programs for errors
  • Obtaining, importing, and munging data from a variety of sources
  • Performing statistical analysis
  • Visualizing information
  • Creating interactive reports
  • Generating reproducible research
print("Hello world!")
## [1] "Hello world!"
# load packages (inferred from the functions used below)
library(tidyverse)
library(broom)
library(knitr)
library(palmerpenguins)

# estimate and print the linear model
lm(body_mass_g ~ flipper_length_mm, data = penguins) %>%
  tidy() %>%
  mutate(term = c("Intercept", "Flipper length (millimeters)")) %>%
  kable(digits = 2, col.names = c("Variable", "Estimate", "Standard Error",
                                  "T-statistic", "P-Value"))

# visualize the relationship
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", alpha = .25) +
  labs(x = "Flipper length (in millimeters)",
       y = "Body mass (in grams)",
       color = "Species")
Variable                       Estimate   Standard Error   T-statistic   P-Value
Intercept                      -5780.83           305.81        -18.90         0
Flipper length (millimeters)      49.69             1.52         32.72         0

Who is this class for?

  • PhD student in sociology
  • Feels comfortable with regression and econometric methods in Stata
  • Will be analyzing a large-scale dataset for her dissertation
  • Seeks a reproducible workflow to manage all her exploratory and confirmatory analysis
Who is this class for?


  • Entering the MPS program
  • Hasn't taken a statistics class in five years
  • Expects to analyze a collection of published news articles
  • Wants to understand code samples he finds online so he can repurpose them for his own work
Who is this class for?


  • Third-year undergraduate student
  • Has taken general education math courses, plus the departmental methods course
  • Wants to work as a research assistant on a project exploring the onset of civil conflict
  • Needs to start contributing to a new research paper next quarter
Who is this class for?


  • Born and raised in Shenzhen, China
  • Background in psychology, plans to apply for doctoral programs in marketing
  • Is going to run 300 experiments on Amazon MTurk in the next six months
  • Expects to take courses in machine learning and Bayesian statistics which require a background in R
Succeeding in the class

Asking for help

Other resources

  • Collaboration is good - to a point
  • Learning from others/the internet

If you don't understand what the program is doing and are not prepared to explain it in detail, you should not submit it.

  • Weekly programming assignments
  • Peer review
Programming and reproducible workflows

A series of instructions that specifies how to perform a computation

  • Input
  • Output
  • Math
  • Conditional execution
  • Repetition
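
As a minimal sketch, the five building blocks above can all appear in one short R program. The temperature values and the label cutoffs are made up for illustration.

```r
# Input: a small vector of temperatures in Fahrenheit (made-up values)
temps_f <- c(31, 50, 72, 95)

# Math: convert each value to Celsius
temps_c <- (temps_f - 32) * 5 / 9

# Repetition: visit each temperature in turn
labels <- character(length(temps_c))
for (i in seq_along(temps_c)) {
  # Conditional execution: pick a label based on the value
  if (temps_c[i] < 0) {
    labels[i] <- "freezing"
  } else if (temps_c[i] < 25) {
    labels[i] <- "mild"
  } else {
    labels[i] <- "hot"
  }
}

# Output: print the results as a table
print(data.frame(fahrenheit = temps_f, celsius = round(temps_c, 1), label = labels))
```
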
Two different approaches

Write a report analyzing the relationship between ice cream consumption and crime rates in New York City.

Jane: a GUI workflow

  1. Searches for data files online
  2. Cleans the files in Excel
  3. Analyzes the data in Stata
  4. Writes her report in Google Docs

Sally: a programmatic workflow

  1. Creates a folder specifically for this project
    • data
    • graphics
    • output
  2. Searches for data files online
  3. Cleans the files in R
  4. Analyzes the files in R
  5. Writes her report in R Markdown
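
The first step of Sally's workflow could be scripted roughly as follows. This is only a sketch: the project name `ice-cream-crime` is hypothetical, and a temporary directory is used so the example can be run without touching real files.

```r
# Hypothetical project name; tempdir() keeps the sketch harmless to run
project_dir <- file.path(tempdir(), "ice-cream-crime")

# Subfolders named in the workflow above
subfolders <- c("data", "graphics", "output")

# Create the project folder and each subfolder inside it
for (folder in subfolders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Confirm what was created
list.files(project_dir)
```

Keeping raw data, figures, and results in separate folders makes it easier to re-run the whole analysis from scratch.
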
  • Jane forgets how she transformed and analyzed the data
    • Extension of analysis will fall flat
  • Sally uses automation
    • Re-run programs
    • Fewer manual mistakes
    • Much easier to implement in the long run
  • Are my results valid? Can they be replicated?
  • The idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them
  • Also allows the researcher to precisely replicate their own analysis
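
One small piece of precise replication is making sure that code involving random numbers gives the same answer every time it is run. The sketch below assumes simulated data in place of a real dataset; the function name `run_analysis` and the seed value are illustrative only.

```r
# Simulated data stand in for a real dataset (values are made up)
run_analysis <- function(seed) {
  set.seed(seed)            # fix the random number generator
  x <- rnorm(100)
  y <- 2 * x + rnorm(100)
  coef(lm(y ~ x))           # return the estimated coefficients
}

first_run  <- run_analysis(seed = 123)
second_run <- run_analysis(seed = 123)

# Same code and same seed: check whether the two runs agree exactly
identical(first_run, second_run)
```
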
Version control

  • Revisions in research
  • Tracking revisions
  • Multiple copies
    • analysis-1.r
    • analysis-2.r
    • analysis-3.r
  • Cloud storage (e.g. Dropbox, Google Drive, Box)
  • Version control software
    • Repository
  • Comments are the what
  • Code is the how
  • Computer code should also be self-documenting
  • Future-proofing
Badly documented code

tml1 <- get_timeline("MeCookieMonster", 3000)
tml2 <- get_timeline("Grover", 3000)
tml3 <- get_timeline("elmo", 3000)
tml4 <- get_timeline("CountVonCount", 3000)
ts_plot(group_by(bind_rows(select(tml1, created_at), select(tml2, created_at), select(tml3, created_at), select(tml4, created_at), .id = "screen_name"), screen_name), by = "months")
Good code

# get_to_sesame_street.R
# Program to retrieve recent tweets from Sesame Street characters

# load packages for data management and Twitter API
library(tidyverse)
library(rtweet)

# retrieve most recent 3000 tweets of best Sesame Street residents
cookie <- get_timeline(
  user = "MeCookieMonster",
  n = 3000
)
grover <- get_timeline(
  user = "Grover",
  n = 3000
)
elmo <- get_timeline(
  user = "elmo",
  n = 3000
)
count_von_count <- get_timeline(
  user = "CountVonCount",
  n = 3000
)

# combine, group by character, and plot monthly tweet frequency
bind_rows(
  `Cookie Monster` = cookie %>% select(created_at),
  Grover = grover %>% select(created_at),
  Elmo = elmo %>% select(created_at),
  `Count von Count` = count_von_count %>% select(created_at),
  .id = "screen_name"
) %>%
  group_by(screen_name) %>%
  ts_plot(by = "months")
Software setup instructions
