Introduction to Computing for Information Science

.title[
# Introduction to Computing for Information Science
]
.author[
### INFO 5940 <br /> Cornell University
]

---

# Intro to the course

---

## Me

---

## TAs

- Catherine Yu
- Andrew Liu

---

## Course site

> https://info5940.infosci.cornell.edu/

---

## Major topics

* Elementary programming techniques (e.g. loops, conditional statements, functions)
* Writing reusable, interpretable code
* Problem-solving: debugging programs to find and fix errors
* Obtaining, importing, and munging data from a variety of sources
* Performing statistical analysis
* Visualizing information
* Creating interactive reports
* Generating reproducible research

---

```r
print("Hello world!")
```

```
## [1] "Hello world!"
```

---

```r
# load packages
library(tidyverse)
library(palmerpenguins)
library(broom)

# estimate and print the linear model
lm(body_mass_g ~ flipper_length_mm, data = penguins) %>%
  tidy() %>%
  mutate(term = c("Intercept", "Flipper length (millimeters)")) %>%
  knitr::kable(digits = 2, col.names = c("Variable", "Estimate", "Standard Error",
                                         "T-statistic", "P-Value"))

# visualize the relationship
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", alpha = .25) +
  labs(x = "Flipper length (in millimeters)",
       y = "Body mass (in grams)",
       color = "Species")
```

---

|Variable                     | Estimate| Standard Error| T-statistic| P-Value|
|:----------------------------|--------:|--------------:|-----------:|-------:|
|Intercept                    | -5780.83|         305.81|      -18.90|       0|
|Flipper length (millimeters) |    49.69|           1.52|       32.72|       0|
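
One way to read the table is to plug the estimates into the fitted line by hand. The sketch below does that for a hypothetical penguin; the 200 mm flipper length is an arbitrary example value, not a figure from the data.

```r
# coefficients copied from the table above (rounded to two digits)
intercept <- -5780.83
slope <- 49.69

# hypothetical penguin with a 200 mm flipper (example value only)
flipper_length_mm <- 200

# predicted body mass in grams: intercept + slope * flipper length
predicted_mass_g <- intercept + slope * flipper_length_mm
predicted_mass_g
```

With these rounded estimates the prediction works out to roughly 4,157 grams, and each additional millimeter of flipper length adds about 50 grams to the predicted mass.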

---

# Who is this class for?

---

## Who is this class for?

### Jeri

* PhD student in sociology
* Feels comfortable with regression and econometric methods in Stata
* Will be analyzing a large-scale dataset for her dissertation
* Seeks a reproducible workflow to manage all her exploratory and confirmatory analysis

---

## Who is this class for?

### Ryan

* Entering the [MPS program](https://infosci.cornell.edu/masters/mps)
* Hasn't taken a statistics class in five years
* Expects to analyze a collection of published news articles
* Wants to understand code samples he finds online so he can repurpose them for his own work

---

## Who is this class for?

### Fernando

* Third-year undergraduate student
* Has taken general education math courses, plus the departmental methods course
* Wants to work as a research assistant on a project exploring the onset of civil conflict
* Needs to start contributing to a new research paper next quarter

---

## Who is this class for?

### Fang

* Born and raised in Shenzhen, China
* Background in psychology, plans to apply for doctoral programs in marketing
* Is going to run 300 experiments on Amazon MTurk in the next six months
* Expects to take courses in machine learning and Bayesian statistics which require a background in R

---

# Succeeding in the class

---

## Asking for help

<center>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">15 min rule: when stuck, you HAVE to try on your own for 15 min; after 15 min, you HAVE to ask for help.- Brain AMA <a href="https://t.co/MS7FnjXoGH">pic.twitter.com/MS7FnjXoGH</a></p>&mdash; Rachel Thomas (@math_rachel) <a href="https://twitter.com/math_rachel/status/764931533383749632">August 14, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>

---

## Other resources

* [Google](https://www.google.com)
* [Stack Overflow](https://stackoverflow.com/)
* Me
* TAs
* Fellow students
* [Class discussion page](https://github.com/cis-ds/Discussion)
    * [How to properly ask for help](https://info5940.infosci.cornell.edu/faq/asking-questions/)

---

## Plagiarism

* Collaboration is good - *to a point*
* Learning from others/the internet

.task[If you don't understand what the program is doing and are not prepared to explain it in detail,

**you should not submit it**.]

---

## Evaluations

* Weekly programming assignments
* Peer review

---

# Programming and reproducible workflows

---

## Program

> A series of instructions that specifies how to perform a computation

* Input
* Output
* Math
* Conditional execution
* Repetition
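
As a rough illustration, the five building blocks above can all appear in a few lines of R. The exam scores are made-up values used only for this sketch.

```r
# input: a small vector of exam scores (made-up values)
scores <- c(88, 92, 61, 75)

# math: compute the average score
average <- mean(scores)

# repetition and conditional execution: label each score
labels <- character(length(scores))
for (i in seq_along(scores)) {
  if (scores[i] >= average) {
    labels[i] <- "at or above average"
  } else {
    labels[i] <- "below average"
  }
}

# output: print the results
print(average)
print(labels)
```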

---

## Two different approaches

> Write a report analyzing the relationship between ice cream consumption and crime rates in New York City.

### Jane: a GUI workflow

1. Searches for data files online
1. Cleans the files in Excel
1. Analyzes the data in Stata
1. Writes her report in Google Docs

### Sally: a programmatic workflow

1. Creates a folder specifically for this project, with subfolders:
    * `data`
    * `graphics`
    * `output`
1. Searches for data files online
1. Cleans the files in R
1. Analyzes the data in R
1. Writes her report in R Markdown
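
Sally's first step could be scripted along these lines. This is only a sketch: the project name `ice-cream-crime` is hypothetical, and the folder is created inside a temporary directory so the example does not touch any real files.

```r
# hypothetical project location inside a temporary directory,
# so the sketch does not touch your real files
project_dir <- file.path(tempdir(), "ice-cream-crime")

# subfolders named in Sally's workflow
subfolders <- c("data", "graphics", "output")

# create the project folder and each subfolder
for (folder in subfolders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# confirm what was created
list.files(project_dir)
```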

---

## Automation

* Jane forgets how she transformed and analyzed the data
    * Extension of analysis will fall flat
* Sally uses *automation*
    * Re-run programs
    * Fewer manual mistakes
    * Much easier to implement *in the long run*
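
A minimal sketch of what re-running a program can look like: the analysis lives in a function, so the same steps are applied identically whenever the data change. The function name `summarize_by_group` is hypothetical, and the built-in `mtcars` data stand in for real project data.

```r
# a reusable function: the same steps run identically every time
summarize_by_group <- function(data, group_var, value_var) {
  # mean of value_var within each level of group_var
  tapply(data[[value_var]], data[[group_var]], mean)
}

# first run on the built-in mtcars data (a stand-in for real project data)
first_run <- summarize_by_group(mtcars, "cyl", "mpg")

# "new data arrives": re-run the exact same code on an updated data frame
updated <- rbind(mtcars, mtcars[1:5, ])
second_run <- summarize_by_group(updated, "cyl", "mpg")

first_run
second_run
```

Nothing about the analysis has to be remembered or repeated by hand; only the input changes.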

---

## Reproducibility

* Are my results valid? Can they be *replicated*?
* The idea that data analyses, and more generally scientific claims, are published with their data and software code so that others may verify the findings and build upon them
* Also allows researchers to precisely replicate their own analyses
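
Two small habits that support this, sketched under the assumption that the analysis involves random numbers: fixing the random seed and recording the software environment. The seed value `1234` is arbitrary.

```r
# fix the random seed so the "random" draws are the same on every run
# (the seed value 1234 is arbitrary)
set.seed(1234)
sample_a <- rnorm(5)

# resetting the same seed reproduces exactly the same numbers
set.seed(1234)
sample_b <- rnorm(5)

# TRUE when both runs produced identical draws
identical(sample_a, sample_b)

# record the R version and loaded packages alongside the results
sessionInfo()
```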

---

## Version control

* Revisions in research
* Tracking revisions
* Multiple copies
    * `analysis-1.r`
    * `analysis-2.r`
    * `analysis-3.r`
* Cloud storage (e.g. Dropbox, Google Drive, Box)
* Version control software
    * Repository

---

## Documentation

* *Comments* are the what
* *Code* is the how
* Computer code should also be *self-documenting*
* Future-proofing

---

## Badly documented code

```r
library(tidyverse)
library(rtweet)
tml1 <- get_timeline("MeCookieMonster", 3000)
tml2 <- get_timeline("Grover", 3000)
tml3 <- get_timeline("elmo", 3000)
tml4 <- get_timeline("CountVonCount", 3000)
ts_plot(group_by(bind_rows(select(tml1, created_at), select(tml2, created_at), select(tml3, created_at), select(tml4, created_at), .id = "screen_name"), screen_name), by = "months")
```

---

## Good code

```r
# get_to_sesame_street.R
# Program to retrieve recent tweets from Sesame Street characters

# load packages for data management and Twitter API
library(tidyverse)
library(rtweet)

# retrieve most recent 3000 tweets of best Sesame Street residents
cookie <- get_timeline(
  user = "MeCookieMonster",
  n = 3000
)

grover <- get_timeline(
  user = "Grover",
  n = 3000
)

elmo <- get_timeline(
  user = "elmo",
  n = 3000
)

count_von_count <- get_timeline(
  user = "CountVonCount",
  n = 3000
)

# combine, group by character, and plot monthly tweet frequency
bind_rows(
  `Cookie Monster` = cookie %>% select(created_at),
  Grover = grover %>% select(created_at),
  Elmo = elmo %>% select(created_at),
  `Count von Count` = count_von_count %>% select(created_at),
  .id = "screen_name"
) %>%
  group_by(screen_name) %>%
  ts_plot(by = "months")
```

---

## Software setup instructions

* [Installing software](/setup/)
* [RStudio Workbench](https://rstudio-workbench.infosci.cornell.edu/)
* Local installation