Getting data from the web: scraping

---

## Web scraping

* Data on a website with no API
* Still want a programmatic, reproducible way to obtain data
* Ability to scrape depends on the quality of the website

---

# HyperText Markup Language

---

## HTML

---

## Process of HTML

1. The web browser sends a request to the server that hosts the website
1. The server sends the browser an HTML document
1. The browser uses instructions in the HTML to render the website

---

## Components of HTML code

```html
<html>
  <head>
    <title>Title</title>
    <link rel="icon" type="icon" href="http://a" />
    <script src="https://c.js"></script>
  </head>
  <body>
    <div>
      <p>Click <b>here</b> now.</p>
      <span>Frozen</span>
    </div>
    <table style="width:100%">
      <tr>
        <td>Kristen</td>
        <td>Bell</td>
      </tr>
    </table>
  <img src="http://ia.media-imdb.com/images.png"/>
  </body>
</html>
```

---

## Components of HTML code

```html
<a href="http://github.com">GitHub</a>
```

* `<a></a>` - element name
* `href` - attribute (argument)
* `"http://github.com"` - attribute (value)
* `GitHub` - content

---

## Nested structure of HTML

* `html`
    * `head`
        * `title`
        * `link`
        * `script`
    * `body`
        * `div`
            * `p`
                * `b`
            * `span`
        * `table`
            * `tr`
                * `td`
                * `td`
        * `img`

---

## Find the content "here"

* `html`
    * `head`
        * `title`
        * `link`
        * `script`
    * `body`
        * `div`
            * `p`
                * <span style="color:red">**`b`**</span>
            * `span`
        * `table`
            * `tr`
                * `td`
                * `td`
        * `img`

---

## HTML only

---

# Cascading Style Sheets

---

## HTML + CSS

---

## CSS code

```css
span {
  color: #ffffff;
}

.num {
  color: #a8660d;
}

table.data {
  width: auto;
}

#firstname {
  background-color: yellow;
}
```

---

## CSS code

```html
<span class="bigname" id="shiny">Shiny</span>
```

* `<span></span>` - element name
* `bigname` - class (optional)
* `shiny` - id (optional)

---

## CSS selectors

```css
span
```

```css
.bigname
```

```css
span.bigname
```

```css
#shiny
```

---

## CSS selectors

Prefix | Matches
-------|--------
none   | element
.      | class
#      | id

> [CSS diner](http://flukeout.github.io)

---

## Find the CSS selector

```html
<body>
    <table id="content">
        <tr class='name'>
            <td class='firstname'>
                Kurtis
            </td>
            <td class='lastname'>
                McCoy
            </td>
        </tr>
        <tr class='name'>
            <td class='firstname'>
                Leah
            </td>
            <td class='lastname'>
                Guerrero
            </td>
        </tr>
    </table>
</body>
```

]

1. The entire table
1. Just the element containing first names

]

---

.footnote[Source: [James Webb Space Telescope/NASA](https://webbtelescope.org/contents/media/images/2022/038/01G7JGTH21B5GN9VCYAHBXKSD1)]

---

## Scraping presidential statements

.footnote[[Space exploration](https://www.presidency.ucsb.edu/advanced-search?field-keywords=%22space+exploration%22&field-keywords2=&field-keywords3=&from%5Bdate%5D=&to%5Bdate%5D=&person2=&items_per_page=100)]

---

# `rvest` for web scraping

---

## Using `rvest` to read HTML

1. Collect the HTML source code of a webpage
2. Read the HTML of the page
3. Select and keep certain elements of the page that are of interest

---

## Get the page

```r
dwight <- read_html(x = "https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration")
dwight
```

```
## {html_document}
## <html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
## [1] <head profile="http://www.w3.org/1999/xhtml/vocab">\n<meta charset="utf-8 ...
## [2] <body class="html not-front not-logged-in one-sidebar sidebar-first page- ...
```

---

## Find page elements

`rvest` lets you find elements by

1. HTML elements
1. HTML attributes
1. CSS selectors

---

## Find `a` elements

```r
html_elements(x = dwight, css = "a")
```

```
## {xml_nodeset (73)}
##  [1] <a href="#main-content" class="element-invisible element-focusable">Skip ...
##  [2] <a href="https://www.presidency.ucsb.edu/">The American Presidency Proje ...
##  [3] <a class="btn btn-default" href="https://www.presidency.ucsb.edu/about"> ...
##  [4] <a class="btn btn-default" href="/advanced-search"><span class="glyphico ...
##  [5] <a href="https://www.ucsb.edu/" target="_blank"><img alt="ucsb wordmark  ...
##  [6] <a href="/documents" class="active-trail dropdown-toggle" data-toggle="d ...
##  [7] <a href="/documents/presidential-documents-archive-guidebook">Guidebook</a>
##  [8] <a href="/documents/category-attributes">Category Attributes</a>
##  [9] <a href="/statistics">Statistics</a>
## [10] <a href="/media" title="">Media Archive</a>
## [11] <a href="/presidents" title="">Presidents</a>
## [12] <a href="/analyses" title="">Analyses</a>
## [13] <a href="https://giving.ucsb.edu/Funds/Give?id=185" title="">GIVE</a>
## [14] <a href="/documents/presidential-documents-archive-guidebook" title="">A ...
## [15] <a href="/documents" title="" class="active-trail">Categories</a>
## [16] <a href="/documents/category-attributes" title="">Attributes</a>
## [17] <a href="/documents/app-categories/presidential" title="Presidential (73 ...
## [18] <a href="/documents/app-categories/spoken-addresses-and-remarks/presiden ...
## [19] <a href="/documents/app-categories/spoken-addresses-and-remarks/presiden ...
## [20] <a href="/documents/app-categories/written-presidential-orders/president ...
## ...
```

---

## SelectorGadget

* GUI tool used to identify CSS selector combinations from a webpage
1. Read [here](https://rvest.tidyverse.org/articles/articles/selectorgadget.html)
1. Drag **SelectorGadget** link into your browser's bookmark bar

---

## Using SelectorGadget

1. Navigate to a webpage
1. Open the SelectorGadget bookmark
1. Click on the item to scrape
1. Click on yellow items you do not want to scrape
1. Click on additional items that you do want to scrape
1. Rinse and repeat until only the items you want to scrape are highlighted in yellow
1. Copy the selector to use with `html_elements()`

---

## Find the CSS selector

Use Selector Gadget to find the CSS selector for the document's *speaker*.

Then, modify an argument in `html_elements` to look for this more specific CSS selector.

---

## Find the CSS selector

```r
html_elements(x = dwight, css = ".diet-title a")
```

```
## {xml_nodeset (1)}
## [1] <a href="/people/president/dwight-d-eisenhower">Dwight D. Eisenhower</a>
```

---

## Get attributes and text of elements

```r
# identify element with speaker name
speaker <- html_elements(dwight, ".diet-title a") %>% 
  html_text2() # Select text of element

speaker
```

```
## [1] "Dwight D. Eisenhower"
```

---

## Get attributes and text of elements

```r
speaker_link <- html_elements(dwight, ".diet-title a") %>% 
  html_attr("href")

speaker_link
```

```
## [1] "/people/president/dwight-d-eisenhower"
```

---

## Date of statement

```r
date <- html_elements(x = dwight, css = ".date-display-single") %>%
  html_text2() %>% # Grab element text
  mdy() # Format using lubridate
date
```

```
## [1] "1958-04-02"
```

---

## Speaker name

```r
speaker <- html_elements(x = dwight, css = ".diet-title a") %>%
  html_text2()
speaker
```

```
## [1] "Dwight D. Eisenhower"
```
    
---

## Title

```r
title <- html_elements(x = dwight, css = "h1") %>%
  html_text2()
title
```

```
## [1] "Special Message to the Congress Relative to Space Science and Exploration."
```

---

## Text

```r
text <- html_elements(x = dwight, css = "div.field-docs-content") %>%
  html_text2()

# This is a long document, so let's just display the first 1,000 characters
text %>% str_sub(1, 1000) 
```

```
## [1] "To the Congress of the United States:\n\nRecent developments in long-range rockets for military purposes have for the first time provided man with new machinery so powerful that it can put satellites into orbit, and eventually provide the means for space exploration. The United States of America and the Union of Soviet Socialist Republics have already successfully placed in orbit a number of earth satellites. In fact, it is now within the means of any technologically advanced nation to embark upon practicable programs for exploring outer space. The early enactment of appropriate legislation will help assure that the United States takes full advantage of the knowledge of its scientists, the skill of its engineers and technicians, and the resourcefulness of its industry in meeting the challenges of the space age.\n\nDuring the past several months my Special Assistant for Science and Technology and the President's Science Advisory Committee, of which he is the Chairman, have been conducting a"
```
    
---

## Make a function

Make a function called `scrape_docs` that

- Accepts a URL of an individual document
- Scrapes the page
- Returns a data frame containing the document's
    - Date
    - Speaker
    - Title
    - Full text

---

## Practice scraping data

1. Look up the cost of living for Ithaca, NY on [Sperling's Best Places](http://www.bestplaces.net/)
1. Extract it with `html_elements()` and `html_text2()`

---

## Practice scraping data

```r
ithaca <- read_html("https://www.bestplaces.net/cost_of_living/city/new_york/ithaca")

col <- html_elements(ithaca, css = "#mainContent_dgCostOfLiving tr:nth-child(2) td:nth-child(2)")
html_text2(col)
```

```
## [1] "98.4"
```

```r
# or use a piped operation
ithaca %>%
  html_elements(css = "#mainContent_dgCostOfLiving tr:nth-child(2) td:nth-child(2)") %>%
  html_text2()
```

```
## [1] "98.4"
```

---

## Tables

```r
tables <- html_elements(ithaca, css = "table")

tables %>%
  # get the first table
  nth(1) %>%
  # convert to data frame
  html_table(header = TRUE)
```

```
## # A tibble: 8 × 4
##   `COST OF LIVING` Ithaca   `New York` USA     
##   <chr>            <chr>    <chr>      <chr>   
## 1 Overall          98.4     121.5      100     
## 2 Grocery          104.6    103.8      100     
## 3 Health           116.3    120.7      100     
## 4 Housing          100.8    127.9      100     
## 5 Median Home Cost $294,100 $373,000   $291,700
## 6 Utilities        100      115.9      100     
## 7 Transportation   80.8     140.7      100     
## 8 Miscellaneous    151.5    121.8      100
```

---

## Extract climate statistics

> Extract the climate statistics of your hometown as a data frame with useful column names

---

## Extract climate statistics

```r
ithaca_climate <- read_html("http://www.bestplaces.net/climate/city/new_york/ithaca")
climate <- html_elements(ithaca_climate, css = "table")
html_table(climate, header = TRUE, fill = TRUE)[[1]]
## # A tibble: 9 × 3
##   ``                            `Ithaca, New York` `United States`
##   <chr>                         <chr>              <chr>          
## 1 Rainfall                      37.4 in.           38.1 in.       
## 2 Snowfall                      63.3 in.           27.8 in.       
## 3 Precipitation                 159.3 days         106.2 days     
## 4 Sunny                         155 days           205 days       
## 5 Avg. July High                80.9°              85.8°          
## 6 Avg. Jan. Low                 14.7°              21.7°          
## 7 Comfort Index (higher=better) 6.4                7              
## 8 UV Index                      3.2                4.3            
## 9 Elevation                     410 ft.            2443 ft.
```

]

```r
ithaca_climate %>%
  html_elements(css = "table") %>%
  nth(1) %>%
  html_table(header = TRUE)
## # A tibble: 9 × 3
##   ``                            `Ithaca, New York` `United States`
##   <chr>                         <chr>              <chr>          
## 1 Rainfall                      37.4 in.           38.1 in.       
## 2 Snowfall                      63.3 in.           27.8 in.       
## 3 Precipitation                 159.3 days         106.2 days     
## 4 Sunny                         155 days           205 days       
## 5 Avg. July High                80.9°              85.8°          
## 6 Avg. Jan. Low                 14.7°              21.7°          
## 7 Comfort Index (higher=better) 6.4                7              
## 8 UV Index                      3.2                4.3            
## 9 Elevation                     410 ft.            2443 ft.
```

]
]

---

## Random observations on scraping

* Make sure you've obtained only what you want
* If you are having trouble parsing, try selecting a smaller subset of the thing you are seeking
* Confirm that there is no R package and no API