An introduction to machine learning

---

# What is machine learning?

---

## Types of machine learning

- Supervised
- Unsupervised

---

## Examples of supervised learning

- Will a user click on this ad?
- Will a police officer engage in misconduct in the next six months?
- How many individuals will become infected with COVID-19 in the next week?

---

## Two modes

### Classification

Will this home sell in the next 30 days?

### Regression

What will the sale price be for this home?

---

## Two cultures

### Statistics

- model first
- inference emphasis

### Machine Learning

- data first
- prediction emphasis

---
name: train-love
background-image: url(images/train.jpg)
background-size: contain
background-color: #f6f6f6

---
template: train-love
class: center, top

# Statistics

---

> *"Statisticians, like artists, have the bad habit of falling in love with their models."*
>
> &mdash; George Box

---

# `tidymodels`

---

## Predictive modeling

```r
library(tidymodels)
```

---

# The Bechdel test

---

## The Bechdel test

.footnote[Source: [FiveThirtyEight](https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/)]

---

## The Bechdel test

1. It has to have at least two named women in it
1. Who talk to each other
1. About something besides a man

```r
library(rcis)
data("bechdel")
glimpse(bechdel)
## Rows: 1,394
## Columns: 10
## $ year          <dbl> 2013, 2013, 2013, 2013, 2013, 2013, …
## $ title         <chr> "12 Years a Slave", "2 Guns", "42", …
## $ test          <fct> Fail, Fail, Fail, Fail, Fail, Pass, …
## $ budget_2013   <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,…
## $ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362…
## $ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5…
## $ rated         <chr> "R", "R", "PG-13", "PG-13", "R", "R"…
## $ metascore     <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, …
## $ imdb_rating   <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5…
## $ genre         <chr> "Biography", "Action", "Biography", …
```

---

## Bechdel test data

- N = 1394
- 1 categorical outcome: `test`
- 9 predictors

---

# What is the goal of machine learning?

## Build .display[models] that

## generate .display[accurate predictions]

## for .display[future, yet-to-be-seen data].

---

## Machine learning

We'll use this goal to drive learning of 3 core `tidymodels` packages:

- `parsnip`
- `rsample`
- `yardstick`

---

## 🔨 Build models with `parsnip`

---

## parsnip

---

## `glm()`

```r
glm(test ~ metascore, family = binomial, data = bechdel)
## 
## Call:  glm(formula = test ~ metascore, family = binomial, data = bechdel)
## 
## Coefficients:
## (Intercept)    metascore  
##    0.052274    -0.004563  
## 
## Degrees of Freedom: 1393 Total (i.e. Null);  1392 Residual
## Null Deviance:	    1916 
## Residual Deviance: 1914 	AIC: 1918
```

---

## To specify a model with `parsnip`

1. Pick a **model**
1. Set the **engine**
1. Set the **mode** (if needed)

---

## To specify a model with `parsnip`

```r
logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

---

## To specify a model with `parsnip`

```r
decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
```

---

## To specify a model with `parsnip`

```r
nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
## K-Nearest Neighbor Model Specification (classification)
## 
## Computational engine: kknn
```

---

## 1\. Pick a model

All available models are listed at

<https://www.tidymodels.org/find/parsnip/>

---

## `logistic_reg()`

Specifies a model that uses logistic regression

```r
logistic_reg(penalty = NULL, mixture = NULL)
```

---

## `logistic_reg()`

Specifies a model that uses logistic regression

```r
logistic_reg(
  mode = "classification", # "default" mode, if exists
  penalty = NULL, # model hyper-parameter
  mixture = NULL # model hyper-parameter
)
```

---

## `set_engine()`

Adds an engine to power or implement the model.

```r
logistic_reg() %>% set_engine(engine = "glm")
```

Alternatively, set the engine when you define the model type.

```r
logistic_reg(engine = "glm")
```

---

## `set_mode()`

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

```r
logistic_reg() %>% set_mode(mode = "classification")
```

---

## ⏱ Your turn 1

Run the chunk in your .qmd and look at the output. Then, copy/paste the code and edit to create:

+ a decision tree model for classification

+ that uses the `C5.0` engine.

Save it as `tree_mod` and look at the object. What is different about the output?

*Hint: you'll need https://www.tidymodels.org/find/parsnip/*

---

```r
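# assumed definition of lr_mod, matching the specification printed below
lr_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
lr_mod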
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

tree_mod <- decision_tree() %>%
  set_engine(engine = "C5.0") %>%
  set_mode("classification")
tree_mod
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
```

---
class: inverse, middle

## Now we've built a model.

## But, how do we .display[use] a model?

## First - what does it mean to use a model?

---
class: inverse, middle, center

![](https://media.giphy.com/media/fhAwk4DnqNgw8/giphy.gif)

Statistical models learn from the data.

Many learn model parameters, which *can* be useful as values for inference and interpretation.

---

## `fit()`

Trains a model by fitting it to data. Returns a parsnip model fit.

```r
fit(tree_mod, test ~ metascore + imdb_rating, data = bechdel)
```

---

## `fit()`

Trains a model by fitting it to data. Returns a parsnip model fit.

```r
tree_mod %>% # parsnip model
  fit(test ~ metascore + imdb_rating, # a formula
    data = bechdel # dataframe
  )
```

---

## `fit()`

Trains a model by fitting it to data. Returns a parsnip model fit.

```r
tree_fit <- tree_mod %>% # parsnip model
  fit(test ~ metascore + imdb_rating, # a formula
    data = bechdel # dataframe
  )
```

---

## A fitted model

```r
lr_mod %>%
  fit(test ~ metascore + imdb_rating,
    data = bechdel
  ) %>%
  broom::tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   2.70     0.436        6.20 5.64e-10
## 2 metascore     0.0202   0.00481      4.20 2.66e- 5
## 3 imdb_rating  -0.606    0.0889      -6.82 8.87e-12
```

<img src="index_files/figure-html/unnamed-chunk-33-1.png" width="80%" style="display: block; margin: auto;" />

---

## "All models are wrong, but some are useful"

```
##           Truth
## Prediction Fail Pass
##       Fail  613  421
##       Pass  159  201
```

---

## "All models are wrong, but some are useful"

```
##           Truth
## Prediction Fail Pass
##       Fail  583  397
##       Pass  189  225
```
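
One way to read the two tables is to reduce each to a single accuracy figure. The sketch below uses base R only; the object names `cm_first`, `cm_second`, and `accuracy()` are hypothetical helpers introduced for illustration, and the counts are copied from the tables as shown.

```r
# Confusion matrices copied from the two tables
# (rows = prediction, columns = truth; matrix() fills column by column).
labels <- list(Prediction = c("Fail", "Pass"), Truth = c("Fail", "Pass"))
cm_first <- matrix(c(613, 159, 421, 201), nrow = 2, dimnames = labels)
cm_second <- matrix(c(583, 189, 397, 225), nrow = 2, dimnames = labels)

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc_first <- accuracy(cm_first)
acc_second <- accuracy(cm_second)
round(c(first = acc_first, second = acc_second), 3)
```

Both values come out near 58%, only slightly better than always predicting `Fail`, which would be right for 772 of the 1,394 films (about 55%).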

---

## Axiom

The best way to measure a model's performance at predicting new data is to .display[predict new data].

---

## ♻️ Resample models with `rsample`

---

## `rsample`

---

## The holdout method

---

## `initial_split()`

"Splits" data randomly into a single testing and a single training set.

```r
initial_split(data, prop = 3 / 4)
```

---

## `initial_split()`

```r
bechdel_split <- initial_split(data = bechdel, strata = test, prop = 3 / 4)
bechdel_split
## <Training/Testing/Total>
## <1045/349/1394>
```

---

## `training()` and `testing()`

Extract training and testing sets from an `rsplit`

```r
training(bechdel_split)
testing(bechdel_split)
```

---

## `training()`

```r
bechdel_train <- training(bechdel_split)
bechdel_train
## # A tibble: 1,045 × 10
##     year title         test  budge…¹ domgr…² intgr…³ rated metas…⁴ imdb_…⁵ genre
##    <dbl> <chr>         <fct>   <dbl>   <dbl>   <dbl> <chr>   <dbl>   <dbl> <chr>
##  1  2013 12 Years a S… Fail     2       5.31   15.9  R          97     8.3 Biog…
##  2  2013 2 Guns        Fail     6.1     7.56   13.2  R          55     6.8 Acti…
##  3  2013 42            Fail     4       9.50    9.50 PG-13      62     7.6 Biog…
##  4  2013 47 Ronin      Fail    22.5     3.84   14.6  PG-13      29     6.6 Acti…
##  5  2013 A Good Day t… Fail     9.2     6.73   30.4  R          28     5.4 Acti…
##  6  2013 After Earth   Fail    13       6.05   24.4  PG-13      33     5   Acti…
##  7  2013 Cloudy with … Fail     7.8    12.0    27.2  PG         59     6.5 Anim…
##  8  2013 Don Jon       Fail     0.55    2.45    2.64 R          66     6.8 Come…
##  9  2013 Escape Plan   Fail     7       2.52   10.4  R          49     6.8 Acti…
## 10  2013 Gangster Squ… Fail     6       4.60   10.4  R          40     6.8 Acti…
## # … with 1,035 more rows, and abbreviated variable names ¹budget_2013,
## #   ²domgross_2013, ³intgross_2013, ⁴metascore, ⁵imdb_rating
```

---

## ⏱ Your turn 2

Fill in the blanks.

Use `initial_split()`, `training()`, and `testing()` to:

1. Split **bechdel** into training and test sets. Save the rsplit!

2. Extract the training data and fit your classification tree model.

3. Check the proportions of the `test` variable in each set.

Keep `set.seed(100)` at the start of your code.

---

```r
set.seed(100) # Important!

bechdel_split <- initial_split(bechdel, strata = test, prop = 3 / 4)
bechdel_train <- training(bechdel_split)
bechdel_test <- testing(bechdel_split)
```
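
The solution code stops after the split. Step 3 (checking the proportions of `test` in each set) could be sketched as follows. To stay self-contained, this uses a hypothetical stand-in data frame `movies` that carries only the class counts reported for `bechdel` in these slides (772 `Fail`, 622 `Pass`); with the real data, the same calls would apply to `bechdel_train` and `bechdel_test`.

```r
library(rsample)

set.seed(100)

# Hypothetical stand-in for `bechdel`: an id plus the outcome column,
# with the class counts reported in the slides (772 Fail, 622 Pass).
movies <- data.frame(
  id = seq_len(1394),
  test = factor(rep(c("Fail", "Pass"), times = c(772, 622)))
)

movies_split <- initial_split(movies, strata = test, prop = 3 / 4)
movies_train <- training(movies_split)
movies_test <- testing(movies_split)

# Share of each class in the training and testing sets
train_prop <- prop.table(table(movies_train$test))
test_prop <- prop.table(table(movies_test$test))
round(rbind(train = train_prop, test = test_prop), 3)
```

Because the split is stratified on `test`, the two rows should be nearly identical.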

---

## Data Splitting

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

## We can only use the testing set once!

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

## How can we use the training set to compare, evaluate, and tune models?

---

## Cross-validation

---

## V-fold cross-validation

```r
vfold_cv(data, v = 10, ...)
```

---

## Guess

How many times does an observation/row appear in the assessment set?

---

## Quiz

If we use 10 folds, what percentage of our data will end up in the .display[analysis] set and what percentage in the .display[assessment] set for each fold?

90% - analysis

10% - assessment
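
The 90/10 answer, and the fact that every row is assessed exactly once, can be checked with a base R sketch of fold assignment. This illustrates the idea only and is not how `vfold_cv()` is implemented; `fold_id`, `assessment_n`, and `analysis_n` are names introduced here.

```r
set.seed(100)

n <- 1045 # rows in the training set, as in the split shown earlier
v <- 10   # number of folds

# Give every row exactly one fold label, in random order.
# A row is in the assessment set only for the fold it is labelled with.
fold_id <- sample(rep(seq_len(v), length.out = n))

# Fold sizes are the assessment set sizes; the rest is the analysis set
assessment_n <- as.vector(table(fold_id))
analysis_n <- n - assessment_n

round(assessment_n / n, 3) # each close to 0.10
round(analysis_n / n, 3)   # each close to 0.90
```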

---

# Stratified sampling

---

## What if...

The assessment set looked like this?

---

## Or this?

---
<img src="index_files/figure-html/unnamed-chunk-83-1.png" width="80%" style="display: block; margin: auto;" />

---

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-84-1.png" width="80%" style="display: block; margin: auto;" />
]

## Original

```
##  test   n percent
##  Fail 772   55.4%
##  Pass 622   44.6%
```

## Resample

```
##  test Analysis Assessment
##  Fail   75.00%     25.00%
##  Pass        -          -
```

---
class: middle

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-86-1.png" width="80%" style="display: block; margin: auto;" />
]

## Original

```
##  test   n percent
##  Fail 772   55.4%
##  Pass 622   44.6%
```

## Resample

```
##  test Analysis Assessment
##  Fail   75.00%     25.00%
##  Pass   74.92%     25.08%
```
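
The Resample table reports, for each class, the share of its rows that landed in the analysis and assessment sets. A base R sketch that samples 75% of the rows within each class gives the same shares. The use of `floor()` is an assumption chosen to match the table, not a statement about rsample's internals.

```r
set.seed(100)

# Outcome vector with the class counts from the "Original" table
test <- factor(rep(c("Fail", "Pass"), times = c(772, 622)))

# Stratified sampling: draw 75% of the row indices within each class.
# floor() is an assumption chosen to match the table.
analysis_id <- unlist(lapply(split(seq_along(test), test), function(idx) {
  sample(idx, size = floor(0.75 * length(idx)))
}))

in_analysis <- seq_along(test) %in% analysis_id

# Per-class counts in the analysis set
analysis_counts <- table(test[in_analysis])

# Per-class share of rows that went to the analysis set, in percent
analysis_share <- analysis_counts / table(test)
round(100 * analysis_share, 2)
```

Sampling within each class keeps the `Fail`/`Pass` balance of the analysis set close to that of the original data, whatever the random draw.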

---

## ⏱ Your turn 3

Run the code below. What does it return?

```r
set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = test)
bechdel_folds
```

---

```r
set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = test)
bechdel_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits            id    
##    <list>            <chr> 
##  1 <split [940/105]> Fold01
##  2 <split [940/105]> Fold02
##  3 <split [940/105]> Fold03
##  4 <split [940/105]> Fold04
##  5 <split [940/105]> Fold05
##  6 <split [940/105]> Fold06
##  7 <split [941/104]> Fold07
##  8 <split [941/104]> Fold08
##  9 <split [941/104]> Fold09
## 10 <split [942/103]> Fold10
```

---

## `fit_resamples()`

Trains and tests a resampled model.

```r
tree_mod %>%
  fit_resamples(
    test ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
```

---

```r
tree_mod %>%
  fit_resamples(
    test ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
## # Resampling results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 4
##    splits            id     .metrics         .notes  
##    <list>            <chr>  <list>           <list>  
##  1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble>
##  2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble>
##  3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble>
##  4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble>
##  5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble>
##  6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble>
##  7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble>
##  8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble>
##  9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble>
## 10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble>
```

---

## `collect_metrics()`

Unnests the metrics column from a `fit_resamples()` result.

```r
results %>% collect_metrics(summarize = TRUE) # results: output of fit_resamples()
```

---

```r
tree_fit <- tree_mod %>% 
  fit_resamples(
    test ~ metascore + imdb_rating, 
    resamples = bechdel_folds
  )
collect_metrics(tree_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config           
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
## 1 accuracy binary     0.549    10  0.0121 Preprocessor1_Mod…
## 2 roc_auc  binary     0.559    10  0.0127 Preprocessor1_Mod…
```

```r
collect_metrics(tree_fit, summarize = FALSE)
## # A tibble: 20 × 5
##    id     .metric  .estimator .estimate .config             
##    <chr>  <chr>    <chr>          <dbl> <chr>               
##  1 Fold01 accuracy binary         0.610 Preprocessor1_Model1
##  2 Fold01 roc_auc  binary         0.625 Preprocessor1_Model1
##  3 Fold02 accuracy binary         0.610 Preprocessor1_Model1
##  4 Fold02 roc_auc  binary         0.621 Preprocessor1_Model1
##  5 Fold03 accuracy binary         0.562 Preprocessor1_Model1
##  6 Fold03 roc_auc  binary         0.562 Preprocessor1_Model1
##  7 Fold04 accuracy binary         0.552 Preprocessor1_Model1
##  8 Fold04 roc_auc  binary         0.535 Preprocessor1_Model1
##  9 Fold05 accuracy binary         0.495 Preprocessor1_Model1
## 10 Fold05 roc_auc  binary         0.502 Preprocessor1_Model1
## # … with 10 more rows
```

---

## 10-fold CV

- 10 different analysis/assessment sets

- 10 different models (trained on .display[analysis] sets)

- 10 different sets of performance statistics (on .display[assessment] sets)
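
The three bullets can be made concrete with a hand-rolled loop in base R. This is a sketch under stated assumptions, not how `fit_resamples()` works internally: the data are simulated, a plain `glm()` stands in for the model, and the standard error is taken to be the standard deviation of the fold estimates divided by the square root of the number of folds.

```r
set.seed(100)

# Hypothetical data: a binary outcome weakly related to one predictor
n <- 500
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(0.5 * x), "Pass", "Fail"))
dat <- data.frame(x = x, y = y)

v <- 10
fold_id <- sample(rep(seq_len(v), length.out = n))

# One model and one accuracy estimate per fold
fold_acc <- vapply(seq_len(v), function(k) {
  analysis <- dat[fold_id != k, ]
  assessment <- dat[fold_id == k, ]
  mod <- glm(y ~ x, family = binomial, data = analysis)
  # glm models the probability of the second factor level ("Pass")
  p_pass <- predict(mod, newdata = assessment, type = "response")
  pred <- ifelse(p_pass > 0.5, "Pass", "Fail")
  mean(pred == assessment$y)
}, numeric(1))

# Summarise across folds: mean and standard error of the mean
cv_mean <- mean(fold_acc)
cv_std_err <- sd(fold_acc) / sqrt(v)
c(mean = cv_mean, std_err = cv_std_err)
```

Each element of `fold_acc` comes from a model that never saw the rows it is scored on, which is what makes the averaged estimate a fair guess at performance on new data.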

---

## 📏 Evaluate models with `yardstick`

---

## `yardstick`

<https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics>

---

## `roc_curve()`

Takes predictions from a `fit_resamples()` call that saved them with `control_resamples(save_pred = TRUE)`.

Returns a tibble of thresholds with the specificity and sensitivity at each one.

```r
roc_curve(data, truth, ...)
```

`truth` = .display[actual] outcome

`...` = .display[predicted] probability of outcome

---

```r
tree_preds <- tree_mod %>% 
  fit_resamples(
    test ~ metascore + imdb_rating, 
    resamples = bechdel_folds,
    control = control_resamples(save_pred = TRUE) # keep the out-of-sample predictions
  )

tree_preds %>% 
  collect_predictions() %>% 
  roc_curve(truth = test, .pred_Fail)
## # A tibble: 36 × 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1   -Inf          0            1    
##  2      0.415      0            1    
##  3      0.420      0.0236       0.974
##  4      0.423      0.0558       0.946
##  5      0.425      0.0837       0.924
##  6      0.431      0.114        0.902
##  7      0.436      0.133        0.889
##  8      0.437      0.172        0.869
##  9      0.442      0.206        0.851
## 10      0.444      0.238        0.826
## # … with 26 more rows
```

---

## Area under the curve

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-99-1.png" width="80%" style="display: block; margin: auto;" />
]

* AUC = 1: perfect classifier

* In general, an AUC above 0.8 is considered "good"
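
To see where these reference values come from, AUC can be computed by hand as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The base R sketch below uses that rank-based formulation; `auc_rank()` is a hypothetical helper written for illustration, not yardstick's implementation, and the scores are made up.

```r
# AUC as the probability that a randomly chosen positive case gets a
# higher score than a randomly chosen negative case (ties count half).
auc_rank <- function(truth, score, positive) {
  is_pos <- truth == positive
  r <- rank(score) # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c("Fail", "Fail", "Fail", "Pass", "Pass", "Pass")

# Scores that rank every "Fail" above every "Pass": a perfect classifier
auc_perfect <- auc_rank(truth, c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1), "Fail")

# Identical scores for everyone carry no information
auc_constant <- auc_rank(truth, rep(0.5, 6), "Fail")

c(perfect = auc_perfect, constant = auc_constant)
```

A perfect ranking gives an AUC of 1, while scores that carry no information give 0.5, the usual baseline for comparison.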

---

## ⏱ Your turn 4

Add `autoplot()` to visualize the ROC curve.

---

```r
tree_preds %>% 
  collect_predictions() %>% 
  roc_curve(truth = test, .pred_Fail) %>% 
  autoplot()
```