Improving data communication

---

# Setup

---

## Setup

```r
# load packages
library(tidyverse)
library(scales)
library(colorblindr)
library(coloratio)

# set default theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8, fig.asp = 0.618, fig.retina = 2, dpi = 150, out.width = "60%"
)

# dplyr print min and max
options(dplyr.print_max = 6, dplyr.print_min = 6)
```

---

## Flatten the curve

- [Why outbreaks like coronavirus spread exponentially, and how to "flatten the curve"](https://www.washingtonpost.com/graphics/2020/world/corona-simulator/)

- [COVID-19 Dashboard](https://coronavirus.jhu.edu/map.html)

---

## Accessible COVID-19 statistics tracker

---

# Accessibility and screen readers

---

## Alternative text

> It is read by screen readers in place of images allowing the content and function of the image to be accessible to those with visual or certain cognitive disabilities.
>
> It is displayed in place of the image in browsers if the image file is not loaded or when the user has chosen not to view images.
>
> It provides a semantic meaning and description to images which can be read by search engines or be used to later determine the content of the image from page context alone.

---

## Alt and surrounding text

```
CHART TYPE of TYPE OF DATA where REASON FOR INCLUDING CHART

+ Link to data source somewhere in the text
```

--
- `CHART TYPE`: Naming the chart type helps people with partial sight and gives context for understanding the rest of the visual.

--
- `TYPE OF DATA`: What data is included in the chart? The x and y axis labels may help you figure this out.

--
- `REASON FOR INCLUDING CHART`: Think about why you're including this visual. What does it show that's meaningful? There should be a point to every visual and you should tell people what to look for.

--
- `Link to data source`: Don't include this in your alt text, but it should be included somewhere in the surrounding text.

.footnote[
Source: [Writing Alt Text for Data Visualization](https://medium.com/nightingale/writing-alt-text-for-data-visualization-2a218ef43f81)
]
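
The template above can be sketched as a small helper. `make_alt_text()` is a hypothetical function, not part of knitr or any package used in these slides; it only pastes the three pieces together and leaves the data source link to the surrounding text.

```r
# Hypothetical helper: assemble alt text from the three template pieces.
make_alt_text <- function(chart_type, data_type, reason) {
  sprintf("%s of %s where %s.", chart_type, data_type, reason)
}

alt <- make_alt_text(
  chart_type = "A bar chart",
  data_type  = "total employed registered nurses in three states in 2000, 2010, and 2020",
  reason     = "the number of nurses increases over time in each state"
)
alt
```

The resulting string could then be passed to whatever mechanism attaches alt text to the figure.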

---

## Data

- Registered nurses by state and year
- Number of nurses, salaries, employment
- Source: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-10-05)

```r
nurses <- read_csv("data/nurses.csv") %>% janitor::clean_names()
glimpse(nurses)
```

```
## Rows: 1,242
## Columns: 22
## $ state                                        <chr> "Alabama", "Alaska", "Ari…
## $ year                                         <dbl> 2020, 2020, 2020, 2020, 2…
## $ total_employed_rn                            <dbl> 48850, 6240, 55520, 25300…
## $ employed_standard_error_percent              <dbl> 2.9, 13.0, 3.7, 4.2, 2.0,…
## $ hourly_wage_avg                              <dbl> 28.96, 45.81, 38.64, 30.6…
## $ hourly_wage_median                           <dbl> 28.19, 45.23, 37.98, 29.9…
## $ annual_salary_avg                            <dbl> 60230, 95270, 80380, 6364…
## $ annual_salary_median                         <dbl> 58630, 94070, 79010, 6233…
## $ wage_salary_standard_error_percent           <dbl> 0.8, 1.4, 0.9, 1.4, 1.0, …
## $ hourly_10th_percentile                       <dbl> 20.75, 31.50, 27.66, 21.4…
## $ hourly_25th_percentile                       <dbl> 23.73, 36.94, 32.58, 25.7…
## $ hourly_75th_percentile                       <dbl> 33.15, 53.31, 44.67, 35.4…
## $ hourly_90th_percentile                       <dbl> 38.67, 60.70, 50.14, 39.6…
## $ annual_10th_percentile                       <dbl> 43150, 65530, 57530, 4466…
## $ annual_25th_percentile                       <dbl> 49360, 76830, 67760, 5349…
## $ annual_75th_percentile                       <dbl> 68960, 110890, 92920, 736…
## $ annual_90th_percentile                       <dbl> 80420, 126260, 104290, 82…
## $ location_quotient                            <dbl> 1.20, 0.98, 0.91, 1.00, 0…
## $ total_employed_national_aggregate            <dbl> 140019790, 140019790, 140…
## $ total_employed_healthcare_national_aggregate <dbl> 8632190, 8632190, 8632190…
## $ total_employed_healthcare_state_aggregate    <dbl> 128600, 17730, 171010, 80…
## $ yearly_total_employed_state_aggregate        <dbl> 1903210, 296300, 2835110,…
```
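
Later slides plot a `nurses_subset` object whose definition is not shown. Judging by the charts' alt text, it presumably keeps California, New York, and North Carolina. The sketch below makes that assumption and uses a tiny stand-in for `nurses` so that it runs on its own: the Alabama value comes from the `glimpse()` output, and the other three are the approximate 2020 figures quoted in the bar chart's alt text.

```r
library(dplyr)

# Stand-in for the full `nurses` data so the example is self-contained.
nurses_demo <- tibble(
  state = c("Alabama", "California", "New York", "North Carolina"),
  year = 2020,
  total_employed_rn = c(48850, 300000, 170000, 100000) # last three are approximate
)

# Assumed definition: keep only the three states used in the charts.
nurses_subset <- nurses_demo %>%
  filter(state %in% c("California", "New York", "North Carolina"))

nurses_subset
```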

---

## Bar chart

.pull-left[
<img src="index_files/figure-html/nurses-bar-1.png" alt="The figure is a bar chart titled 'Total employed Registered Nurses' that   displays the numbers of registered nurses in three states (California, New York,   and North Carolina) over a 20 year period, with data recorded in three time points   (2000, 2010, and 2020). In each state, the numbers of registered nurses increase   over time. The following numbers are all approximate. California started off with   200K registered nurses in 2000, 240K in 2010, and 300K in 2020. New York had 150K   in 2000, 160K in 2010, and 170K in 2020. Finally North Carolina had 60K in 2000,   90K in 2010, and 100K in 2020." width="100%" style="display: block; margin: auto;" />
]
.pull-right[
- Provide the title and axis labels
- Briefly describe the chart and give a summary of any trends it displays
- Convert bar charts to accessible tables or lists
- Avoid describing visual attributes of the bars (e.g., dark blue, gray, yellow) unless there's an explicit need to do so 
]
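
One way to offer a bar chart's data as an accessible table is to reshape it to one row per state and print it with `knitr::kable()`. This is a sketch only: the data frame `nurses_approx` is made up for illustration from the approximate values quoted in the alt text, not from the real data.

```r
library(tidyr)
library(knitr)

# Approximate values taken from the alt text, not the exact figures.
nurses_approx <- data.frame(
  state = rep(c("California", "New York", "North Carolina"), each = 3),
  year = rep(c(2000, 2010, 2020), times = 3),
  total_employed_rn = c(200000, 240000, 300000,
                        150000, 160000, 170000,
                        60000, 90000, 100000)
)

# One row per state, one column per year.
nurses_wide <- pivot_wider(
  nurses_approx,
  names_from = year,
  values_from = total_employed_rn
)

kable(nurses_wide, caption = "Total employed Registered Nurses (approximate)")
```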

---

## Developing the alt text

- Total employed registered nurses in three states over time.

--
- Total employed registered nurses in California, New York, and North Carolina, in 2000, 2010, and 2020.

--
- A bar chart of total employed registered nurses in California, New York, and North Carolina, in 2000, 2010, and 2020, showing increasing numbers of nurses over time.

--
- The figure is a bar chart titled 'Total employed Registered Nurses' that displays the numbers of registered nurses in three states (California, New York, and North Carolina) over a 20 year period, with data recorded in three time points (2000, 2010, and 2020). In each state, the numbers of registered nurses increase over time. The following numbers are all approximate. California started off with 200K registered nurses in 2000, 240K in 2010, and 300K in 2020.  New York had 150K in 2000, 160K in 2010, and 170K in 2020. Finally North Carolina had 60K in 2000, 90K in 2010, and 100K in 2020.

---

## Incorporating alt text in R Markdown

- Use the [`fig.alt` `knitr` chunk option](https://www.rstudio.com/blog/knitr-fig-alt/)

````default
```{r}
#| fig.alt = "The figure is a bar chart titled 'Total employed Registered Nurses' that
#|    displays the numbers of registered nurses in three states (California, New York,
#|    and North Carolina) over a 20 year period, with data recorded in three time points
#|    (2000, 2010, and 2020). In each state, the numbers of registered nurses increase
#|    over time. The following numbers are all approximate. California started off with
#|    200K registered nurses in 2000, 240K in 2010, and 300K in 2020. New York had 150K
#|    in 2000, 160K in 2010, and 170K in 2020. Finally North Carolina had 60K in 2000,
#|    90K in 2010, and 100K in 2020."

nurses_subset %>%
  filter(year %in% c(2000, 2010, 2020)) %>%
  ggplot(aes(x = state, y = total_employed_rn, fill = factor(year))) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(option = "E") +
  scale_y_continuous(labels = label_number(scale = 1/1000, suffix = "K")) +
  labs(
    x = "State", y = "Number of Registered Nurses", fill = "Year",
    title = "Total employed Registered Nurses"
  ) +
  theme(
    legend.background = element_rect(fill = "white", color = "white"),
    legend.position = c(0.85, 0.75)
    )
```
````

---

## Line graph

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-8-1.png" alt="The figure is titled &quot;Annual median salary of Registered Nurses&quot;. There are three lines on the plot: the top labelled California, the middle New York, the bottom North Carolina. The vertical axis is labelled &quot;Annual median salary&quot;, beginning with $40K, up to $120K. The horizontal axis is labelled &quot;Year&quot;, beginning with a couple of years before 2000 up to 2020. The following numbers are all approximate. In the graph, the California line begins around $50K in 1998 and goes up to $120K in 2020. The increase is steady, except for stalling for about a couple of years between 2015 and 2017. The New York line also starts around $50K, just below where the California line starts, and steadily goes up to $90K. And the North Carolina line starts around $40K and steadily goes up to $70K." width="100%" style="display: block; margin: auto;" />
]
.pull-right[
- Provide the title and axis labels
- Briefly describe the graph and give a summary of any trends it displays
- Convert data represented in lines to accessible tables or lists where feasible
- Avoid describing visual attributes of the lines (e.g., purple, pink) unless there's an explicit need to do so
]

---

## Scatterplot

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-9-1.png" alt="The figure is titled &quot;Median hourly wage of Registered Nurses&quot;. It is a scatter plot with points for each of the 50 U.S. states from 1998 to 2008. The horizontal axis is labelled &quot;Unemployment rate&quot;, beginning around 2% up to 14%. The vertical axis is labelled &quot;Median hourly wage&quot;, beginning with amounts under $20 up to approximately $50. The pattern is hard to discern but appears to show a positive correlation between the variables. As unemployment rate increases the median hourly wage also slightly increases. There is more variability in median hourly wage for unemployment rates below 7%." width="100%" style="display: block; margin: auto;" />
]
.pull-right[
Scatter plots are among the more difficult graphs to describe, especially if there is a need to make specific data points accessible.

- Identify the image as a scatterplot
- Provide the title and axis labels
- Focus on the overall trend
- If it's necessary to be more specific, convert the data into an accessible table
]

---

## Recommended reading

[Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content](http://vis.csail.mit.edu/pubs/vis-text-model/)

Alan Lundgard, MIT CSAIL  
Arvind Satyanarayan, MIT CSAIL

IEEE Transactions on Visualization & Computer Graphics (Proceedings of IEEE VIS), 2021

>To demonstrate how our model can be applied to evaluate the effectiveness of visualization descriptions, we conduct a mixed-methods evaluation with 30 blind and 90 sighted readers, and find that these reader groups differ significantly on which semantic content they rank as most useful. Together, our model and findings suggest that access to meaningful information is strongly reader-specific, and that research in automatic visualization captioning should orient toward descriptions that more richly communicate overall trends and statistics, sensitive to reader preferences.

---

# Accessibility and colors

---

## Color scales

Use colorblind-friendly color scales (e.g., Okabe-Ito, viridis)

.panelset.sideways[
.panel[.panel-name[Code]

```r
nurses_subset %>%
  ggplot(aes(x = year, y = hourly_wage_median, color = state)) +
  geom_line(size = 2) +
* colorblindr::scale_color_OkabeIto() +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    x = "Year", y = "Median hourly wage", color = "State",
    title = "Median hourly wage of Registered Nurses"
  ) +
  theme(
    legend.position = c(0.15, 0.75),
    legend.background = element_rect(fill = "white", color = "white")
    )
```

]

]

---

## The default ggplot2 color scale

.panelset.sideways[
.panel[.panel-name[Original]
<img src="index_files/figure-html/default-ggplot2-1.png" width="100%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Vision-impaired]
<img src="index_files/figure-html/unnamed-chunk-11-1.png" width="100%" style="display: block; margin: auto;" />
]
]

---

## Testing for colorblind friendliness

- The best way to test is with users (or collaborators) who have color vision deficiencies

- `colorblindr::cvd_grid()`

- Simulation software also helps, e.g. Sim Daltonism for [Mac](https://michelf.ca/projects/sim-daltonism/) and [PC](https://pcmacstore.com/en/app/693112260/sim-daltonism)
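
A minimal sketch of `colorblindr::cvd_grid()`, assuming the colorblindr package is installed (it is distributed via GitHub rather than CRAN). The built-in `iris` data is only a stand-in for your own plot.

```r
library(ggplot2)
library(colorblindr)

# Any ggplot that maps a variable to color; `iris` ships with R.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Draw the plot as it might appear under several simulated
# color vision deficiencies.
cvd_plots <- cvd_grid(p)
cvd_plots
```

If groups that are distinct in the original become hard to tell apart in any of the simulated panels, that is a hint to change the palette or add a second encoding.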

---

## Color contrast

- Background and foreground text should have sufficient contrast to be distinguishable by users with different vision

- Web app for checking color contrast: [Color Contrast Analyser](https://www.tpgi.com/color-contrast-checker/)

- A work-in-progress R package for checking color contrast: [**coloratio**](https://matt-dray.github.io/coloratio)

```r
cr_get_ratio("black", "white")
```

```
## [1] 21
```

```r
cr_get_ratio("#FFFFFF", "#000000")
```

```
## [1] 21
```

```r
cr_get_ratio("black", "gray10")
```

```
## [1] 1.206596
```
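
For reference, ratios like these follow from the WCAG 2 contrast ratio definition: linearize each sRGB channel, compute relative luminance, then divide the lighter luminance plus 0.05 by the darker plus 0.05. The base R sketch below is independent of coloratio; `wcag_contrast()` is a hypothetical name, and coloratio's internals may differ.

```r
# Hypothetical helper implementing the WCAG 2 contrast ratio formula.
wcag_contrast <- function(col_1, col_2) {
  rel_luminance <- function(col) {
    # sRGB channels scaled to [0, 1]
    srgb <- as.vector(grDevices::col2rgb(col)) / 255
    # linearize each channel
    lin <- ifelse(srgb <= 0.03928, srgb / 12.92, ((srgb + 0.055) / 1.055)^2.4)
    # weighted sum of red, green, and blue
    sum(c(0.2126, 0.7152, 0.0722) * lin)
  }
  # lighter color goes in the numerator
  lum <- sort(c(rel_luminance(col_1), rel_luminance(col_2)), decreasing = TRUE)
  (lum[1] + 0.05) / (lum[2] + 0.05)
}

wcag_contrast("black", "white")
wcag_contrast("black", "gray10")
```

The two calls should agree with the coloratio output above up to rounding.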

---

## Double encoding

Use shape *and* color where possible
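
A minimal sketch of double encoding in ggplot2: the same variable is mapped to both `color` and `shape`. The built-in `iris` data is only a stand-in here.

```r
library(ggplot2)

# Map the same variable to both color and shape so that groups remain
# distinguishable even when the colors are not.
p_double <- ggplot(
  iris,
  aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)
) +
  geom_point(size = 3)

p_double
```

Because both aesthetics use the same variable, ggplot2 merges them into a single legend.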

---

## Use direct labeling

- Where color is used to display information, prefer direct labeling over a legend

- Quicker to read

- Ensures graph can be understood without reliance on color
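
One way to label lines directly in ggplot2 is to put a text label at each line's last observation and drop the legend. This is a sketch under assumptions: the data frame `salaries` is made up for illustration from the approximate start and end values quoted in the line graph's alt text.

```r
library(ggplot2)

# Approximate start and end values quoted in the line graph's alt text.
salaries <- data.frame(
  state = rep(c("California", "New York", "North Carolina"), each = 2),
  year = rep(c(1998, 2020), times = 3),
  annual_salary_median = c(50000, 120000, 50000, 90000, 40000, 70000)
)

# Label each line at its last observation instead of using a legend.
line_ends <- salaries[salaries$year == max(salaries$year), ]

p_labeled <- ggplot(salaries, aes(x = year, y = annual_salary_median, color = state)) +
  geom_line() +
  geom_text(data = line_ends, aes(label = state), hjust = 0, nudge_x = 0.5) +
  # leave room on the right for the labels
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.4))) +
  theme(legend.position = "none")

p_labeled
```

Since each label sits next to its line, the graph stays readable even if the colors cannot be told apart.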

---

## Without direct labeling

---

## With direct labeling

---

## Use whitespace or pattern to separate elements

- Separate elements with whitespace or pattern

- Allows for distinguishing between data without entirely relying on contrast between colors
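
In ggplot2, one simple way to add whitespace between bars is a white outline. This is a sketch: the data frame `nurses_approx` is made up for illustration from the approximate values quoted in the bar chart's alt text.

```r
library(ggplot2)

# Approximate values taken from the bar chart's alt text.
nurses_approx <- data.frame(
  state = rep(c("California", "New York", "North Carolina"), each = 3),
  year = rep(c(2000, 2010, 2020), times = 3),
  total_employed_rn = c(200000, 240000, 300000,
                        150000, 160000, 170000,
                        60000, 90000, 100000)
)

# A white outline puts a thin gap between neighboring bars, so they can
# be told apart without relying on color contrast alone.
p_gap <- ggplot(nurses_approx, aes(x = state, y = total_employed_rn, fill = factor(year))) +
  geom_col(position = "dodge", color = "white") +
  labs(x = "State", y = "Number of Registered Nurses", fill = "Year")

p_gap
```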

---

## Without whitespace

---

## With whitespace

---

# Accessibility and fonts

---

## Accessibility and fonts

- Use a font that has been tested for accessibility (e.g., [Atkinson Hyperlegible](https://brailleinstitute.org/freefont))

--
- Keep plot labels and annotations similar in size to the rest of your text (e.g., `ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))`)

---

## Accessibility and fonts

.panelset.sideways[
.panel[.panel-name[Code]

```r
*library(showtext)
*font_add_google(name = "Atkinson Hyperlegible")
*showtext_auto() # have showtext render text in subsequent plots

nurses_subset %>%
  ggplot(aes(x = year, y = hourly_wage_median, color = state)) +
  geom_line(size = 2) +
  colorblindr::scale_color_OkabeIto() +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    x = "Year", y = "Median hourly wage", color = "State",
    title = "Median hourly wage of Registered Nurses"
  ) +
* theme_minimal(
*   base_size = 16,
*   base_family = "Atkinson Hyperlegible"
* )
```

]

]

---

.footnote[
Source: [A Comprehensive Guide to Accessible Data Visualization](https://www.betterment.com/resources/accessible-data-visualization/)
]

---

## Acknowledgements

- COVID visualization examples:
  - The New York Times. [Flattening the Coronavirus Curve](https://www.nytimes.com/article/flatten-curve-coronavirus.html)
  - The Washington Post. [Why outbreaks like coronavirus spread exponentially, and how to "flatten the curve"](https://www.washingtonpost.com/graphics/2020/world/corona-simulator/)
  - [COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU)](https://coronavirus.jhu.edu/map.html)
  - T. Littlefield (2020) [COVID-19 Statistics Tracker](https://cvstats.net)

- Lundgard, Alan, and Arvind Satyanarayan. ["Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content."](https://ieeexplore.ieee.org/abstract/document/9555469) IEEE Transactions on Visualization and Computer Graphics (2021).

- [A Comprehensive Guide to Accessible Data Visualization](https://www.betterment.com/resources/accessible-data-visualization/)

- Silvia Canelón and Liz Hare. [Revealing Room for Improvement in Accessibility within a Social Media Data Visualization Learning Community](https://spcanelon.github.io/csvConf2021/slides/#1)