Practice getting data from the Twitter API
library(tidyverse)
library(rtweet)
set.seed(1234)
theme_set(theme_minimal())
Run the code below in your console to download this exercise as a set of R scripts.
usethis::use_course("cis-ds/getting-data-from-the-web-api-access")
There are several packages for R for accessing and searching Twitter. Twitter actually has two separate APIs:
- The REST API - this allows you programmatic access to read and write Twitter data. For research purposes, this allows you to search the recent history of tweets and look up specific users.
- The Streaming API - this allows you to access the public data flowing through Twitter in real-time. It requires your R session to be running continuously, but allows you to capture a much larger sample of tweets while avoiding rate limits for the REST API.
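As a rough illustration of the difference between the two APIs, the sketch below wraps one call to each, using rtweet's search_tweets() and stream_tweets(). The wrapper names, the query strings, and the 30-second timeout are illustrative assumptions. Both calls need an authenticated session and network access, so nothing is requested until you call the wrappers yourself.

```r
# Hypothetical helper: ask the REST API for recent tweets matching a query.
search_recent <- function(query = "#rstats", n = 100) {
  rtweet::search_tweets(q = query, n = n, include_rts = FALSE)
}

# Hypothetical helper: listen to the Streaming API for `seconds` seconds.
# The R session has to stay running for the whole duration.
stream_live <- function(query = "rstats", seconds = 30) {
  rtweet::stream_tweets(q = query, timeout = seconds)
}
```

Calling search_recent() returns as soon as the request completes, whereas stream_live() blocks the session until the timeout expires.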
Using rtweet
Here, we are going to practice using the rtweet package to search Twitter.
library(rtweet)
OAuth authentication
All you need is a Twitter account (user name and password) and you can be up and running in minutes!
Simply send a request to Twitter’s API (with a function like search_tweets(), get_timeline(), get_followers(), get_favorites(), etc.) during an interactive session of R, authorize the embedded rstats2twitter app (approve the browser popup), and your token will be created and stored for future sessions.
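If you would rather trigger the authorization step explicitly than wait for the first request, a minimal sketch might look like the following. It assumes rtweet 1.0 or later, where auth_setup_default() is available; the wrapper name ensure_twitter_auth() is hypothetical.

```r
# Hypothetical wrapper: run the browser-based authorization once, up front.
# Assumes rtweet >= 1.0.0, which provides auth_setup_default().
ensure_twitter_auth <- function() {
  if (!interactive()) {
    stop("Browser-based authorization needs an interactive R session.")
  }
  # Opens the browser popup and saves the resulting token for later sessions.
  rtweet::auth_setup_default()
}
```

Run ensure_twitter_auth() once from the console; later sessions should be able to reuse the saved token.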
Searching tweets
To find 3000 recent tweets using the “rstats” hashtag:
rt <- search_tweets(
  q = "#rstats",
  n = 3000,
  include_rts = FALSE
)
rt
## # A tibble: 3,000 × 90
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 4263007693 1496929184542867462 2022-02-24 19:25:03 gp_pulipaka "Top 1… Buffer
## 2 4263007693 1496711514069295107 2022-02-24 05:00:07 gp_pulipaka "6 Fre… Buffer
## 3 4263007693 1496696497039200256 2022-02-24 04:00:26 gp_pulipaka "A Lis… Buffer
## 4 4263007693 1496707222461718530 2022-02-24 04:43:03 gp_pulipaka "AI Be… Buffer
## 5 4263007693 1496695646891483145 2022-02-24 03:57:04 gp_pulipaka "#AI B… Buffer
## 6 4263007693 1496710741445025796 2022-02-24 04:57:02 gp_pulipaka "#AI B… Buffer
## 7 4263007693 1496172703681859584 2022-02-22 17:19:04 gp_pulipaka "The C… Buffer
## 8 4263007693 1496340562055778305 2022-02-23 04:26:05 gp_pulipaka "Autom… Buffer
## 9 4263007693 1496161379690106894 2022-02-22 16:34:04 gp_pulipaka "The C… Buffer
## 10 4263007693 1496303058275508228 2022-02-23 01:57:03 gp_pulipaka "#AI B… Buffer
## # … with 2,990 more rows, and 84 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>, …
- q - the search query
- n - maximum number of tweets to be returned
- include_rts = FALSE - exclude retweets generated by Twitter’s built-in “retweet” function. We only want original tweets.
The resulting object is a tibble data frame with one row for each tweet. The data frame contains the full text of the tweet (text), the username of the poster (screen_name), as well as a wealth of metadata.
Note that the Twitter REST API limits all searches to the past 6-9 days. You will not retrieve any earlier results.
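Because the search result is an ordinary tibble, the usual dplyr verbs apply to it. The sketch below uses a small hand-made tibble (rt_demo, with invented users and text) as a stand-in for real search results, so it runs without touching the API. Only the column names screen_name, text, and created_at are taken from the output above.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the tibble returned by search_tweets();
# the users and tweet text are invented for illustration.
rt_demo <- tibble(
  screen_name = c("alice", "bob", "alice", "carol", "alice", "bob"),
  text = c("tweet 1", "tweet 2", "tweet 3", "tweet 4", "tweet 5", "tweet 6"),
  created_at = as.POSIXct(
    c(
      "2022-02-24 19:25:03", "2022-02-24 05:00:07", "2022-02-23 04:26:05",
      "2022-02-23 01:57:03", "2022-02-22 17:19:04", "2022-02-22 16:34:04"
    ),
    tz = "UTC"
  )
)

# Which accounts appear most often in the sample?
top_posters <- rt_demo %>%
  count(screen_name, sort = TRUE)

# How far back does the sample reach?
date_range <- range(rt_demo$created_at)
```

On real search results, checking the range of created_at is a quick way to see the search window for yourself.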
Searching users
Use get_timeline() to retrieve tweets from one or more specified Twitter users. This only works for users with public profiles or those that have authorized your app.
countvoncount <- get_timeline(user = "countvoncount", n = 4000)
countvoncount
## # A tibble: 3,250 × 43
## created_at id id_str full_text truncated display_text_ra…
## <dttm> <dbl> <chr> <chr> <lgl> <dbl>
## 1 2022-07-10 13:21:59 1.55e18 15461980612… Three th… FALSE 39
## 2 2022-07-09 13:21:57 1.55e18 15458356669… Three th… FALSE 50
## 3 2022-07-08 15:21:56 1.55e18 15455034720… Three th… FALSE 38
## 4 2022-07-08 09:21:55 1.55e18 15454128729… Three th… FALSE 38
## 5 2022-07-07 15:21:54 1.55e18 15451410766… Three th… FALSE 34
## 6 2022-07-06 12:21:52 1.54e18 15447333824… Three th… FALSE 39
## 7 2022-07-06 08:21:52 1.54e18 15446729831… Three th… FALSE 50
## 8 2022-07-05 17:21:51 1.54e18 15444464861… Three th… FALSE 40
## 9 2022-07-05 08:21:50 1.54e18 15443105881… Three th… FALSE 38
## 10 2022-07-04 19:21:49 1.54e18 15441142913… Three th… FALSE 39
## # … with 3,240 more rows, and 37 more variables: entities <list>, source <chr>,
## # in_reply_to_status_id <lgl>, in_reply_to_status_id_str <lgl>,
## # in_reply_to_user_id <lgl>, in_reply_to_user_id_str <lgl>,
## # in_reply_to_screen_name <lgl>, geo <list>, coordinates <list>,
## # place <list>, contributors <lgl>, is_quote_status <lgl>,
## # retweet_count <int>, favorite_count <int>, favorited <lgl>,
## # retweeted <lgl>, lang <chr>, possibly_sensitive <list>, …
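With get_timeline(), you are not limited to only the most recent 6-9 days of tweets. There is still a ceiling, however: the call above requested 4,000 tweets but returned 3,250, and Twitter's API returns at most roughly the 3,200 most recent tweets from any one user's timeline.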
Visualizing tweets
Because the resulting objects are data frames, you can perform standard data transformation, summarization, and visualization on the underlying data.
rtweet includes the ts_plot() function which automates some common time series visualization methods. For example, we can quickly visualize the frequency of @countvoncount tweets:
ts_plot(countvoncount, by = "1 week")
The by argument allows us to aggregate over different lengths of time.
ts_plot(countvoncount, by = "1 month")
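To make the aggregation step concrete, here is a minimal sketch of weekly binning done by hand with dplyr and lubridate. It is only conceptually similar to what ts_plot() does: the demo timestamps are invented, timeline_demo is a hypothetical stand-in for a timeline tibble, and the bin boundaries ts_plot() chooses may differ from the Monday-based weeks assumed here.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical stand-in for a timeline tibble; timestamps are invented.
timeline_demo <- tibble(
  created_at = as.POSIXct(
    c(
      "2022-07-04 19:21:49", "2022-07-05 08:21:50", "2022-07-05 17:21:51",
      "2022-07-06 08:21:52", "2022-07-10 13:21:59", "2022-07-12 09:00:00"
    ),
    tz = "UTC"
  )
)

# Round each timestamp down to the start of its week (weeks begin on Monday),
# then count the tweets that fall in each week.
weekly_counts <- timeline_demo %>%
  mutate(week = floor_date(created_at, unit = "week", week_start = 1)) %>%
  count(week)
```

Swapping unit = "week" for unit = "month" gives the monthly equivalent, mirroring the change from by = "1 week" to by = "1 month".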
And because ts_plot() uses ggplot2, we can modify the graphs using familiar ggplot2 functions:
ts_plot(countvoncount, by = "1 week") +
  theme(plot.title = element_text(face = "bold")) +
  labs(
    x = NULL, y = NULL,
    title = "Frequency of @countvoncount Twitter posts",
    subtitle = "Twitter status (tweet) counts aggregated using one week intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )
Exercise: Practice using rtweet
Find the 1000 most recent tweets by Katy Perry, Kim Kardashian, and Rihanna.
Visualize their tweet frequency by week. Who posts most often? Who posts least often?
Solution
katy_perry <- get_timeline(user = "katyperry", n = 1000)
kim_kardashian <- get_timeline(user = "KimKardashian", n = 1000)
rihanna <- get_timeline(user = "rihanna", n = 1000)

# combine the three timelines, group by user, and plot weekly tweet frequency
bind_rows(
  `Katy Perry` = katy_perry %>% select(created_at),
  `Kim Kardashian` = kim_kardashian %>% select(created_at),
  `Rihanna` = rihanna %>% select(created_at),
  .id = "screen_name"
) %>%
  group_by(screen_name) %>%
  ts_plot(by = "weeks")
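The plot answers "who posts most often?" visually. A numeric summary can back it up. The sketch below is one possible approach, run on invented data: timelines_demo and the user labels are hypothetical stand-ins for the combined data frame built in the solution, and weeks are assumed to start on Monday.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical stand-in for the combined timelines; all values are invented.
timelines_demo <- tibble(
  screen_name = c(rep("user_a", 3), rep("user_b", 5), rep("user_c", 2)),
  created_at = as.POSIXct(
    c(
      "2022-07-04 10:00:00", "2022-07-06 10:00:00", "2022-07-12 10:00:00",
      "2022-07-04 09:00:00", "2022-07-05 09:00:00", "2022-07-07 09:00:00",
      "2022-07-11 09:00:00", "2022-07-13 09:00:00",
      "2022-07-05 12:00:00", "2022-07-14 12:00:00"
    ),
    tz = "UTC"
  )
)

# Count tweets per user per week, then average those weekly counts per user.
# Caveat: weeks in which a user posted nothing are absent from the counts,
# so the average covers active weeks only.
posting_rate <- timelines_demo %>%
  mutate(week = floor_date(created_at, unit = "week", week_start = 1)) %>%
  count(screen_name, week, name = "tweets") %>%
  group_by(screen_name) %>%
  summarize(tweets_per_week = mean(tweets)) %>%
  arrange(desc(tweets_per_week))
```

The first row of posting_rate is the most frequent poster and the last row the least frequent.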
Acknowledgments
- This page is derived in part from “UBC STAT 545A and 547M”, licensed under the CC BY-NC 3.0 Creative Commons License.
- OAuth token storage derived from “Obtaining and using access tokens”.
Session Info
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os macOS Monterey 12.3
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2022-08-22
## pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.2.0)
## backports 1.4.1 2021-12-13 [2] CRAN (R 4.2.0)
## blogdown 1.10 2022-05-10 [2] CRAN (R 4.2.0)
## bookdown 0.27 2022-06-14 [2] CRAN (R 4.2.0)
## broom 1.0.0 2022-07-01 [2] CRAN (R 4.2.0)
## bslib 0.4.0 2022-07-16 [2] CRAN (R 4.2.0)
## cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
## cellranger 1.1.0 2016-07-27 [2] CRAN (R 4.2.0)
## cli 3.3.0 2022-04-25 [2] CRAN (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.2.0)
## crayon 1.5.1 2022-03-26 [2] CRAN (R 4.2.0)
## DBI 1.1.3 2022-06-18 [2] CRAN (R 4.2.0)
## dbplyr 2.2.1 2022-06-27 [2] CRAN (R 4.2.0)
## digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [2] CRAN (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
## evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.1)
## fansi 1.0.3 2022-03-24 [2] CRAN (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [2] CRAN (R 4.2.0)
## fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
## gargle 1.2.0 2021-07-02 [2] CRAN (R 4.2.0)
## generics 0.1.3 2022-07-05 [2] CRAN (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [2] CRAN (R 4.2.0)
## glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
## googledrive 2.0.0 2021-07-08 [2] CRAN (R 4.2.0)
## googlesheets4 1.0.0 2021-07-21 [2] CRAN (R 4.2.0)
## gtable 0.3.0 2019-03-25 [2] CRAN (R 4.2.0)
## haven 2.5.0 2022-04-15 [2] CRAN (R 4.2.0)
## here 1.0.1 2020-12-13 [2] CRAN (R 4.2.0)
## hms 1.1.1 2021-09-26 [2] CRAN (R 4.2.0)
## htmltools 0.5.3 2022-07-18 [2] CRAN (R 4.2.0)
## httr 1.4.3 2022-05-04 [2] CRAN (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [2] CRAN (R 4.2.0)
## knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.2.0)
## lubridate 1.8.0 2021-10-07 [2] CRAN (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.0)
## modelr 0.1.8 2020-05-19 [2] CRAN (R 4.2.0)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 4.2.0)
## pillar 1.8.0 2022-07-18 [2] CRAN (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
## purrr * 0.3.4 2020-04-17 [2] CRAN (R 4.2.0)
## R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
## readr * 2.1.2 2022-01-30 [2] CRAN (R 4.2.0)
## readxl 1.4.0 2022-03-28 [2] CRAN (R 4.2.0)
## reprex 2.0.1.9000 2022-08-10 [1] Github (tidyverse/reprex@6d3ad07)
## rlang 1.0.4 2022-07-12 [2] CRAN (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [2] CRAN (R 4.2.0)
## rprojroot 2.0.3 2022-04-02 [2] CRAN (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [2] CRAN (R 4.2.0)
## rtweet * 1.0.2.9005 2022-08-15 [1] Github (ropensci/rtweet@39eecff)
## rvest 1.0.2 2021-10-16 [2] CRAN (R 4.2.0)
## sass 0.4.2 2022-07-16 [2] CRAN (R 4.2.0)
## scales 1.2.0 2022-04-13 [2] CRAN (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
## stringi 1.7.8 2022-07-11 [2] CRAN (R 4.2.0)
## stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.2.0)
## tibble * 3.1.8 2022-07-22 [2] CRAN (R 4.2.0)
## tidyr * 1.2.0 2022-02-01 [2] CRAN (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.2.0)
## tidyverse * 1.3.2 2022-07-18 [2] CRAN (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [2] CRAN (R 4.2.0)
## utf8 1.2.2 2021-07-24 [2] CRAN (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.2.0)
## withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
## xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
## xml2 1.3.3 2021-11-30 [2] CRAN (R 4.2.0)
## yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
##
## [1] /Users/soltoffbc/Library/R/arm64/4.2/library
## [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────