HW03: Wrangling and visualizing data
Overview
Due by 11:59pm on September 20th.
The goal of this assignment is to practice wrangling and exploring data in a research context.
Accessing the hw03 repository
Go here and find your copy of the hw03 repository. It follows the naming convention hw03-<USERNAME>. Clone the repository to your computer.
Part 1: Tidying messy data
In the rcis package, there is a data frame called dadmom.
## # A tibble: 3 × 5
## famid named incd namem incm
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 Bill 30000 Bess 15000
## 2 2 Art 22000 Amy 18000
## 3 3 Paul 25000 Pat 50000
Tidy this data frame so that it adheres to the tidy data principles:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Complete this task using tidyr functions. Code which does not use tidyr functions is acceptable, but will not merit a “check plus” on your evaluation.
Once you have tidied the data frame, generate a plot using the exact code below.
ggplot(data = dadmom_tidy, mapping = aes(x = parent, y = inc)) +
  geom_point() +
  geom_line(mapping = aes(group = famid)) +
  scale_y_continuous(labels = scales::dollar) +
  labs(
    title = "Gender parity and household income",
    subtitle = "Each line identifies a distinct family",
    x = "Mom or Dad",
    y = "Income",
  ) +
  theme_minimal()
If you tidied the data frame correctly, then you will not have to make any changes to this code.
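As a rough illustration of the kind of reshaping tidyr supports, the sketch below tidies a small made-up data frame in which each column name encodes both a variable and a year. The data frame `wide`, its column names, and its values are hypothetical and are not part of the assignment data; the point is only to show how `pivot_longer()` can split column names into several pieces at once.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per person, with two measurements
# (height, weight) recorded in two years and packed into column names.
wide <- tibble(
  id = c(1, 2),
  height_2019 = c(150, 160),
  height_2020 = c(152, 161),
  weight_2019 = c(50, 60),
  weight_2020 = c(51, 62)
)

# Split each column name at "_". The special ".value" entry says the
# first piece names an output column (height, weight), while the second
# piece is stored in a new "year" column.
long <- pivot_longer(
  wide,
  cols = -id,
  names_to = c(".value", "year"),
  names_sep = "_"
)

print(long)
```

The result has one row per person per year, with `height` and `weight` as separate columns. When column names have no separator, the `names_pattern` argument accepts a regular expression in place of `names_sep`.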
Part 2: Wrangling and visualizing messy(ish) data
The Supreme Court Database contains detailed information on every published decision of the U.S. Supreme Court since its creation in 1791. It is perhaps the most utilized database in the study of judicial politics.
In the hw03 repository, you will find two data files:
- scdb-case.csv
- scdb-vote.csv
These contain the exact same data you would obtain if you downloaded the files from the original website, but reformatted to be stored as relational data files. That is, scdb-case.csv contains all case-level variables, whereas scdb-vote.csv contains all vote-level variables.
The data is structured in a tidy fashion.
- scdb-case.csv contains one row for every case and one column for every variable
- scdb-vote.csv contains one row for every vote by a justice in every case and one column for every variable
The current dataset contains information on every case decided from the 1791-2020 terms.[1] There are several ID variables which can be used to join the data frames, specifically caseId, docketId, caseIssuesId, and term. Substantively all are irrelevant for the tasks below except for term. Variables you will want to familiarize yourself with include:
chief
dateDecision
decisionDirection
decisionType
declarationUncon
direction
issueArea
justice
justiceName
majority
majVotes
minVotes
term
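Because case-level and vote-level variables live in separate files, many questions require combining the two. The sketch below shows one way this might look, using two tiny stand-in data frames instead of the real files. The rows and values are made up for illustration; only the column names come from the variable list above. With the real data, `cases` and `votes` would presumably come from something like `readr::read_csv("scdb-case.csv")` and `readr::read_csv("scdb-vote.csv")`.

```r
library(dplyr)
library(tibble)

# Stand-in for scdb-case.csv: one row per case (values are made up).
cases <- tibble(
  caseId = c("A-001", "A-002"),
  term = c(2000, 2001),
  majVotes = c(5, 9),
  minVotes = c(4, 0)
)

# Stand-in for scdb-vote.csv: one row per justice per case (made up).
votes <- tibble(
  caseId = c("A-001", "A-001", "A-002"),
  term = c(2000, 2000, 2001),
  justiceName = c("JusticeA", "JusticeB", "JusticeA"),
  majority = c(2, 1, 2)
)

# Attach the case-level columns to every vote, matching on the shared
# ID variables so that no duplicated ".x"/".y" columns appear.
joined <- left_join(votes, cases, by = c("caseId", "term"))

print(joined)
```

Starting from `votes` keeps one row per vote, which suits justice-level questions; starting from `cases` would suit case-level questions that do not need individual votes at all.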
Once you import the data files, use your data wrangling and visualization skills to answer the following questions:
1. What percentage of cases in each term are decided by a one-vote margin (e.g. 5-4, 4-3, etc.)?
2. For justices currently serving on the Supreme Court, how often have they voted in the conservative direction in cases involving criminal procedure, civil rights, economic activity, and federal taxation?
   - Organize the resulting graph by justice in descending order of seniority. Note that the chief justice is always considered the most senior member of the court, regardless of appointment date.
3. In each term, how many of the term’s published decisions (decided after oral arguments) were announced in a given month?
   - You may want to skim/read chapter 16 in R for Data Science as it discusses working with dates and times using the lubridate package
   - Let me emphasize: you want to skim/read chapter 16 in R for Data Science as it discusses working with dates and times using the lubridate package
   - Also note, the Supreme Court’s calendar runs on the federal government’s fiscal year. That means the first month of the court’s term is October, running through September of the following calendar year.
4. Which justices are most likely to agree with the Court’s declaration that an act of Congress, a state or territorial law, or a municipal ordinance is unconstitutional?
   - Identify all cases where the Court declared something unconstitutional and determine the ten justices who most and least frequently agreed with this outcome as a percentage of all votes cast by the justice in these cases
   - Exclude any justice with fewer than 30 votes in cases where the Court’s outcome declares something unconstitutional
5. For each term he served on the Court, in what percentage of cases was Justice Antonin Scalia in the majority?
6. Create a graph similar to #5 that compares the percentage for all cases versus non-unanimous cases (i.e. there was at least one dissenting vote)
7. In each term, what percentage of cases were decided in the conservative direction?
8. The Chief Justice is frequently seen as capable of influencing the ideological direction of the Court. Create a graph similar to #7 that also incorporates information on who was the Chief Justice during the term.
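Since the Court’s calendar starts in October, plots by month usually read better when October comes first. The sketch below shows one possible way to parse dates with lubridate and order months by their position within a term. The example dates are made up, and the month/day/year string format is an assumption; check how dateDecision is actually stored in your files before choosing a parser.

```r
library(lubridate)

# Made-up decision dates, assumed here to be month/day/year strings.
dates <- mdy(c("10/5/2020", "1/12/2021", "6/28/2021"))

# Calendar month number: January = 1, ..., December = 12.
month_num <- month(dates)

# Position within a term that begins in October:
# October = 1, November = 2, ..., September = 12.
term_month <- (month_num - 10) %% 12 + 1

# Month labels as a factor whose levels follow the term calendar,
# so that plots and tables list October first.
term_label <- factor(
  month.abb[month_num],
  levels = month.abb[c(10:12, 1:9)]
)

print(data.frame(dates, month_num, term_month, term_label))
```

Using `month.abb` for the labels keeps the level order explicit and independent of the locale, whereas `month(dates, label = TRUE)` returns a factor ordered from January.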
Submit the assignment
Your assignment should be submitted as two RMarkdown documents using the gfm (GitHub Flavored Markdown) format. Follow instructions on homework workflow.
Rubric
Needs improvement: Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed. No record of commits other than the final push to GitHub.
Satisfactory: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.
Excellent: Finished all components of the assignment correctly and used efficient code to complete the exercises. Code is well-documented (both self-documented and with additional comments as necessary). Graphs and tables are properly labeled. Uses multiple commits to back up and show a progression in the work. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.
[1] Terms run from October through June, so the 2020 term contains cases decided from October 2020 - June 2021.