Published - Sat, 15 Apr 2023
Topic: Introduction to nycflights13 package and tidy data
Recall the nycflights13 package we introduced in Section 1.4 with data about all domestic flights departing from New York City in 2013. The package contains several data frames. Let's take a look at the flights data frame.
# Load nycflights13 package
library(nycflights13)
# View the flights data frame
View(flights)
We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 19 columns corresponding to a different characteristic or measurement of that flight. This satisfies the first two criteria of the definition of “tidy” data from Subsection 4.2.1: “Each variable forms a column” and “Each observation forms a row.”
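Rather than reading the row and column counts off the viewer, we can ask R for them directly. The following is a minimal sketch, assuming the nycflights13 and dplyr packages are installed:

```r
library(nycflights13)
library(dplyr)

# Number of rows (one per flight) and columns (one per variable)
nrow(flights)
ncol(flights)

# Compact overview: one line per variable, with its type and first few values
glimpse(flights)
```

glimpse() is often more convenient than View() when working outside RStudio, since it prints to the console.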
The nycflights13 package also contains four other data frames whose rows represent different observational units: airlines, planes, weather, and airports.
The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, i.e., the same data frame.
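To see this organization for ourselves, we can list the data sets the package ships with and compare their sizes. This is only a sketch; it assumes the five data frames are named flights, airlines, planes, weather, and airports, as in current releases of nycflights13:

```r
library(nycflights13)

# Names of all data sets bundled with the package
data(package = "nycflights13")$results[, "Item"]

# Each table has its own observational unit, so the row counts differ
tables <- list(
  flights  = flights,
  airlines = airlines,
  planes   = planes,
  weather  = weather,
  airports = airports
)
sapply(tables, nrow)
```

Differing row counts are a quick hint that the tables describe different kinds of things (flights, airline companies, aircraft, and so on) rather than the same observations split arbitrarily.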
In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (in other words, one in “wide” format) to a data frame that is in “tidy” format (in other words, one in “long/narrow” format). We’ll do this using the gather() function from the tidyr package again.
Furthermore, we’ll make use of functions from the ggplot2 and dplyr packages to produce a time-series plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala.
Let’s use the dem_score data frame we imported in Section 4.1, but focus on only data corresponding to Guatemala.
# Load required packages
library(dplyr)
library(tidyr)
library(ggplot2)
# Select only data corresponding to Guatemala
guat_dem <- dem_score %>%
  filter(country == "Guatemala")
# View guat_dem
guat_dem
We can see that guat_dem is not in “tidy” format. We need to take the values of the columns corresponding to years in guat_dem and convert them into a new “key” variable called year. Furthermore, we need to take the democracy score values in the inside of the data frame and turn them into a new “value” variable called democracy_score.
Our resulting data frame will thus have three columns: country, year, and democracy_score. Recall that the gather() function in the tidyr package can complete this task for us:
guat_dem_tidy <- guat_dem %>%
  gather(key = year, value = democracy_score, -country)
# View guat_dem_tidy
guat_dem_tidy
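If you don’t have dem_score at hand, the same reshaping can be tried on a small made-up data frame. In the sketch below, toy_wide, the country name, and all scores are hypothetical and chosen only for illustration; the gather() call mirrors the one above:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for guat_dem: one row, one column per year.
# The country and scores are invented for illustration only.
toy_wide <- data.frame(
  country = "Exampleland",
  `1952` = 2, `1957` = -6, `1962` = -5,
  check.names = FALSE  # keep the year column names exactly as written
)

# Wide to long: column names go into `year`, cell values into `democracy_score`
toy_tidy <- toy_wide %>%
  gather(key = year, value = democracy_score, -country)

# Inspect the structure of the result
str(toy_tidy)
```

The result has one row per country-year pair. Note that gather() stores the former column names as character strings by default, so year usually needs as.numeric() before it can serve as a continuous axis in a plot.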
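We set the arguments to gather() as follows: key = year names the new variable that will hold the original column names (the years), value = democracy_score names the new variable that will hold the values from the inside of the data frame, and -country tells gather() to leave the country column out of the reshaping.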
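With the data in long format, the time-series plot mentioned earlier takes only a few lines. The following is a sketch rather than the definitive version: it uses a hypothetical long-format data frame (toy_tidy, with invented values) in the same shape as guat_dem_tidy, and it assumes that year is stored as a character string and must be converted to a number first:

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format data in the shape of guat_dem_tidy;
# the country and scores are invented for illustration only.
toy_tidy <- data.frame(
  country = "Exampleland",
  year = c("1952", "1957", "1962", "1967"),
  democracy_score = c(2, -6, -5, 3)
)

toy_plot <- toy_tidy %>%
  mutate(year = as.numeric(year)) %>%  # character years to numeric for the x-axis
  ggplot(aes(x = year, y = democracy_score)) +
  geom_line() +
  labs(x = "Year", y = "Democracy score")

# Printing the object draws the plot
toy_plot
```

Replacing toy_tidy with guat_dem_tidy should give the corresponding plot for Guatemala, provided its columns carry the same names.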