Topic: Introduction to nycflights13 package and tidy data
Recall the nycflights13 package we introduced in Section 1.4 with data about all domestic flights departing from New York City in 2013. The package contains several data frames. Let's take a look at the flights data frame.
# Load nycflights13 package
library(nycflights13)
# View the flights data frame
View(flights)
We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 19 columns corresponding to a different characteristic or measurement of that flight. This satisfies the first two criteria of the definition of “tidy” data from Subsection 4.2.1: that “Each variable forms a column” and “Each observation forms a row.”
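The shape described above can be checked directly from the console. The following is a minimal sketch, assuming only that the nycflights13 package is installed:

```r
library(nycflights13)

# Number of rows (one per flight) and columns (one per variable)
flights_dim <- dim(flights)
flights_dim

# Column names: each names one variable measured on a flight
names(flights)
```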
The nycflights13 package also contains other data frames with their rows representing different observational units:
- airlines: translation between two-letter IATA carrier codes and airline company names (16 in total); that is, the observational unit is an airline company.
- planes: aircraft information about each of the 3,322 planes used; that is, the observational unit is an aircraft.
- weather: hourly meteorological data (about 8,705 observations) for each of the three NYC airports; that is, the observational unit is an hourly measurement of weather at one of the three airports.
- airports: airport names and locations; that is, the observational unit is an airport.
The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, that is, in the same data frame.
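To illustrate how separate tables with different observational units still work together, here is a minimal sketch assuming the nycflights13 and dplyr packages are installed. The object names row_counts and flights_named are made up for this example; the lookup uses the carrier column that flights and airlines share.

```r
library(nycflights13)
library(dplyr)

# Rows per data frame: each table has its own observational unit
row_counts <- sapply(
  list(flights = flights, airlines = airlines, planes = planes,
       weather = weather, airports = airports),
  nrow
)
row_counts

# Hourly weather observations recorded at each of the three airports
weather %>%
  count(origin)

# Look up the airline company name for each flight via the carrier code
flights_named <- flights %>%
  left_join(airlines, by = "carrier") %>%
  select(carrier, name, flight, origin, dest)
flights_named
```

Because each airline appears only once in airlines, the lookup adds a name column without changing the number of flights.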
Case study: Democracy in Guatemala
In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (in other words, one that is in “wide” format) to a data frame that is in “tidy” format (in other words, one that is in “long/narrow” format). We’ll again do this using the gather() function from the tidyr package.
Furthermore, we’ll make use of functions from the ggplot2 and dplyr packages to produce a time-series plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala.
Let’s use the dem_score data frame we imported in Section 4.1, but focus only on the data corresponding to Guatemala.
# Load required packages
library(dplyr)
library(tidyr)
library(ggplot2)
# Select only data corresponding to Guatemala
guat_dem <- dem_score %>%
  filter(country == "Guatemala")
# View guat_dem
guat_dem
We can see that guat_dem is not in “tidy” format. We need to take the names of the columns corresponding to years in guat_dem and convert them into the values of a new “key” variable called year. Furthermore, we need to take the democracy score values in the body of the data frame and turn them into a new “value” variable called democracy_score.
Our resulting data frame will thus have three columns: country, year, and democracy_score. Recall that the gather() function in the tidyr package can complete this task for us:
guat_dem_tidy <- guat_dem %>%
  gather(key = year, value = democracy_score, -country)
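To see the same reshaping on a self-contained example, here is a minimal sketch using a small made-up data frame. The objects wide and long, the country labels, and the scores are all hypothetical and are not taken from dem_score; only the gather() call mirrors the one above.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format data: one row per country, one column per year.
# These scores are invented for illustration only.
wide <- tibble(
  country = c("A", "B"),
  `1952` = c(2, -5),
  `1957` = c(3, -4)
)

# Stack the year columns: column names go into year,
# cell values go into democracy_score; country is left as is
long <- wide %>%
  gather(key = year, value = democracy_score, -country)
long

# Note that the former column names are stored as character strings in year
class(long$year)
```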
We set the arguments to gather() as follows:
- key is the name of the variable in the new data frame that will contain the column names of the original data frame