Published - Sat, 15 Apr 2023

Introduction to nycflights13 package and tidy data

Recall the nycflights13 package we introduced in Section 1.4 with data about all domestic flights departing from New York City in 2013. The package contains several data frames. Let's take a look at the flights data frame.

r
# Load the nycflights13 package
library(nycflights13)

# View the flights data frame
View(flights)

We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 19 columns corresponding to a different characteristic or measurement of each flight. This satisfies the first two criteria of the definition of “tidy” data from Subsection 4.2.1: that “Each variable forms a column” and “Each observation forms a row.”
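
One way to confirm this shape for yourself is to ask R for the dimensions directly. The following is a minimal sketch that assumes the nycflights13 package is already installed; dim() and names() are base R functions.

```r
library(nycflights13)

# Number of rows (flights) followed by number of columns (variables)
dim(flights)

# The column names, one per variable recorded for each flight
names(flights)
```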

The nycflights13 package also contains other data frames with their rows representing different observational units:

  • airlines: translation between two-letter IATA carrier codes and airline company names (16 in total). The observational unit is an airline company.
  • planes: aircraft information about each of the 3,322 planes used. The observational unit is an aircraft.
  • weather: hourly meteorological data (about 8,705 observations) for each of the three NYC airports. The observational unit is an hourly measurement of weather at one of the three airports.
  • airports: airport names and locations. The observational unit is an airport.

The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, i.e., the same data frame.
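
To make the third property concrete, here is a small sketch in base R. The two miniature tables and their values are invented for illustration (they are not taken from nycflights13); the point is only that each table holds one kind of observational unit and that a shared key column links them when needed.

```r
# Hypothetical miniature tables; all values are invented for illustration.
# One row per flight:
mini_flights <- data.frame(
  flight_id = 1:4,
  carrier   = c("AA", "UA", "AA", "DL"),
  dep_delay = c(5, -2, 30, 0)
)

# One row per airline company:
mini_airlines <- data.frame(
  carrier = c("AA", "DL", "UA"),
  name    = c("American Airlines Inc.", "Delta Air Lines Inc.", "United Air Lines Inc.")
)

# The shared key column `carrier` lets us combine the two tables on demand,
# so airline names do not have to be repeated inside the flights table.
flights_named <- merge(mini_flights, mini_airlines, by = "carrier")
flights_named
```

Keeping airline names in their own table means a name is stored once per airline rather than once per flight, which is the practical benefit of separating observational units.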

Case study: Democracy in Guatemala

In this section, we’ll show you another example of converting a data frame that isn’t in “tidy” format (that is, one in “wide” format) into a data frame that is in “tidy” format (that is, in “long/narrow” format). We’ll again do this using the gather() function from the tidyr package.

Furthermore, we’ll make use of functions from the ggplot2 and dplyr packages to produce a time-series plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala.

Let’s use the dem_score data frame we imported in Section 4.1, but focus on only data corresponding to Guatemala.

r
# Load required packages
library(dplyr)
library(tidyr)
library(ggplot2)

# Select only data corresponding to Guatemala
guat_dem <- dem_score %>%
  filter(country == "Guatemala")

# View guat_dem
guat_dem

We can see that guat_dem is not in “tidy” format: the years appear as column names rather than as values of a variable. We need to take the names of the columns corresponding to years in guat_dem and convert them into a new “key” variable called year. Furthermore, we need to take the democracy score values in the body of the data frame and turn them into a new “value” variable called democracy_score.

Our resulting data frame will thus have three columns: country, year, and democracy_score. Recall that the gather() function in the tidyr package can complete this task for us:

r
guat_dem_tidy <- guat_dem %>%
  gather(key = year, value = democracy_score, -country)

# View guat_dem_tidy
guat_dem_tidy

We set the arguments to gather() as follows:

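  • key is the name of the variable in the new data frame that will contain the column names of the original data frame, here year.
  • value is the name of the variable in the new data frame that will contain the values from the body of the original data frame, here democracy_score.
  • -country excludes the country column from the gathering, so it is kept as an identifying column and every other column is treated as a year.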
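
To try the reshaping without the dem_score file, the sketch below builds a tiny wide data frame by hand. The name toy_wide, the country labels, and the scores are all hypothetical and are not the real democracy scores. It also shows pivot_longer(), which the tidyr documentation recommends over gather() for new code; both calls should produce the same three columns, though possibly in a different row order.

```r
library(tidyr)

# Hypothetical wide data: one row per country, one column per year.
# The scores are made up for illustration and are NOT real dem_score values.
toy_wide <- data.frame(
  country = c("Country A", "Country B"),
  `1952`  = c(2, -5),
  `1957`  = c(3, -4),
  `1962`  = c(1, 0),
  check.names = FALSE  # keep the column names as "1952" rather than "X1952"
)

# gather(): every column except country becomes a (year, democracy_score) pair
toy_tidy <- gather(toy_wide, key = year, value = democracy_score, -country)

# The former column names arrive as character strings, so convert them
# to numbers before treating year as a time axis
toy_tidy$year <- as.numeric(toy_tidy$year)
toy_tidy

# The same reshaping expressed with the newer pivot_longer() interface
toy_tidy2 <- pivot_longer(
  toy_wide,
  cols      = -country,
  names_to  = "year",
  values_to = "democracy_score"
)
toy_tidy2
```

With two countries and three year columns, each result has six rows: one per country-year combination.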
