Created by - Robert Kotaki
Data Importing & “Tidy” Data

Introduction to Tidy Data

- Data frame: a rectangular, spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation.
- In R, "tidy" data refers to a set of rules for how data is saved, outlining a standard way of formatting data.
- Having data stored in "tidy" format is essential for using the tools covered up until now.
- Tidy data is useful for data visualization, data wrangling, regression, and statistical inference.

Needed Packages

- dplyr: provides a grammar of data manipulation, with functions for filtering, selecting, arranging, grouping, and joining data frames.
- ggplot2: a powerful graphics package for creating visualizations.
- readr: provides a fast way to read CSV, TSV, and fixed-width files into R.
- tidyr: provides functions for creating and modifying tidy data frames.
- nycflights13: a package containing data frames related to flights from New York City airports.
- fivethirtyeight: a package containing datasets used by the FiveThirtyEight website.

Importing Data

Three common formats for spreadsheet data are CSV, Excel, and Google Sheets.

- A Comma Separated Values (CSV) file is a bare-bones spreadsheet where each line in the file corresponds to one row of data, that is, one observation.
- Excel .xlsx files also contain metadata, or data about data.
- Google Sheets is a "cloud" or online-based way to work with a spreadsheet.

There are two methods for importing .csv and .xlsx spreadsheet data in R: using the console and using RStudio's graphical user interface (GUI).

Using the Console

Use the read_csv() function from the readr package to read a .csv file from the internet, import it into R, and save it in a data frame.

Example:

```r
library(readr)

# Read the CSV file directly from its URL and store it as a data frame
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
dem_score
```

Using RStudio's Graphical User Interface (GUI)

- Click on "Import Dataset" in the Environment tab.
- Select the file you want to import.
- Choose the appropriate settings for the file format and click "Import".

Example: File > Import Dataset > From Text (base) > select the file > choose settings > click "Import".

Conclusion

In this chapter, we learned how to import spreadsheet data in R. Tidy data is a standard way of formatting data that is essential for using the tools covered up until now. Using the console or RStudio's graphical user interface, we can import .csv and .xlsx spreadsheet data; a Google Sheet can be imported once it has been downloaded in one of those formats.
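To try read_csv() without a network connection, the following minimal sketch writes a tiny CSV file to a temporary location and reads it back. The file contents, the country names, and the object names csv_path and toy_scores are made up for illustration; only read_csv() itself comes from the lecture.

```r
library(readr)

# Hypothetical data: two made-up countries with made-up scores.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("country,1952,1957",
    "Country A,1,2",
    "Country B,-3,4"),
  csv_path
)

# show_col_types = FALSE only silences the column-type message.
toy_scores <- read_csv(csv_path, show_col_types = FALSE)
toy_scores
```

Each line of the file after the header becomes one row of the data frame, and column names such as 1952 are kept as they appear in the file. For .xlsx files, read_excel() from the readxl package plays the analogous role.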
Published - Sat, 15 Apr 2023
Tidy Data

Introduction

Tidy data is a standard way of organizing data where each variable is a column, each observation is a row, and each type of observational unit forms a table. Tidy data allows for easy data analysis, visualization, and sharing.

Example: FiveThirtyEight Package

The fivethirtyeight package provides access to datasets used in many articles published by the data journalism website FiveThirtyEight.com. We will focus on the drinks dataset, which contains the average number of servings of beer, spirits, and wine consumed in 193 countries. The objective is to create a side-by-side barplot to compare alcohol consumption in four countries: the United States, China, Italy, and Saudi Arabia.

Data Wrangling Verbs

We will use data wrangling verbs to transform the drinks dataset into a smaller dataset called drinks_smaller.

- filter(): keep only the four countries of interest.
- select(): remove the total_litres_of_pure_alcohol column.
- rename(): rename the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine, respectively.

The resulting dataset drinks_smaller is smaller, but it is still in "wide" format: the three alcohol types are spread across three columns.

```r
library(dplyr)
library(fivethirtyeight)

drinks_smaller <- drinks %>%
  filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>%
  select(-total_litres_of_pure_alcohol) %>%
  rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
```

Creating a Side-by-Side Barplot

We need to transform the drinks_smaller dataset into a tidy format where each row represents a unique combination of country and alcohol type, together with its number of servings. We can use the pivot_longer() function to reshape the data into this format.

```r
library(tidyr)

drinks_smaller_tidy <- drinks_smaller %>%
  pivot_longer(cols = c(beer, spirit, wine),
               names_to = "type",
               values_to = "servings")
```

We can now use the ggplot() function to create a side-by-side barplot. We need to map the categorical variable country to the x-position of the bars, the numerical variable servings to the y-position (height) of the bars, and the categorical variable type to the fill color of the bars.

```r
library(ggplot2)

ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Comparing Alcohol Consumption in 4 Countries",
       x = "Country", y = "Servings")
```

In the resulting plot, each bar represents one combination of country and alcohol type, and the height of the bar shows the number of servings.
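One way to convince yourself that pivot_longer() only rearranges the data is to reshape a small table to long format and back again. The sketch below uses a hypothetical two-country table with made-up servings; the names toy_wide, toy_long, and toy_back are illustrative and not part of the lecture.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table with made-up servings for two countries.
toy_wide <- tribble(
  ~country, ~beer, ~spirit, ~wine,
  "A",         10,      20,    30,
  "B",         40,      50,    60
)

# Wide to long: one row per country and alcohol type.
toy_long <- pivot_longer(toy_wide,
                         cols = c(beer, spirit, wine),
                         names_to = "type",
                         values_to = "servings")
nrow(toy_long)  # 2 countries x 3 types

# Long back to wide: the original columns reappear.
toy_back <- pivot_wider(toy_long, names_from = type, values_from = servings)
toy_back
```

The long table has one row per bar in the barplot, which is why ggplot() can map type to the fill aesthetic only after the reshaping step.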
Tidy Data

I. Introduction

- Tidy data means that your data follows a standardized format.
- The definition of tidy data was introduced by Hadley Wickham in 2014.
- Tidy data is organized so that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

II. Variables and Observations

- A dataset is a collection of values that are usually either numbers (if quantitative) or strings (if qualitative/categorical).
- Every value belongs to a variable and an observation.
- A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
- An observation contains all values measured on the same unit (like a person, a day, or a city) across attributes.

III. Tidy Data Format

- In tidy data, each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
- Tidy data is organized in a rectangular format, like a spreadsheet.
- Tidy data is easier to work with and analyze because the data is standardized and consistently structured.

IV. Examples

- The example of non-tidy data is a table of stock prices. Although there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns.
- In tidy data format, each variable should be its own column, which is demonstrated in Table 4.2.
- The example of tidy data is a table that shows the price of Boeing stock and the weather on a particular day. Even though the variable “Boeing Price” appears just as it does in the non-tidy data, this table is tidy because its three columns correspond to three unique pieces of information: the date, the Boeing stock price, and the weather that day.

V. Code Examples

Let's convert non-tidy data to a tidy format in R, using the tidyr package.

```r
# Load the tidyr package
library(tidyr)

# Create a data frame with non-tidy data: one price column per stock
stock_prices <- data.frame(
  Date = c("2009-01-01", "2009-01-02"),
  Boeing_stock_price = c("$173.55", "$172.61"),
  Amazon_stock_price = c("$174.90", "$171.42"),
  Google_stock_price = c("$174.34", "$170.04")
)

# Use gather() to stack every column except Date into
# a Stock_name (key) column and a Stock_price (value) column
stock_prices_tidy <- gather(stock_prices, Stock_name, Stock_price, -Date)

# Print the tidy data
stock_prices_tidy
```

The output is in tidy format. Note that gather() copies the original column names into Stock_name unchanged:

```
        Date         Stock_name Stock_price
1 2009-01-01 Boeing_stock_price     $173.55
2 2009-01-02 Boeing_stock_price     $172.61
3 2009-01-01 Amazon_stock_price     $174.90
4 2009-01-02 Amazon_stock_price     $171.42
5 2009-01-01 Google_stock_price     $174.34
6 2009-01-02 Google_stock_price     $170.04
```

VI. Conclusion

- Tidy data is a standard way of mapping the meaning of a dataset to its structure.
- In tidy data, each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
- Using R and the tidyr package, we can transform non-tidy data into a tidy format.
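Because gather() copies column names verbatim into the key column, a column called Boeing_stock_price produces the key Boeing_stock_price rather than Boeing, and the prices remain strings because of the dollar sign. A possible refinement, sketched here with pivot_longer() instead of gather(), strips the suffix with names_pattern and converts the prices with parse_number() from readr. This is one option under those assumptions, not the lecture's method; the object names are illustrative.

```r
library(tidyr)
library(readr)

# Same non-tidy data as in the lecture: one price column per stock.
stock_prices <- data.frame(
  Date = c("2009-01-01", "2009-01-02"),
  Boeing_stock_price = c("$173.55", "$172.61"),
  Amazon_stock_price = c("$174.90", "$171.42"),
  Google_stock_price = c("$174.34", "$170.04")
)

# names_pattern keeps only the part of each column name
# that precedes "_stock_price".
stock_prices_clean <- pivot_longer(
  stock_prices,
  cols = -Date,
  names_to = "Stock_name",
  names_pattern = "(.*)_stock_price",
  values_to = "Stock_price"
)

# Turn strings such as "$173.55" into numbers.
stock_prices_clean$Stock_price <- parse_number(stock_prices_clean$Stock_price)
stock_prices_clean
```

With numeric prices, the tidy table can be passed directly to summaries or plots. Unlike gather(), pivot_longer() orders the rows by date first, so the three stocks for 2009-01-01 come before those for 2009-01-02.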
Converting to "tidy" data

In data analysis, it is important to have data in a "tidy" format so that it works easily with packages such as ggplot2 and dplyr. In this lecture, we will discuss how to convert data from a "wide" format to a "tidy" format using the gather() function from the tidyr package. gather() still works, but current tidyr documentation recommends pivot_longer(), used elsewhere in these notes, for new code.

What is "tidy" data?

- Tidy data is a format where each variable has its own column, each observation has its own row, and each value has its own cell.
- Tidy data makes it easy to perform operations such as filtering, summarizing, and visualizing.

Converting to "tidy" data using gather()

If the original data frame is in a "wide" format, we can use the gather() function to convert it to a "tidy" format. The gather() function takes three main arguments:

- key: the name of the variable in the new "tidy" data frame that will contain the column names of the original data.
- value: the name of the variable in the new "tidy" data frame that will contain the values from the cells of the original data.
- a third, unnamed argument: the columns you either want to or don't want to tidy.

To use gather(), we can pipe the original data frame into the function and specify the arguments. For example, suppose we have the following "wide" format data frame:

```
country       beer  spirit  wine
China           79     192     8
Italy           85      42   237
Saudi Arabia     0       5     0
USA            249     158    84
```

We can convert it to a "tidy" format using gather() as follows:

```r
drinks_smaller_tidy <- drinks_smaller %>%
  gather(key = type, value = servings, -country)
```

In this case, we set key to type since we want the column type to contain the three types of alcohol: beer, spirit, and wine. We set value to servings since we want the column servings to contain the numerical values of the original data. We set the third argument to -country to indicate that we don't want to tidy the country variable in drinks_smaller, only beer, spirit, and wine.

Visualizing "tidy" data

Once we have converted our data to a "tidy" format, we can easily visualize it using ggplot2. For example, we can create a barplot comparing alcohol consumption in different countries using the following code:

```r
ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) +
  geom_col(position = "dodge")
```

In this case, we use geom_col() since we want to map the "pre-counted" servings variable to the y-aesthetic of the bars.

Conclusion

- Converting "wide" format data to "tidy" format can be confusing for new R users.
- The gather() function from the tidyr package is a powerful tool for converting data to a "tidy" format.
- Practicing with examples and reading the documentation helps in mastering this skill.
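As a cross-check, the following sketch rebuilds the wide table shown in this lecture by hand and reshapes it with both gather() and pivot_longer(). The helper objects with_gather, with_pivot, only_in_gather, and only_in_pivot are illustrative names, and typing the table in with tribble() is an assumption made so that the example runs on its own.

```r
library(tibble)
library(tidyr)
library(dplyr)

# The wide table from the lecture, typed in by hand.
drinks_smaller <- tribble(
  ~country,       ~beer, ~spirit, ~wine,
  "China",           79,     192,     8,
  "Italy",           85,      42,   237,
  "Saudi Arabia",     0,       5,     0,
  "USA",            249,     158,    84
)

with_gather <- drinks_smaller %>%
  gather(key = type, value = servings, -country)

with_pivot <- drinks_smaller %>%
  pivot_longer(cols = -country, names_to = "type", values_to = "servings")

# Rows found in one result but not the other; zero rows in both
# directions means the two results hold the same content.
only_in_gather <- anti_join(with_gather, with_pivot,
                            by = c("country", "type", "servings"))
only_in_pivot <- anti_join(with_pivot, with_gather,
                           by = c("country", "type", "servings"))
nrow(only_in_gather)
nrow(only_in_pivot)
```

The two results contain the same rows and differ only in row order: gather() stacks the data one original column at a time, while pivot_longer() works through the table one original row at a time.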
Title: Working with Airline Safety Data

Introduction: In this lecture, we will learn how to work with the airline_safety data frame included in the fivethirtyeight data package. We will explore the data, clean it up, and convert it into a tidy format using R.

Step 1: Load the dataset

We start by loading the airline_safety dataset using the following commands:

```r
library(fivethirtyeight)
data("airline_safety")
```

Step 2: Exploring the dataset

We can use the head() and summary() functions to get a quick overview of the dataset.

```r
head(airline_safety)
summary(airline_safety)
```

Step 3: Cleaning the dataset

We will remove the incl_reg_subsidiaries and avail_seat_km_per_week columns from the dataset using the select() function from the dplyr package.

```r
library(dplyr)

airline_safety_smaller <- airline_safety %>%
  select(-c(incl_reg_subsidiaries, avail_seat_km_per_week))
```

Step 4: Converting to Tidy Format

The current format of the data frame is not tidy. We can convert it to tidy format using the tidyr package.

```r
library(tidyr)

airline_safety_tidy <- airline_safety_smaller %>%
  pivot_longer(
    cols = c(
      incidents_85_99, fatal_accidents_85_99, fatalities_85_99,
      incidents_00_14, fatal_accidents_00_14, fatalities_00_14
    ),
    names_to = "incident_type_years",
    values_to = "count"
  )
```

Step 5: Viewing the Tidy Dataset

We can use the head() function to view the first few rows of the tidy dataset.

```r
head(airline_safety_tidy)
```

Conclusion: In this lecture, we learned how to work with the airline_safety data frame using R. We explored the dataset, cleaned it up, and converted it to tidy format. The resulting dataset is easier to work with and can be used for further analysis.
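After Step 4, the incident_type_years column still packs two pieces of information, the incident type and the year range, into one string. A possible further step, which goes beyond what the lecture does, is to let pivot_longer() split the column names into two variables. The sketch below uses a hypothetical toy table with two made-up airlines and made-up counts; only the column-naming scheme is taken from the lecture.

```r
library(tibble)
library(tidyr)

# Hypothetical airlines and counts, using the lecture's column-naming scheme.
toy_safety <- tribble(
  ~airline,    ~incidents_85_99, ~fatal_accidents_85_99, ~incidents_00_14, ~fatal_accidents_00_14,
  "Airline A",                2,                      0,                1,                      0,
  "Airline B",                5,                      1,                3,                      1
)

# The regular expression captures everything before the final
# "_NN_NN" as the incident type and the "NN_NN" part as the years.
toy_safety_tidy <- pivot_longer(
  toy_safety,
  cols = -airline,
  names_to = c("incident_type", "years"),
  names_pattern = "(.*)_(\\d{2}_\\d{2})",
  values_to = "count"
)
toy_safety_tidy
```

With incident_type and years in separate columns, each can be mapped to its own aesthetic in a plot or used on its own in filter() and group_by().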
Topic: Introduction to nycflights13 package and tidy data

Recall the nycflights13 package we introduced in Section 1.4 with data about all domestic flights departing from New York City in 2013. The package contains several data frames. Let's take a look at the flights data frame.

```r
# Load nycflights13 package
library(nycflights13)

# View the flights data frame
View(flights)
```

We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 19 columns corresponding to different characteristics/measurements of each flight. This satisfies the first two criteria of the definition of “tidy” data from Subsection 4.2.1: that “Each variable forms a column” and “Each observation forms a row.”

The nycflights13 package also contains other data frames with their rows representing different observational units:

- airlines: translation between two-letter IATA carrier codes and airline company names (16 in total). The observational unit is an airline company.
- planes: aircraft information about each of the 3,322 planes used. The observational unit is an aircraft.
- weather: hourly meteorological data (about 8,705 observations) for each of the three NYC airports. The observational unit is an hourly measurement of weather at one of the three airports.
- airports: airport names and locations. The observational unit is an airport.

The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, i.e. data frame.

Case study: Democracy in Guatemala

In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (in other words, is in “wide” format) to a data frame that is in “tidy” format (in other words, is in “long/narrow” format). We’ll do this using the gather() function from the tidyr package again.

Furthermore, we’ll make use of functions from the ggplot2 and dplyr packages to produce a time-series plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala.

Let’s use the dem_score data frame we imported in Section 4.1, but focus only on data corresponding to Guatemala.

```r
# Load required packages
library(dplyr)
library(tidyr)
library(ggplot2)

# Select only data corresponding to Guatemala
guat_dem <- dem_score %>%
  filter(country == "Guatemala")

# View guat_dem
guat_dem
```

We can see that guat_dem is not in “tidy” format. We need to take the names of the columns corresponding to years in guat_dem and convert them into a new “key” variable called year. Furthermore, we need to take the democracy score values inside the data frame and turn them into a new “value” variable called democracy_score.

Our resulting data frame will thus have three columns: country, year, and democracy_score. Recall that the gather() function in the tidyr package can complete this task for us:

```r
guat_dem_tidy <- guat_dem %>%
  gather(key = year, value = democracy_score, -country)

# View guat_dem_tidy
guat_dem_tidy
```

We set the arguments to gather() as follows:

- key is the name of the variable in the new data frame that will contain the column names of the original data frame
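The time-series plot mentioned in this case study could be sketched as follows. To keep the example self-contained, it uses a hypothetical one-row data frame, toy_dem, for a fictional country with made-up scores rather than the real dem_score data; the same steps should apply to guat_dem. One detail worth noting is that gather() stores the former column names as text, so year is converted to a number before plotting.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(ggplot2)

# Hypothetical country with made-up democracy scores (not real data).
toy_dem <- tribble(
  ~country,      ~`1952`, ~`1972`, ~`1992`,
  "Exampleland",      -2,       1,       5
)

toy_dem_tidy <- toy_dem %>%
  gather(key = year, value = democracy_score, -country) %>%
  # gather() leaves year as character ("1952"); make it numeric
  # so that geom_line() connects the points along a time axis.
  mutate(year = as.integer(year))

ggplot(toy_dem_tidy, aes(x = year, y = democracy_score)) +
  geom_line() +
  labs(x = "Year", y = "Democracy score")
```

Leaving year as text would make ggplot2 treat each year as its own category, so the line would not be drawn without additionally setting group = 1.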
Tidyverse Package

The tidyverse package is a collection of packages for data science in R. It includes some of the most frequently used packages, such as dplyr, ggplot2, readr, and tidyr. By installing and loading the tidyverse package, you can load multiple packages at once.

Installation

To install the tidyverse package, use the following command:

```r
install.packages("tidyverse")
```

Loading Packages

To load the tidyverse package, use the following command:

```r
library(tidyverse)
```

This has the same effect as loading the core tidyverse packages individually, including:

```r
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(tibble)
library(stringr)
library(forcats)
```

Common Inputs and Outputs

Functions in the tidyverse packages are designed to have common inputs and outputs, which are data frames in "tidy" format. This standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible.

For example, the following code demonstrates the use of dplyr and ggplot2 functions on a data frame:

```r
library(tidyverse)

# Create a sample data frame
data <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# Use dplyr to filter data
filtered_data <- data %>%
  filter(x > 1)

# Use ggplot2 to create a plot
ggplot(filtered_data, aes(x, y)) +
  geom_point()
```

This code filters the data frame to include only rows where x is greater than 1 and then creates a scatter plot of the remaining data using ggplot2.

By using the tidyverse package, you can easily move between different packages and functions, making it a powerful tool for data science in R.

Lecture Notes: Introduction to the Tidyverse Package

Overview

- The tidyverse package is an "umbrella" package that installs and loads multiple packages at once for you.
- The tidyverse package includes some of the most frequently used R packages for data science.
- The tidyverse package is designed to standardize input and output data frames, making transitions between different functions in the different packages as seamless as possible.

Installing and Loading the Tidyverse Package

- Install the tidyverse package using install.packages("tidyverse").
- Load the tidyverse package using library(tidyverse).

```r
# Option 1: load the packages individually
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)

# Option 2: load them all at once through the tidyverse package
library(tidyverse)
```

Common Packages in the Tidyverse Package

- ggplot2 for data visualization
- dplyr for data wrangling
- tidyr for converting data to “tidy” format
- readr for importing spreadsheet data into R

Using the Tidyverse Package

- Use library(tidyverse) to load the core packages in the tidyverse package.
- Functions in the tidyverse package have common inputs and outputs: data frames in "tidy" format.
- For more information, check out the tidyverse.org webpage for the package.

```r
# Load tidyverse package
library(tidyverse)

# Example of using functions in the tidyverse package:
# filter with dplyr, then plot with ggplot2
data(mpg)
mpg %>%
  filter(class == "subcompact") %>%
  ggplot() +
  aes(x = displ, y = hwy, color = manufacturer) +
  geom_point()
```

Conclusion

- The tidyverse package is an essential package for data science in R.
- It includes multiple packages for data visualization, data wrangling, importing data, and converting data to "tidy" format.
- Loading the tidyverse package takes less typing than loading the individual packages one by one.
- The standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible.
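The "common inputs and outputs" idea can be illustrated with a short pipeline in which the output of a tidyr function feeds directly into dplyr functions. This is a minimal sketch on a hypothetical table of made-up quiz scores; the names scores_wide and avg_scores are illustrative. The packages are loaded individually here, and library(tidyverse) would attach all three as well.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical wide table of made-up quiz scores.
scores_wide <- tribble(
  ~student, ~quiz1, ~quiz2,
  "Ana",         8,     10,
  "Ben",         6,      9
)

avg_scores <- scores_wide %>%
  # tidyr: one row per student and quiz
  pivot_longer(cols = -student, names_to = "quiz", values_to = "score") %>%
  # dplyr: average score per student
  group_by(student) %>%
  summarise(mean_score = mean(score))

avg_scores
```

Each step takes a data frame and returns a data frame, so no conversion is needed between the tidyr and dplyr calls, and the result could be passed straight on to ggplot().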