2.2 Crafts

We will focus on six data wrangling functions in the dplyr package:

  • filter(): extracts rows (observations) of a data frame based on logical vectors.

  • select(): extracts columns (variables) of a data frame based on column names.

  • arrange(): orders rows of a data frame based on column names.

  • summarise(): collapses a data frame into summary statistics based on summary functions (e.g., statistics functions) specified with column names.

  • mutate(): creates new variables and adds them to the existing columns based on window functions (e.g., transforming operations) specified with column names.

  • group_by(): assigns rows into groups within a data frame based on column names.

The very first argument in all these functions is a data frame, followed by logical vectors, column names, or other kinds of items. This allows for applying these functions sequentially through the first argument; for example, func3(func2(func1(data,...), ...), ...). This can be rewritten as data %>% func1(...) %>% func2(...) %>% func3(...) via a pipe operator %>%.
This lets us express a sequence of data wrangling operations in plain English. Specifically, we read %>% as then, so that the above example becomes; start with the data, then apply func1(...), then apply func2(...), then func3(..).

Say, we want to find the average of delays in departures and arrivals from New York to the St. Paul-Minneapolis airport (MSP). We can construct the following sequence of instructions: start with the flight data frame, apply filter() to extract the rows of flights to MSP, and then apply summarise() to calculate the mean.

flights %>%  # data frame "flights", then
  filter(dest == "MSP") %>%  # filter rows, then  
  summarise(   
    # summarise departure and arrival delays for their means 
    # and call them mean_dep_delay and mean_arr_delay respectively
    mean_dep_delay = mean(dep_delay, na.rm = TRUE), 
    mean_arr_delay = mean(arr_delay, na.rm = TRUE) 
    )    # calculate the mean, while removing NA values  
## # A tibble: 1 x 2
##   mean_dep_delay mean_arr_delay
##            <dbl>          <dbl>
## 1           13.3           7.27

In summarise(), one can use summary functions that maps a vector to a scaler. Examples include functions like mean(), sd() (standard deviation), quantile(), min(), max(), and n() (observation count in the dplyr package).

Each time we apply the %>% operator above, we pass a modified data frame from one data wrangling operation to another through the first argument. The above code is equivalent to

summarise(   # data frame "flights" is inside filter(), which is inside summarise() 
    filter(flights, dest == "MSP"), 
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
    )
## # A tibble: 1 x 2
##   mean_dep_delay mean_arr_delay
##            <dbl>          <dbl>
## 1           13.3           7.27

You will quickly discover that %>% operator makes the code much easier to read, write, and edit and how that inspires you to explore the data more.

Let’s add a few more lines to the previous example. Say, now we want to see the average delay by carrier and sort the results by the number of observations (e.g. flights) in descending order.

Okay, what do we do? We make a sequence of data wrangling operations in plain English and translate that into code by replacing then with pipe operator %>%.
For example, try thinking this way; “start with the data frame flights; then (%>%) filter() to extract the rows of flights to MSP; then group rows by carrier; then summarise() data for the number of observations and the means; then arrange() the results by the observation count in descending order.”

flight_stats_MSP <- flights %>%  # assign the results to an object named "flight_stats"
  filter(dest == "MSP") %>% 
  group_by(carrier) %>%  #  group rows by carrier 
  summarise(
    n_obs = n(),  # count number of rows 
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  arrange(desc(n_obs))  # sort by n_obs in descending order

flight_stats_MSP  # show flight_stats object
## # A tibble: 6 x 4
##   carrier n_obs mean_dep_delay mean_arr_delay
##   <chr>   <int>          <dbl>          <dbl>
## 1 DL       2864         10.7             4.04
## 2 EV       1773         17.1            10.5 
## 3 MQ       1293          8.26            9.56
## 4 9E       1249         19.7             8.09
## 5 OO          4          0.750          -2.00
## 6 UA          2         -6.00           -5.50

Tip #1: Frame tasks in a sequence and use pipe operator %>% accordingly.

The carrier variable is expressed in the International Air Transportation Association (IATA) code, so let’s add a column of carrier names by joining another data frame called airlines. In RStudio, you can find this data frame under the Environment tab (in the upper right corner); switch the display option from Global Environment to package:nycflights13. To inspect the data frame, type View(airlines) in the R console. Also, by typing data() you can see a list of all datasets that are loaded with libraries.

# left_join(a,b, by="var") joins two data frames a and b by matching rows of b to a 
  # by identifier variable "var".
# kable() prints a better-looking table here
left_join(flight_stats_MSP, airlines, by="carrier") %>%
  kable(digits=2) 
carrier n_obs mean_dep_delay mean_arr_delay name
DL 2864 10.65 4.04 Delta Air Lines Inc.
EV 1773 17.09 10.53 ExpressJet Airlines Inc.
MQ 1293 8.26 9.56 Envoy Air
9E 1249 19.66 8.09 Endeavor Air Inc.
OO 4 0.75 -2.00 SkyWest Airlines Inc.
UA 2 -6.00 -5.50 United Air Lines Inc.

In the next example, we add new variables to flights using mutate().

flights %>%
  # keep only columns named "dep_delay" and "arr_delay"
  select(dep_delay, arr_delay) %>% 
  mutate(
    gain = arr_delay - dep_delay,
    gain_rank = round(percent_rank(gain), digits = 2)
      # Note: we can immediately use the "gain" variable we just defined. 
  )
## # A tibble: 336,776 x 4
##    dep_delay arr_delay  gain gain_rank
##        <dbl>     <dbl> <dbl>     <dbl>
##  1        2.       11.    9.     0.810
##  2        4.       20.   16.     0.880
##  3        2.       33.   31.     0.940
##  4       -1.      -18.  -17.     0.220
##  5       -6.      -25.  -19.     0.180
##  6       -4.       12.   16.     0.880
##  7       -5.       19.   24.     0.920
##  8       -3.      -14.  -11.     0.370
##  9       -3.       -8.   -5.     0.540
## 10       -2.        8.   10.     0.820
## # ... with 336,766 more rows

We extracted specific columns of flights by select() and added new columns defined in mutate(). mutate() differs from summarise() in that mutate() adds new columns to the data frame, while summarise() collapses the data frame into a summary table.

There are roughly five types of window functions that are commonly used inside mutate():

    1. summary functions, which are interpreted as a vector of repeated values (e.g., a column of an identical mean value)
    1. ranking or ordering functions (e.g., row_number(), min_rank(), dense_rank(), cume_dist(), percent_rank(), and ntile())
    1. offset functions, say defining a lagged variable in time series data (lead() and lag())
    1. cumulative aggregates (e.g., cumsum(), cummin(), cummax(), cumall(), cumany(), and cummean())
    1. fixed-window rolling aggregates such as a windowed mean, median, etc.
      To look up documentations of these function, for example, type ?cumsum.

Tip #2: In the beginning you might find it confusing to choose between summarise() and mutate(). Just remeber mutate() is for creating new variables or overwriting existing variables.

By no means, the use of summarise() and mutate() is not restricted to these functions listed above. You can define your own function and apply inside summarise() or mutate(). Let’s quickly go over what a function is in R and how you can use a custom function in the tidyverse syntax.

In R, we use function() to define a function, which consists of a function name, input arguments separated by comma, and a body containing tasks to be performed (multiple expressions are bundled by brackets { }, and the last expression is returned as an output).

your_function_name <- function(input_arg1, input_arg2) {
                        task1
                        task2
                        .
                        .
                        .
                        output_to_return 
                      } 

For a function having only a single expression to execute, we can omit brackets { }.

another_function <- function(input args) task_and_output_in_a_single_expression                    

Let’s go through a few examples.

vec1 <- c(1:10, NA, NA)
vec1
##  [1]  1  2  3  4  5  6  7  8  9 10 NA NA
my_mean <- function(x, na.rm=TRUE)  mean(x, na.rm = na.rm)   
# sets the default of "na.rm" argument to be TRUE. 

mean(vec1) # returns NA
## [1] NA
my_mean(vec1)
## [1] 5.5
my_mean(vec1, na.rm=FALSE)  # returns NA
## [1] NA
my_zscore <- function(x, remove_na=TRUE, digits=2) { 
  zscore <- (x - my_mean(x, na.rm = remove_na))/sd(x, na.rm = remove_na)  
  round(zscore, digits=digits)
}
  # calculates a z-score of vector x

my_zscore(vec1)
##  [1] -1.49 -1.16 -0.83 -0.50 -0.17  0.17  0.50  0.83  1.16  1.49    NA
## [12]    NA

Let’s try using functions my_mean() and my_zscore() in summarise() and mutate().

flights %>% 
  select(dep_delay) %>% 
  summarise(
    mean_dep_delay_na = mean(dep_delay),  # returns NA
    mean_dep_delay_1 = mean(dep_delay, na.rm = TRUE), # passing argument na.rm = TRUE
    mean_dep_delay_2 = my_mean(dep_delay)  # using my_mean()  
  ) %>%
  kable(digits=2)
mean_dep_delay_na mean_dep_delay_1 mean_dep_delay_2
NA 12.64 12.64
flights_gain <- flights %>%
  select(dep_delay, arr_delay) %>% 
  mutate(
    gain = arr_delay - dep_delay,
    gain_mean = my_mean(gain),  # returns the same mean for all rows
    gain_z2 = my_zscore(gain)  # using my_zscore()   
  )

head(flights_gain) %>%  # show the first several rows
  kable(digits=2)
dep_delay arr_delay gain gain_mean gain_z2
2 11 9 -5.66 0.81
4 20 16 -5.66 1.20
2 33 31 -5.66 2.03
-1 -18 -17 -5.66 -0.63
-6 -25 -19 -5.66 -0.74
-4 12 16 -5.66 1.20

summarise_all() and mutate_all() apply summary functions like mean() and sd() to all columns in a data frame. But, we can also use any custom function that returns appropriate output (i.e., a scalar output for summarise_all() and a vector output for mutate_all()). For example, here are several ways to calculate the means when the data frame contains missing values.

# Calculation fails due to NA values 
flights_gain %>% 
  select(dep_delay, arr_delay, gain)  %>%
  summarise_all(mean) %>%  
  kable(digits=2) 
dep_delay arr_delay gain
NA NA NA
# filter rows that contain any NA value 
flights_gain %>% 
  select(dep_delay, arr_delay, gain)  %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%  
  summarise_all(mean) %>%  
  kable(digits=2) 
dep_delay arr_delay gain
12.56 6.9 -5.66
# pass argument na.rm=TRUE to mean()
flights_gain %>% 
  select(dep_delay, arr_delay, gain) %>%
  summarise_all(funs(mean), na.rm=TRUE) %>%
    kable(digits=2)
dep_delay arr_delay gain
12.64 6.9 -5.66
# use a custom function with default na.rm=TRUE
flights_gain %>% 
  select(dep_delay, arr_delay, gain) %>%
  summarise_all(funs(my_mean)) %>%
    kable(digits=2)
dep_delay arr_delay gain
12.64 6.9 -5.66

To practice the tydiverse syntax above, start with data extraction by combining filter(), select(), and arrange(). Then, try summarising data via summarise() and creating new variables via mutate(), followed by adding additional layers of tasks such as grouping statistics via group_by() and filtering data by filter().

Exercise

Try the following exercises using flights data.

  • Find the set of all origin-carrier combinations whose destination is MSP. Hint: use filter(), select(), and unique(). (Ans.: 10 unique combinations.)
  • Sort that by origin. Hint: use arrange().
  • Find the mean and standard deviation of air time from New York to MSP. Hint: use filter() and summarise().
  • Find the average arrival delay as a percentage of air time from New York to MSP. Hint: use filter(), mutate(), and summarise(). (Ans.: 4.50%.)
  • Do the above calculation by carrier. Hint: use filter(), mutate(), group_by(), and summarise(). (Ans.: ranges from -3.57% to 4.97%.)
  • Do the above calculation by origin and carrier and sort the results by origin.
  • Do that in z-score using the grand mean and grand s.d. to find any origin-carrier combination exhibits statistically significant arrival delay (in percentage of air time). Hint: do the following all inside mutate(), define a percentage arrival delay variable, define variables for its mean and s.d., and define a z-statistic using those three variables. (Ans.: ranges from -0.118 to 1.584.)

Take time to do these exercises since we will be using the tydiverse syntax in next section.