2.2 Crafts
We will focus on six data wrangling functions in the dplyr
package:
filter()
: extracts rows (observations) of a data frame based on logical vectors.select()
: extracts columns (variables) of a data frame based on column names.arrange()
: orders rows of a data frame based on column names.summarise()
: collapses a data frame into summary statistics based on summary functions (e.g., statistics functions) specified with column names.mutate()
: creates new variables and adds them to the existing columns based on window functions (e.g., transforming operations) specified with column names.group_by()
: assigns rows into groups within a data frame based on column names.
The very first argument in all these functions is a data frame, followed by logical vectors, column names, or other kinds of items. This allows for applying these functions sequentially through the first argument; for example, func3(func2(func1(data,...), ...), ...)
. This can be rewritten as data %>% func1(...) %>% func2(...) %>% func3(...)
via a pipe operator %>%
.
This lets us express a sequence of data wrangling operations in plain English. Specifically, we read %>%
as then, so that the above example becomes; start with the data, then apply func1(...)
, then apply func2(...)
, then func3(..)
.
Say, we want to find the average of delays in departures and arrivals from New York to the St. Paul-Minneapolis airport (MSP). We can construct the following sequence of instructions: start with the flight data frame, apply filter()
to extract the rows of flights to MSP, and then apply summarise()
to calculate the mean.
flights %>% # data frame "flights", then
filter(dest == "MSP") %>% # filter rows, then
summarise(
# summarise departure and arrival delays for their means
# and call them mean_dep_delay and mean_arr_delay respectively
mean_dep_delay = mean(dep_delay, na.rm = TRUE),
mean_arr_delay = mean(arr_delay, na.rm = TRUE)
) # calculate the mean, while removing NA values
## # A tibble: 1 x 2
## mean_dep_delay mean_arr_delay
## <dbl> <dbl>
## 1 13.3 7.27
In summarise()
, one can use summary functions that maps a vector to a scaler. Examples include functions like mean()
, sd()
(standard deviation), quantile()
, min()
, max()
, and n()
(observation count in the dplyr
package).
Each time we apply the %>%
operator above, we pass a modified data frame from one data wrangling operation to another through the first argument. The above code is equivalent to
summarise( # data frame "flights" is inside filter(), which is inside summarise()
filter(flights, dest == "MSP"),
mean_dep_delay = mean(dep_delay, na.rm = TRUE),
mean_arr_delay = mean(arr_delay, na.rm = TRUE)
)
## # A tibble: 1 x 2
## mean_dep_delay mean_arr_delay
## <dbl> <dbl>
## 1 13.3 7.27
You will quickly discover that %>%
operator makes the code much easier to read, write, and edit and how that inspires you to explore the data more.
Let’s add a few more lines to the previous example. Say, now we want to see the average delay by carrier and sort the results by the number of observations (e.g. flights) in descending order.
Okay, what do we do? We make a sequence of data wrangling operations in plain English and translate that into code by replacing then with pipe operator %>%
.
For example, try thinking this way; “start with the data frame flights
; then (%>%
) filter()
to extract the rows of flights to MSP; then group rows by carrier; then summarise()
data for the number of observations and the means; then arrange()
the results by the observation count in descending order.”
flight_stats_MSP <- flights %>% # assign the results to an object named "flight_stats"
filter(dest == "MSP") %>%
group_by(carrier) %>% # group rows by carrier
summarise(
n_obs = n(), # count number of rows
mean_dep_delay = mean(dep_delay, na.rm = TRUE),
mean_arr_delay = mean(arr_delay, na.rm = TRUE)
) %>%
arrange(desc(n_obs)) # sort by n_obs in descending order
flight_stats_MSP # show flight_stats object
## # A tibble: 6 x 4
## carrier n_obs mean_dep_delay mean_arr_delay
## <chr> <int> <dbl> <dbl>
## 1 DL 2864 10.7 4.04
## 2 EV 1773 17.1 10.5
## 3 MQ 1293 8.26 9.56
## 4 9E 1249 19.7 8.09
## 5 OO 4 0.750 -2.00
## 6 UA 2 -6.00 -5.50
Tip #1: Frame tasks in a sequence and use pipe operator %>% accordingly.
The carrier variable is expressed in the International Air Transportation Association (IATA) code, so let’s add a column of carrier names by joining another data frame called airlines
. In RStudio, you can find this data frame under the Environment tab (in the upper right corner); switch the display option from Global Environment to package:nycflights13. To inspect the data frame, type View(airlines)
in the R console. Also, by typing data()
you can see a list of all datasets that are loaded with libraries.
# left_join(a,b, by="var") joins two data frames a and b by matching rows of b to a
# by identifier variable "var".
# kable() prints a better-looking table here
left_join(flight_stats_MSP, airlines, by="carrier") %>%
kable(digits=2)
carrier | n_obs | mean_dep_delay | mean_arr_delay | name |
---|---|---|---|---|
DL | 2864 | 10.65 | 4.04 | Delta Air Lines Inc. |
EV | 1773 | 17.09 | 10.53 | ExpressJet Airlines Inc. |
MQ | 1293 | 8.26 | 9.56 | Envoy Air |
9E | 1249 | 19.66 | 8.09 | Endeavor Air Inc. |
OO | 4 | 0.75 | -2.00 | SkyWest Airlines Inc. |
UA | 2 | -6.00 | -5.50 | United Air Lines Inc. |
In the next example, we add new variables to flights
using mutate()
.
flights %>%
# keep only columns named "dep_delay" and "arr_delay"
select(dep_delay, arr_delay) %>%
mutate(
gain = arr_delay - dep_delay,
gain_rank = round(percent_rank(gain), digits = 2)
# Note: we can immediately use the "gain" variable we just defined.
)
## # A tibble: 336,776 x 4
## dep_delay arr_delay gain gain_rank
## <dbl> <dbl> <dbl> <dbl>
## 1 2. 11. 9. 0.810
## 2 4. 20. 16. 0.880
## 3 2. 33. 31. 0.940
## 4 -1. -18. -17. 0.220
## 5 -6. -25. -19. 0.180
## 6 -4. 12. 16. 0.880
## 7 -5. 19. 24. 0.920
## 8 -3. -14. -11. 0.370
## 9 -3. -8. -5. 0.540
## 10 -2. 8. 10. 0.820
## # ... with 336,766 more rows
We extracted specific columns of flights
by select()
and added new columns defined in mutate()
. mutate()
differs from summarise()
in that mutate()
adds new columns to the data frame, while summarise()
collapses the data frame into a summary table.
There are roughly five types of window functions that are commonly used inside mutate()
:
- summary functions, which are interpreted as a vector of repeated values (e.g., a column of an identical mean value)
- ranking or ordering functions (e.g.,
row_number()
,min_rank()
,dense_rank()
,cume_dist()
,percent_rank()
, andntile()
)
- ranking or ordering functions (e.g.,
- offset functions, say defining a lagged variable in time series data (
lead()
andlag()
)
- offset functions, say defining a lagged variable in time series data (
- cumulative aggregates (e.g.,
cumsum()
,cummin()
,cummax()
,cumall()
,cumany()
, andcummean()
)
- cumulative aggregates (e.g.,
- fixed-window rolling aggregates such as a windowed mean, median, etc.
To look up documentations of these function, for example, type?cumsum
.
- fixed-window rolling aggregates such as a windowed mean, median, etc.
Tip #2: In the beginning you might find it confusing to choose between summarise()
and mutate()
. Just remeber mutate()
is for creating new variables or overwriting existing variables.
By no means, the use of summarise()
and mutate()
is not restricted to these functions listed above. You can define your own function and apply inside summarise()
or mutate()
. Let’s quickly go over what a function is in R and how you can use a custom function in the tidyverse syntax.
In R, we use function()
to define a function, which consists of a function name, input arguments separated by comma, and a body containing tasks to be performed (multiple expressions are bundled by brackets { }
, and the last expression is returned as an output).
your_function_name <- function(input_arg1, input_arg2) {
task1
task2
.
.
.
output_to_return
}
For a function having only a single expression to execute, we can omit brackets { }
.
another_function <- function(input args) task_and_output_in_a_single_expression
Let’s go through a few examples.
vec1 <- c(1:10, NA, NA)
vec1
## [1] 1 2 3 4 5 6 7 8 9 10 NA NA
my_mean <- function(x, na.rm=TRUE) mean(x, na.rm = na.rm)
# sets the default of "na.rm" argument to be TRUE.
mean(vec1) # returns NA
## [1] NA
my_mean(vec1)
## [1] 5.5
my_mean(vec1, na.rm=FALSE) # returns NA
## [1] NA
my_zscore <- function(x, remove_na=TRUE, digits=2) {
zscore <- (x - my_mean(x, na.rm = remove_na))/sd(x, na.rm = remove_na)
round(zscore, digits=digits)
}
# calculates a z-score of vector x
my_zscore(vec1)
## [1] -1.49 -1.16 -0.83 -0.50 -0.17 0.17 0.50 0.83 1.16 1.49 NA
## [12] NA
Let’s try using functions my_mean()
and my_zscore()
in summarise()
and mutate()
.
flights %>%
select(dep_delay) %>%
summarise(
mean_dep_delay_na = mean(dep_delay), # returns NA
mean_dep_delay_1 = mean(dep_delay, na.rm = TRUE), # passing argument na.rm = TRUE
mean_dep_delay_2 = my_mean(dep_delay) # using my_mean()
) %>%
kable(digits=2)
mean_dep_delay_na | mean_dep_delay_1 | mean_dep_delay_2 |
---|---|---|
NA | 12.64 | 12.64 |
flights_gain <- flights %>%
select(dep_delay, arr_delay) %>%
mutate(
gain = arr_delay - dep_delay,
gain_mean = my_mean(gain), # returns the same mean for all rows
gain_z2 = my_zscore(gain) # using my_zscore()
)
head(flights_gain) %>% # show the first several rows
kable(digits=2)
dep_delay | arr_delay | gain | gain_mean | gain_z2 |
---|---|---|---|---|
2 | 11 | 9 | -5.66 | 0.81 |
4 | 20 | 16 | -5.66 | 1.20 |
2 | 33 | 31 | -5.66 | 2.03 |
-1 | -18 | -17 | -5.66 | -0.63 |
-6 | -25 | -19 | -5.66 | -0.74 |
-4 | 12 | 16 | -5.66 | 1.20 |
summarise_all()
and mutate_all()
apply summary functions like mean()
and sd()
to all columns in a data frame. But, we can also use any custom function that returns appropriate output (i.e., a scalar output for summarise_all()
and a vector output for mutate_all()
). For example, here are several ways to calculate the means when the data frame contains missing values.
# Calculation fails due to NA values
flights_gain %>%
select(dep_delay, arr_delay, gain) %>%
summarise_all(mean) %>%
kable(digits=2)
dep_delay | arr_delay | gain |
---|---|---|
NA | NA | NA |
# filter rows that contain any NA value
flights_gain %>%
select(dep_delay, arr_delay, gain) %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
summarise_all(mean) %>%
kable(digits=2)
dep_delay | arr_delay | gain |
---|---|---|
12.56 | 6.9 | -5.66 |
# pass argument na.rm=TRUE to mean()
flights_gain %>%
select(dep_delay, arr_delay, gain) %>%
summarise_all(funs(mean), na.rm=TRUE) %>%
kable(digits=2)
dep_delay | arr_delay | gain |
---|---|---|
12.64 | 6.9 | -5.66 |
# use a custom function with default na.rm=TRUE
flights_gain %>%
select(dep_delay, arr_delay, gain) %>%
summarise_all(funs(my_mean)) %>%
kable(digits=2)
dep_delay | arr_delay | gain |
---|---|---|
12.64 | 6.9 | -5.66 |
To practice the tydiverse syntax above, start with data extraction by combining filter()
, select()
, and arrange()
. Then, try summarising data via summarise()
and creating new variables via mutate()
, followed by adding additional layers of tasks such as grouping statistics via group_by()
and filtering data by filter()
.
Exercise
Try the following exercises using flights
data.
- Find the set of all origin-carrier combinations whose destination is MSP. Hint: use
filter()
,select()
, andunique()
. (Ans.: 10 unique combinations.) - Sort that by origin. Hint: use
arrange()
.
- Find the mean and standard deviation of air time from New York to MSP. Hint: use
filter()
andsummarise()
. - Find the average arrival delay as a percentage of air time from New York to MSP. Hint: use
filter()
,mutate()
, andsummarise()
. (Ans.: 4.50%.) - Do the above calculation by carrier. Hint: use
filter()
,mutate()
,group_by()
, andsummarise()
. (Ans.: ranges from -3.57% to 4.97%.) - Do the above calculation by origin and carrier and sort the results by origin.
- Do that in z-score using the grand mean and grand s.d. to find any origin-carrier combination exhibits statistically significant arrival delay (in percentage of air time). Hint: do the following all inside
mutate()
, define a percentage arrival delay variable, define variables for its mean and s.d., and define a z-statistic using those three variables. (Ans.: ranges from -0.118 to 1.584.)
Take time to do these exercises since we will be using the tydiverse syntax in next section.