2.4 More examples
The rest of the section provides more examples of dplyr and ggplot2 functions in the context of continued exploration of the flight data.
Let’s see if there are distinct patterns of departure delays over the course of a year. We can do this by taking the average of departure delays for each day by flight origin and plotting the data as a time series using line graphs.
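As a minimal sketch of this averaging step on made-up numbers (the `toy` data frame is hypothetical and not part of the flight data), note how `na.rm = TRUE` drops missing delays, and how a group with no recorded delay at all yields `NaN`, which `is.na()` also flags:

```r
library(dplyr)

# Hypothetical toy data: two origins on one day; NA marks a missing delay
toy <- tibble(
  origin    = c("EWR", "EWR", "EWR", "JFK", "JFK"),
  day       = c(1, 1, 1, 1, 1),
  dep_delay = c(10, NA, 20, NA, NA)
)

toy_avg <- toy %>%
  group_by(origin, day) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup()

toy_avg
# EWR averages its two observed values (10 and 20);
# JFK has no observed values, so its mean is NaN

toy_avg %>% filter(!is.na(dep_delay)) # is.na(NaN) is TRUE, so JFK is dropped
```

This is why a filter on `!is.na(dep_delay)` is still useful after averaging with `na.rm = TRUE`.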
delay_day <- flights %>%
  group_by(origin, year, month, day) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(date = as.Date(paste(year, month, day), format = "%Y %m %d")) %>%
  filter(!is.na(dep_delay)) # exclude rows where the mean dep_delay is missing
delay_day %>% # "facet_grid( var ~ .)" is similar to "facet_wrap( ~ var)"
ggplot(aes(x = date, y = dep_delay)) + geom_line() + facet_grid( origin ~ . )
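To make the comparison between the two faceting functions concrete, here is a small sketch on made-up data (the `toy` data frame and its column names are hypothetical). Both calls draw one panel per group stacked vertically; they differ mainly in where the strip labels are placed.

```r
library(ggplot2)

# Hypothetical toy data: three groups observed at five time points
toy <- data.frame(
  t     = rep(1:5, times = 3),
  y     = c(1, 3, 2, 5, 4,  2, 2, 3, 3, 4,  5, 4, 3, 2, 1),
  group = rep(c("A", "B", "C"), each = 5)
)

p <- ggplot(toy, aes(x = t, y = y)) + geom_line()

p_grid <- p + facet_grid(group ~ .)          # one row per group, labels on the right
p_wrap <- p + facet_wrap(~ group, ncol = 1)  # one column of panels, labels on top

p_grid
p_wrap
```

Without `ncol = 1`, `facet_wrap()` would lay the panels out in a grid of its own choosing, which is the main practical difference when there are many groups.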
Seasonal patterns seem similar across airports, and delays appear to be longer on average in the summer months. Let’s see how closely these patterns compare across the three line graphs (EWR, JFK, and LGA) in the summer months.
delay_day %>%
  filter(date >= as.Date("2013-07-01"), date <= as.Date("2013-08-31")) %>%
  ggplot(aes(x = date, y = dep_delay, color = origin)) + geom_line()
We can see similar patterns of spikes across airports occurring on certain days, indicating a tendency for the three airports to get busy on the same days. Would this mean that the three airports tend to be congested at the same time?
In the previous figure, there seems to be some cyclical pattern of delays. A good place to start would be comparing delays by day of the week. Here is a function to calculate day of the week for a given date.
# Input: date in the format as in "2017-01-23"
# Output: day of week
my_dow <- function(date) {
  # as.POSIXlt(date)[['wday']] returns integers 0, 1, 2, ..., 6 for Sun, Mon, ..., Sat.
  # We extract one item from a vector (Sun, Mon, ..., Sat) by position numbered from 1 to 7.
  dow <- as.POSIXlt(date)[['wday']] + 1
  c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")[dow] # extract "dow"-th element
}
Sys.Date() # Sys.Date() returns the current date
## [1] "2018-06-22"
my_dow(Sys.Date())
## [1] "Fri"
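For comparison, base R can also report the day of the week directly. The sketch below uses a fixed date rather than the current one. Note that `weekdays()` and `format(x, "%a")` return names in the language of the session's locale, so a hand-written lookup like the one in `my_dow()` is one way to get the same English labels everywhere.

```r
d <- as.Date("2013-01-01") # a Tuesday

as.POSIXlt(d)$wday # 2, counting from Sunday = 0
weekdays(d)        # full weekday name, in the session's locale
format(d, "%a")    # abbreviated weekday name, in the session's locale
```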
Now, let’s take a look at the mean delay by day of the week using boxplots.
delay_day <- flights %>%
  group_by(year, month, day) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(date = as.Date(paste(year, month, day), format = "%Y %m %d"),
         # date defined by as.Date() function
         dow = my_dow(date),
         weekend = dow %in% c("Sat", "Sun")
         # %in% operator: A %in% B returns TRUE/FALSE for whether each element of A is in B.
  )
# show the first 10 elements of "dow" variable in "delay_day" data frame
delay_day$dow[1:10]
## [1] "Tue" "Wed" "Thu" "Fri" "Sat" "Sun" "Mon" "Tue" "Wed" "Thu"
delay_day <- delay_day %>% mutate(
# add a sorting order (Mon, Tue, ..., Sun) and overwrite dow
dow = ordered(dow,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
)
delay_day$dow[1:10]
## [1] Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu
## Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
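A short sketch of what the ordering changes, using a made-up vector: plain character values sort alphabetically, while an ordered factor sorts and compares by its levels. ggplot2 likewise arranges a discrete axis by factor levels, which is why the conversion matters for the plot that follows.

```r
dow_chr <- c("Sun", "Mon", "Sat", "Tue")

sort(dow_chr) # alphabetical: "Mon" "Sat" "Sun" "Tue"

dow_ord <- ordered(dow_chr,
                   levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

sort(dow_ord)   # by level: Mon Tue Sat Sun
dow_ord < "Sat" # comparisons follow the level order: FALSE TRUE FALSE TRUE
```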
delay_day %>%
filter(!is.na(dep_delay)) %>%
ggplot(aes(x = dow, y = dep_delay, fill = weekend)) + geom_boxplot()
It appears that delays are on average longer on Thursdays and Fridays and shorter on Saturdays. This is plausible if more people are traveling on Thursdays and Fridays before the weekend, and fewer are traveling on Saturdays to enjoy the weekend. Are Saturdays really less busy? Let’s find out.
flights_dow <- flights %>%
mutate(date = as.Date(paste(year, month, day), format="%Y %m %d"),
dow = ordered(my_dow(date),
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
weekend = dow %in% c("Sat", "Sun")
)
# count the number of flights by day of the week
flights_dow %>%
group_by(dow) %>%
summarise( nobs = n() )
## # A tibble: 7 x 2
## dow nobs
## <ord> <int>
## 1 Mon 50690
## 2 Tue 50422
## 3 Wed 50060
## 4 Thu 50219
## 5 Fri 50308
## 6 Sat 38720
## 7 Sun 46357
# visualize this as a bar plot
flights_dow %>%
ggplot(aes(x = dow)) + geom_bar()
Yes, Saturdays are less busy for the airports in terms of flight numbers.
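The group-and-count step can also be written more compactly with dplyr's `count()`. The sketch below checks on hypothetical toy data that the two forms agree; the `name` argument assumes a reasonably recent dplyr version.

```r
library(dplyr)

# Hypothetical toy data: one row per flight
toy <- tibble(dow = c("Mon", "Mon", "Sat", "Tue", "Mon", "Sat"))

by_summarise <- toy %>% group_by(dow) %>% summarise(nobs = n())
by_count     <- toy %>% count(dow, name = "nobs")

by_summarise
by_count
```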
Could we generalize this positive relationship between the number of flights and the average delays, which we find across days of the week? To explore this, we can summarize the data into the average delays by date-hour and see if the busyness of a particular hour of a particular day is correlated with the mean delay. Let’s visualize these data using a scatter plot.
delay_day_hr <- flights %>%
group_by(year, month, day, hour) %>% # grouping by date-hour
summarise(
n_obs = n(),
dep_delay = mean(dep_delay, na.rm = TRUE)
) %>%
mutate(date = as.Date(paste(year, month, day), format="%Y %m %d"),
dow = my_dow(date)
) %>% ungroup() # it's a good practice to remove group_by() attribute
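To see why removing the grouping is good practice, consider this sketch on hypothetical toy data. `summarise()` drops only the last grouping variable, so later steps silently run within the remaining groups unless `ungroup()` is called.

```r
library(dplyr)

# Hypothetical toy data: two days, two hours each
toy <- tibble(
  day  = c(1, 1, 1, 2, 2, 2),
  hour = c(6, 6, 7, 6, 7, 7),
  x    = c(1, 2, 3, 4, 5, 6)
)

hourly <- toy %>%
  group_by(day, hour) %>%
  summarise(x = mean(x)) # result is still grouped by day

group_vars(hourly) # "day"

# On the grouped result, mean(x) is computed within each day ...
grouped <- hourly %>% mutate(overall = mean(x))
# ... whereas after ungroup() it is computed over all rows
ungrouped <- hourly %>% ungroup() %>% mutate(overall = mean(x))

grouped$overall   # 2.25 2.25 4.75 4.75
ungrouped$overall # 3.5 3.5 3.5 3.5
```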
plot_delay <- delay_day_hr %>%
filter(!is.na(dep_delay)) %>%
ggplot(aes(x = n_obs, y = dep_delay)) + geom_point(alpha = 0.1)
# plot of n_obs against the average dep_delay
# where each point represents a date-hour average
# "alpha = 0.1" controls the degree of transparency of points
plot_delay
Along the horizontal axis, we can see how the number of flights is distributed across date-hours. Some days are busy, and some hours busier still. It appears that there are two clusters in the number of flights, showing very slow date-hours (e.g., fewer than 10 flights flying out of New York City per hour) and normal date-hours (e.g., about 50 to 70 flights per hour). We might guess that the delays in the slow hours are caused by bad weather. On the other hand, for normal hours we may wonder if the excess delays are caused by congestion at the airports. To see this, let’s fit a curve that captures the relationship between n_obs and dep_delay. Our hypothesis is that the delay would become longer as the number of flights increases, which would result in an upward-sloping curve.
plot_delay +
geom_smooth() # geom_smooth() adds a layer of fitted curve(s)
## `geom_smooth()` using method = 'gam'
We cannot see any clear pattern. How about fitting a curve by day of the week?
plot_delay +
# additional aes() argument for applying different colors to the day of the week
geom_smooth(aes(color = dow), se=FALSE)
## `geom_smooth()` using method = 'gam'
Surprisingly, the delay does not seem to increase with the number of flights. Delays are longer on Thursdays and Fridays and shorter on Saturdays, but we see no evidence of flight congestion as a cause of delay.
Let’s take a closer look at the distribution of the delays. If it is not normally distributed, we may want to apply a transformation.
delay_day_hr %>% filter(!is.na(dep_delay)) %>%
ggplot(aes(x = dep_delay)) + geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the average delays is heavily skewed.
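To apply a logarithmic transformation, we first have to shift the variable so that all of its values are positive, because the logarithm is undefined for zero and negative values. Here we shift it so that its minimum value is one.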
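A minimal sketch of why the shift is needed, using made-up delay values (negative delays stand for early departures). Shifting so that the smallest value becomes 1 keeps every logarithm finite, at the cost that the result is the log of a shifted quantity rather than of the delay itself.

```r
x <- c(-18, 0, 5, 269) # made-up delays in minutes

suppressWarnings(log10(x)) # NaN for -18 and -Inf for 0

x_shifted <- x - min(x) + 1 # smallest value becomes 1
log10(x_shifted)            # all finite; the smallest maps to log10(1) = 0
```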
# define new column called "dep_delay_shifted"
delay_day_hr <- delay_day_hr %>%
mutate(dep_delay_shifted = dep_delay - min(dep_delay, na.rm = TRUE) + 1)
# check summary stats
delay_day_hr %>%
select(dep_delay, dep_delay_shifted) %>%
with(
apply(., 2, summary)
) %>% t() # transpose rows and columns
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## dep_delay -18 1.0543 6.571429 12.98602 15.4414 269 13
## dep_delay_shifted 1 20.0543 25.571429 31.98602 34.4414 288 13
Tips #3: The with(data, ...) function allows for referencing variable names inside the data frame (i.e., “var_name” instead of “data$var_name”). This is very useful when you work with functions that were created outside the tidyverse, while keeping your code consistent with the tidyverse syntax.
Tips #4: apply(data, num, fun) applies function “fun” to each item in dimension “num” (1 = rows, 2 = columns) of the data frame. Inside the pipe, “.” refers to the data passed in from the left-hand side, which here means all variables in the dataset.
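Both tips can be tried on a small made-up data frame (the names `df`, `a`, and `b` are hypothetical):

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# with(): refer to columns by name instead of df$a and df$b
with(df, mean(a) + mean(b)) # 22

# apply(): 1 works across rows, 2 works down columns
apply(df, 1, sum)  # one sum per row: 11 22 33
apply(df, 2, mean) # one mean per column: a = 2, b = 20
```

Note that `apply()` converts the data frame to a matrix first, so it is best suited to data frames whose columns all have the same type.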
Now let’s look at the transformed distribution:
# Under the log10 transformation, the distribution looks closer to a normal distribution.
delay_day_hr %>% filter(!is.na(dep_delay_shifted)) %>%
ggplot(aes(x = dep_delay_shifted)) +
scale_x_log10() +
geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# # Alternatively, one can apply the natural logarithm to transform a variable. Histogram shows no difference here.
# delay_day_hr %>% filter(!is.na(dep_delay_shifted)) %>%
# ggplot(aes(x = dep_delay_shifted)) +
# scale_x_continuous(trans = "log") +
# geom_histogram(color = "white")
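The reason the choice of base makes no visible difference is that logarithms in different bases differ only by a constant factor, so the transformed values are stretched uniformly and the shape of the histogram is unchanged. A quick check on made-up values:

```r
x <- c(2, 10, 100, 288) # made-up positive values

log(x) / log10(x) # the same ratio, log(10), for every x
log(10)           # about 2.302585
```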
The transformed distribution is much less skewed than the original. Now, let’s plot the relationship between delays and flights again.
delay_day_hr %>% filter(!is.na(dep_delay_shifted), dep_delay_shifted > 5) %>%
ggplot(aes(x = n_obs, y = dep_delay_shifted)) +
scale_y_log10() + # using transformation scale_y_log10()
geom_point(alpha = 0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
Again we do not see a pattern of more delays for busier hours. It seems that the airports in New York City manage the fluctuating number of flights without causing congestion.