Piecemeal R

2.5 More examples 2

Let’s keep going with exmaples using flights data.

Previously, we find that the congestion at the airports is unlikely the cause of delays. Then, what else may explain the patterns of delays? Recall that earlier we observe that some airlines have longer delays than others for NYC-MSP flights. Are some airlines responsible for the delays? Let’s take a look at the average delays by carrier.

stat_carrier <- flights %>% 
  group_by(carrier) %>%
  summarise(n_obs = n(),
            dep_delay = mean(dep_delay, na.rm = TRUE),
            arr_delay = mean(arr_delay, na.rm = TRUE)
            ) %>% 
  left_join(airlines, by="carrier") %>%
  arrange(desc(n_obs)) 

stat_carrier %>% kable(digit=2)

carrier	n_obs	dep_delay	arr_delay	name
UA	58665	12.11	3.56	United Air Lines Inc.
B6	54635	13.02	9.46	JetBlue Airways
EV	54173	19.96	15.80	ExpressJet Airlines Inc.
DL	48110	9.26	1.64	Delta Air Lines Inc.
AA	32729	8.59	0.36	American Airlines Inc.
MQ	26397	10.55	10.77	Envoy Air
US	20536	3.78	2.13	US Airways Inc.
9E	18460	16.73	7.38	Endeavor Air Inc.
WN	12275	17.71	9.65	Southwest Airlines Co.
VX	5162	12.87	1.76	Virgin America
FL	3260	18.73	20.12	AirTran Airways Corporation
AS	714	5.80	-9.93	Alaska Airlines Inc.
F9	685	20.22	21.92	Frontier Airlines Inc.
YV	601	19.00	15.56	Mesa Airlines Inc.
HA	342	4.90	-6.92	Hawaiian Airlines Inc.
OO	32	12.59	11.93	SkyWest Airlines Inc.

While we see some differences, the simple average of delays over various routes, days, and hours of flights may not be a good measure for the comparison across carriers. For example, some carriers may serve the routes and hours that tend to have more delays. Also, given that our dataset covers only the flights from New York City, the comparison may not be nationally representative since carriers use different airports around the country as their regional hubs.

For our purposes, instead of delays, let’s compare the average air time among carriers, for whihch we can account for flight’s destination and timing. The differences in air time may indicate some efficiency differences.

Let’s first check how air time relates to flight distance.

flights %>% 
  filter (month == 1, day == 1, !is.na(air_time)) %>%
  ggplot(aes(x = distance, y = air_time)) + 
  geom_point(alpha = 0.05)  +  
  geom_smooth()

## `geom_smooth()` using method = 'loess'

air_time and distance show a general linear relationship. We can better account for this relationship if we calculate the average air time for each flight destination from New York City.

First, consider using a simple linear regression model of air time to account for varying destinations. The idea is to estimate the regression-mean air time for each destination, remove the destination effects from the total variation of air time, and then compare that across carriers.

# make a copy of flights data
flights2 <- flights

flights2 <- flights2 %>%  # handle missing data as NA 
  mutate( idx_na1 = is.na(air_time) | is.na(dest) )

res <- flights2 %>%  # estimate a linear model and obtain residuals 
  with( 
   lm( air_time ~ 0 + as.factor(dest), subset=!idx_na1)  
        # lm() estimates a linear model. 
        # "y ~ x"" is the formula for regressing y on x. 
        # "y ~ 0 + x" removes the intercept from the model. 
        # as.factor() converts "dest" to a factor (categorical) class
        # which is used as a set of dummy variables in the regression.  
   ) %>% 
  residuals()

summary(res)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -57.574  -6.909  -0.953   0.000   5.862 144.317

# define a variable containing a vector of NA
# replace rows with idx_na1 == FALSE
flights2$res <- NA
flights2$res[!flights2$idx_na1] <- res 

flights2 %>% 
  group_by(carrier) %>%
  summarise(
    mean_res = mean(res, na.rm = TRUE), # mean residual by carrier 
    sd_res = sd(res, na.rm = TRUE)
    ) %>%
  left_join(stat_carrier, by="carrier") %>% 
  select(carrier, name, n_obs, dep_delay, arr_delay, mean_res, sd_res) %>% 
  arrange(-n_obs) %>% 
  kable(digit=2)

carrier	name	n_obs	dep_delay	arr_delay	mean_res	sd_res
UA	United Air Lines Inc.	58665	12.11	3.56	-0.87	14.59
B6	JetBlue Airways	54635	13.02	9.46	0.28	11.55
EV	ExpressJet Airlines Inc.	54173	19.96	15.80	-0.37	8.94
DL	Delta Air Lines Inc.	48110	9.26	1.64	-0.20	12.32
AA	American Airlines Inc.	32729	8.59	0.36	0.68	13.86
MQ	Envoy Air	26397	10.55	10.77	0.45	8.87
US	US Airways Inc.	20536	3.78	2.13	-0.42	9.43
9E	Endeavor Air Inc.	18460	16.73	7.38	0.84	8.76
WN	Southwest Airlines Co.	12275	17.71	9.65	0.16	12.55
VX	Virgin America	5162	12.87	1.76	3.26	17.58
FL	AirTran Airways Corporation	3260	18.73	20.12	1.16	8.75
AS	Alaska Airlines Inc.	714	5.80	-9.93	-2.13	16.17
F9	Frontier Airlines Inc.	685	20.22	21.92	3.12	15.16
YV	Mesa Airlines Inc.	601	19.00	15.56	-0.05	7.06
HA	Hawaiian Airlines Inc.	342	4.90	-6.92	5.64	20.69
OO	SkyWest Airlines Inc.	32	12.59	11.93	1.02	7.26

It appears to make sense that the average air time is longer for low-cost carriers such as Virgin America, Frontier Airlines, and Hawaiian Airlines. The differences across other carriers, on the other hand, appear small relative to standard errors. To get a sense of whether these differences have any statistical significance, let’s use t-test to compare the mean residual between United Airlines and American Airlines.

# t-test comparing UA vs AA for the mean air time 
flights2 %>%
  with({
    idx_UA <- carrier == "UA"
    idx_AA <- carrier == "AA"
    t.test(res[idx_UA], res[idx_AA])
    })

## 
##  Welch Two Sample t-test
## 
## data:  res[idx_UA] and res[idx_AA]
## t = -15.722, df = 68826, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.741133 -1.355142
## sample estimates:
##  mean of x  mean of y 
## -0.8689523  0.6791852

With a large number of observations, a seemingly small difference in the means often turns out to be statistically significant. Nonetheless, the average difference of about 1.5 minute air time per flight appears very small.

In fact, we can do this sort of pair-wise comparison all at once using a regression. Using carrier fixed effects in addition to destination fixed effects, we can directly compare the mean effects across carriers. We will set United Airlines to be a reference of the carrier fixed effects, so that the fixed effect for United Airlines is set to zero (i.e., omitted category), from which the fixed effects of all other airlines are estimated.

flights2$carrier <- relevel(factor(flights2$carrier), ref="UA")  
# reference level is United Airlines
flights2$carrier %>% table()

## .
##    UA    9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO 
## 58665 18460 32729   714 54635 48110 54173   685  3260   342 26397    32 
##    US    VX    WN    YV 
## 20536  5162 12275   601

flights2 %>% 
  with({
    n_carrier <- unique(carrier) %>% length()
    n_dest <- unique(dest) %>% length() 
    print(paste('There are', n_carrier, 'distinct carriers and', 
                n_dest,'distinct destinations in the data.' ))
  })

## [1] "There are 16 distinct carriers and 105 distinct destinations in the data."

With 16 carriers and 105 destinations minus 2 reference levels for carriers and destinations, the total of 119 coefficients will be estimated for the fixed effects.

f1 <- flights2 %>%
 with(
   lm( air_time  ~  as.factor(carrier) + as.factor(dest) )  
    # fixed effects for carriers and destinations
 )

tidy(f1)[1:20,] # show the first 20 coefficients

##                    term     estimate  std.error    statistic       p.value
## 1           (Intercept)  247.9884874 0.75069658  330.3445016  0.000000e+00
## 2  as.factor(carrier)9E    1.8015498 0.12723996   14.1586788  1.702649e-45
## 3  as.factor(carrier)AA    1.9326712 0.09731105   19.8607572  1.002388e-87
## 4  as.factor(carrier)AS   -1.9071536 0.49596319   -3.8453531  1.204017e-04
## 5  as.factor(carrier)B6    1.1808039 0.08495098   13.8998267  6.535025e-44
## 6  as.factor(carrier)DL    0.7531812 0.08722600    8.6348244  5.907432e-18
## 7  as.factor(carrier)EV    0.4174574 0.11044837    3.7796605  1.570702e-04
## 8  as.factor(carrier)F9    3.8891981 0.48090201    8.0872985  6.120836e-16
## 9  as.factor(carrier)FL    2.6434074 0.27600661    9.5773336  1.002386e-21
## 10 as.factor(carrier)HA   11.0125104 0.89821710   12.2604106  1.503557e-34
## 11 as.factor(carrier)MQ    1.4592669 0.11892133   12.2708590  1.321669e-34
## 12 as.factor(carrier)OO    1.8091432 2.21222472    0.8177936  4.134757e-01
## 13 as.factor(carrier)US    0.1319337 0.13826299    0.9542230  3.399715e-01
## 14 as.factor(carrier)VX    4.5298528 0.18441295   24.5636378 4.086448e-133
## 15 as.factor(carrier)WN    1.2226161 0.17520980    6.9780125  2.999500e-12
## 16 as.factor(carrier)YV    0.5167461 0.52737831    0.9798395  3.271661e-01
## 17   as.factor(dest)ACK -207.1011095 1.04478912 -198.2228803  0.000000e+00
## 18   as.factor(dest)ALB -216.6188634 0.95130943 -227.7059972  0.000000e+00
## 19   as.factor(dest)ANC  165.1365126 4.26930687   38.6799351  0.000000e+00
## 20   as.factor(dest)ATL -136.1282095 0.75641976 -179.9638460  0.000000e+00

# a function to clean up the coefficient table above  
clean_lm_rlt <- function(f) {
  # keep only rows for which column "term" contains "carrier"  e.g., rows 2 to 16 above
  rlt <- tidy(f) %>% filter(grepl("carrier",term)) 

  # create column named carrier 
  rlt <- rlt %>% mutate(carrier = gsub('as.factor\\(carrier\\)','', term)) 

  # drop column term
  rlt <- rlt %>% select(-term)
  
  # add columns of carrier, name, and n_obs from the stat_carrier data frame
  stat_carrier %>%  
    select(carrier, name, n_obs) %>%
    left_join(rlt, by="carrier") 
}

clean_lm_rlt(f1) %>% kable(digit=2)

carrier	name	n_obs	estimate	std.error	statistic	p.value
UA	United Air Lines Inc.	58665	NA	NA	NA	NA
B6	JetBlue Airways	54635	1.18	0.08	13.90	0.00
EV	ExpressJet Airlines Inc.	54173	0.42	0.11	3.78	0.00
DL	Delta Air Lines Inc.	48110	0.75	0.09	8.63	0.00
AA	American Airlines Inc.	32729	1.93	0.10	19.86	0.00
MQ	Envoy Air	26397	1.46	0.12	12.27	0.00
US	US Airways Inc.	20536	0.13	0.14	0.95	0.34
9E	Endeavor Air Inc.	18460	1.80	0.13	14.16	0.00
WN	Southwest Airlines Co.	12275	1.22	0.18	6.98	0.00
VX	Virgin America	5162	4.53	0.18	24.56	0.00
FL	AirTran Airways Corporation	3260	2.64	0.28	9.58	0.00
AS	Alaska Airlines Inc.	714	-1.91	0.50	-3.85	0.00
F9	Frontier Airlines Inc.	685	3.89	0.48	8.09	0.00
YV	Mesa Airlines Inc.	601	0.52	0.53	0.98	0.33
HA	Hawaiian Airlines Inc.	342	11.01	0.90	12.26	0.00
OO	SkyWest Airlines Inc.	32	1.81	2.21	0.82	0.41

The “estimate” column shows the mean difference in air time comapred to United Airlines, accounting for the flight destination. The estimate tends to be more precise (i.e., smaller standard errors) for carriers with a larger number of observations. This time, we find that Virgin America, Air Tran, Frontier Airlines, and Hawaiian Airlines tend to show particularly longer air times than United Airlines.

Next, let’s take a step further to account for flight timing as well. We can do this by adding fixed effects for flight dates and hours.

flights2 <- flights2 %>%
  mutate( date_id = month*100 + day )

flights2$date_id %>% unique() %>% length()

## [1] 365

f2 <- flights2 %>%
 with(
   lm( air_time ~  as.factor(carrier) + as.factor(dest) +
        + as.factor(date_id) + as.factor(hour) )
 )

lm_rlt2 <- clean_lm_rlt(f2)
lm_rlt2 %>% kable(digit=2)

carrier	name	n_obs	estimate	std.error	statistic	p.value
UA	United Air Lines Inc.	58665	NA	NA	NA	NA
B6	JetBlue Airways	54635	1.60	0.07	22.50	0.00
EV	ExpressJet Airlines Inc.	54173	0.61	0.09	6.67	0.00
DL	Delta Air Lines Inc.	48110	0.95	0.07	13.03	0.00
AA	American Airlines Inc.	32729	1.84	0.08	22.81	0.00
MQ	Envoy Air	26397	1.45	0.10	14.70	0.00
US	US Airways Inc.	20536	0.17	0.11	1.51	0.13
9E	Endeavor Air Inc.	18460	1.57	0.11	14.72	0.00
WN	Southwest Airlines Co.	12275	1.14	0.15	7.82	0.00
VX	Virgin America	5162	4.85	0.15	31.57	0.00
FL	AirTran Airways Corporation	3260	2.19	0.23	9.58	0.00
AS	Alaska Airlines Inc.	714	-2.55	0.41	-6.21	0.00
F9	Frontier Airlines Inc.	685	3.31	0.40	8.29	0.00
YV	Mesa Airlines Inc.	601	0.32	0.44	0.73	0.46
HA	Hawaiian Airlines Inc.	342	11.79	0.75	15.80	0.00
OO	SkyWest Airlines Inc.	32	7.63	1.83	4.17	0.00

lm_rlt2 %>% filter(carrier!='UA') %>%
  ggplot(aes(x = carrier, y = estimate)) + geom_col() +
  labs(title = "Mean Air Time Compared to United Airlines")

The results are similar to the previous linear mode except that this time SkyWest Airlines shows much longer air time.

Our final model is a check for the robustness of the above results. Let’s replace the date and hour fixed effects in the previous model with date-hour fixed effects (i.e., the interaction between date and hour). We could add such fixed effects using time_hour variable defined above. However, that would mean adding nearly 7,000 dummy variables to our linear regression, which is computationally intensive.

To work around this issue, we approximate this estimation by pre-processing the dependent variable. Specifically, we calculate the average air time for each combination of time_hour and dest and define a new dependent variable by subtracting this average value from the original air time variable (i.e., the new variable is centered at zero-mean for each combination of time_hour and dest). Then, we estimate a linear model with carrier and destination fixed effects.

## Adding time_hour fixed effects is computationally intensive 
# f1 <- flights %>%
#  with(
#    lm( air_time  ~  as.factor(carrier) + as.factor(dest) + as.factor(time_hour))   
#  )

unique(flights2$time_hour) %>% length()  # 6,936 unique time_hour

## [1] 6936

flights2 <- flights2 %>% 
  group_by(dest, time_hour) %>%   
  mutate(
    air_time_centered = air_time - mean(air_time, na.rm=TRUE) 
  )

f3 <- flights2 %>%
 with(
   lm( air_time_centered  ~  as.factor(carrier) + as.factor(dest) )
 )
  
lm_rlt3 <- clean_lm_rlt(f3)
lm_rlt3 %>% kable(digit=2) # Note: standard errors, t-stat, and p-val are incorrect

carrier	name	n_obs	estimate	std.error	statistic	p.value
UA	United Air Lines Inc.	58665	NA	NA	NA	NA
B6	JetBlue Airways	54635	0.82	0.03	32.24	0.00
EV	ExpressJet Airlines Inc.	54173	0.88	0.03	26.50	0.00
DL	Delta Air Lines Inc.	48110	0.52	0.03	19.85	0.00
AA	American Airlines Inc.	32729	1.20	0.03	41.06	0.00
MQ	Envoy Air	26397	1.00	0.04	27.84	0.00
US	US Airways Inc.	20536	-0.09	0.04	-2.21	0.03
9E	Endeavor Air Inc.	18460	1.27	0.04	33.07	0.00
WN	Southwest Airlines Co.	12275	1.30	0.05	24.70	0.00
VX	Virgin America	5162	3.47	0.06	62.59	0.00
FL	AirTran Airways Corporation	3260	1.78	0.08	21.48	0.00
AS	Alaska Airlines Inc.	714	-2.86	0.15	-19.15	0.00
F9	Frontier Airlines Inc.	685	1.99	0.14	13.73	0.00
YV	Mesa Airlines Inc.	601	0.78	0.16	4.89	0.00
HA	Hawaiian Airlines Inc.	342	1.34	0.27	4.96	0.00
OO	SkyWest Airlines Inc.	32	3.50	0.67	5.26	0.00

lm_rlt3 %>% filter(carrier!='UA') %>%
  ggplot(aes(x = carrier, y = estimate)) + geom_col() +
  labs(title = "Mean Air Time Compared to United Airlines: Robustness Check")

The point estimates should be approximately what we would obtain if we regress air_time on the fixed effects of carrier, dest, and time_hour. However, the standard errors are not correctly displayed in the table because the centered variable has a smaller total variation compared to the original air_time variable. (Correct standard errors can be obtained, for example, through a bootstrapping technique.)

Overall, we see again a tendency that lower-cost carriers like Sky West Airlines, Virgin America, Frontier Airlines, and Air Tran show particularly longer air time than United Airlines. Jet Blue Airways, another low-cost carrier, shows a less obvious difference from United Airlines, possibly suggesting that their operation focused on the East Cost is efficient for the flights departing from New York City. Hawaiian Airlines and Alaskan Airlines appear to be somewhat different from other carriers perhaps because they are more specialized in particular flight destinations compared to their rivals. In particular, the flights to Hawaii may have distinct delay patterns that are concentrated on peak vacation seasons.