2.1 Materials
R evolves continuously through user-contributed packages, each of which is a bundle of user-developed programs. Recent packages such as tidyr, dplyr, and ggplot2 (often referred to as tidyverse tools) have greatly streamlined the code for data manipulation and analysis, which is why this site uses them as the starting point for learning R. The tidyverse syntax gives you an intuitive and powerful language for wrangling, visualizing, and analyzing data.
Following dplyr’s documentation, let’s start with a sample dataset of airplane departures and arrivals. It contains information on about 337,000 flights that departed from New York City in 2013 (source: Bureau of Transportation Statistics).
In the R console (i.e., the bottom-left pane in RStudio), type install.packages("nycflights13") and hit enter. Then load the package into the current computing environment (called an R session) via library(nycflights13). Its built-in data frames will be added to your R session.
Generally, R packages are installed locally on your computer on an as-needed basis. To install several more packages that we will use, copy the following code and execute it in your R console.
# Since we're just starting, don't worry about understanding the code here
# The "#" symbol adds comments that are helpful to humans but are ignored by R
required_pkgs <- c("nycflights13", "dplyr", "ggplot2", "lubridate", "knitr", "tidyr", "broom")
# creating a new object "required_pkgs" containing strings "nycflights13", "dplyr",..
# c(): concatenates string names here
# <-: an assignment operator from right to left
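new_pkgs <- required_pkgs[!(required_pkgs %in% installed.packages()[, "Package"])]
# checking whether "required_pkgs" are already installed
# installed.packages() returns a matrix; its "Package" column holds the installed package names
# [ ]: extraction by logical TRUE or FALSE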
# %in%: checks whether items on the left are members of the items on the right.
# !: a negation operator
if (length(new_pkgs)) {
  install.packages(new_pkgs, repos = "http://cran.rstudio.com")
}
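To see how the operators in the installation code fit together, here is a minimal sketch on toy vectors. The object names (wanted, have, to_get) are made up for illustration and do not appear in the code above.

```r
# Toy vectors; "wanted" and "have" are hypothetical names used only for illustration
wanted <- c("apple", "banana", "cherry")
have   <- c("banana", "durian")

is_member  <- wanted %in% have   # is each element of "wanted" found in "have"?
not_member <- !is_member         # flip TRUE and FALSE
to_get     <- wanted[not_member] # keep only the elements flagged TRUE

print(is_member)   # FALSE TRUE FALSE
print(to_get)      # "apple" "cherry"

# if() treats a non-zero length as TRUE, so this runs only when something is missing
if (length(to_get)) {
  message("Still needed: ", paste(to_get, collapse = ", "))
}
```

The package installation code follows the same pattern, with package names in place of fruit names.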
Once packages are downloaded and installed on your computer, they are available to load as libraries. In each R session, we load only the libraries we need (instead of every installed library). Here we load the following:
library(dplyr) # for data manipulation
library(ggplot2) # for figures
library(lubridate) # for date manipulation
library(nycflights13) # sample data of NYC flights
library(knitr) # for table formatting
library(tidyr) # for reshaping and tidying data
library(broom) # for converting model output into tidy tables
Let’s see the data.
class(flights) # shows the class attribute
dim(flights) # obtains the dimensions (number of rows and columns)
## [1] "tbl_df" "tbl" "data.frame"
## [1] 336776 19
head(flights) # displays the first six rows
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
dim() returns the dimensions of a data frame (the number of rows and columns), and head() returns its first six rows. When printed, a tibble shows only the columns that fit on the screen and lists the remaining variables underneath. The flights dataset contains information such as flight dates and actual departure and arrival times. The variables are arranged in columns, and each row is an observation of a single flight.
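Because the printed output hides some of the 19 columns, it can help to list them all. The following is a small sketch of a few ways to do that, assuming dplyr and nycflights13 are installed as described earlier; the choice of n = 3 is arbitrary.

```r
library(dplyr)         # provides glimpse()
library(nycflights13)  # provides the flights data frame

names(flights)         # a character vector of all column names
glimpse(flights)       # one line per column: name, type, and the first few values
head(flights, n = 3)   # the n argument controls how many rows head() returns
```

glimpse() is convenient for wide data frames because it turns the display sideways, so every variable is visible regardless of screen width.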
In R, we refer to a dataset as a data frame, which is a class of R object. The data frame class is more general than the matrix class in that its columns can hold different types of data (numeric, character, factor, etc.), whereas all elements of a matrix must share a single type.
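The difference between a matrix and a data frame can be illustrated with a short base R sketch. The values and object names (m, df) are made up for illustration.

```r
# A matrix holds a single type: combining integers and strings coerces everything to character
m <- cbind(id = 1:3, carrier = c("UA", "AA", "B6"))
typeof(m)          # "character"

# A data frame keeps a separate type for each column
df <- data.frame(id = 1:3,
                 carrier = c("UA", "AA", "B6"),
                 stringsAsFactors = FALSE)
sapply(df, class)  # id is "integer", carrier is "character"
```

This is why data frames are the usual container for datasets like flights, which mix integer, numeric, character, and date-time columns.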