Chapter 6 Working with Data
Bringing data into R, exporting data from R in a form readable by other software, cleaning and reshaping data, and other data manipulation tasks are an important and often overlooked component of data science. The book Spector (2008), while a few years old, is still an excellent reference for data-related issues. The manual available online at http://catalog.lib.msu.edu/record=b7254984~S39a is an up-to-date (and free) reference on importing a wide variety of datasets into R and on exporting data in various forms.
6.1 Reading Data into R
Data come in a dizzying variety of forms. They might be in a proprietary format such as an .xlsx Excel file, a .sav SPSS file, or a .mtw Minitab file. They might be structured using a relational model comprising many tables that need to be connected via key-value pairs. They might be in a data-interchange format such as JSON (JavaScript Object Notation), or a markup language such as XML (Extensible Markup Language), perhaps with a specialized standard such as EML (Ecological Metadata Language) for describing ecological information. Both XML and EML are common metadata formats (i.e., data that provide information about other data). Fortunately, many datasets are (or can be) saved as plain text files, and most software can both read and write such files, so our initial focus will be on reading plain text files into R and saving data from R in plain text format. RStudio provides a handy data import cheat sheet for many of the read functions detailed in this section.
The foreign R package provides functions to read data saved in some of the proprietary formats directly into R, which is sometimes unavoidable, but if possible it is better to save data from the other package as plain text and then read that plain text file into R. Methods for reading web-based data sets into R are discussed in Chapter 10.
The function read.table() and its offshoots, such as read.csv(), are used to read rectangular data from a text file. For example, the file BrainAndBody.csv contains data on the brain weight, body weight, and name of some terrestrial animals. Here are the first few lines of that file:
body,brain,name
1.35,8.1,Mountain beaver
465,423,Cow
36.33,119.5,Grey wolf
27.66,115,Goat
1.04,5.5,Guinea pig
As is evident, the first line of the file contains the names of the three variables, separated (delimited) by commas. Each subsequent line contains the body weight, brain weight, and name of a specific terrestrial animal.
This file is accessible at the URL https://www.finley-lab.com/files/data/BrainAndBody.csv. The read.table() function is used to read these data into an R data frame.
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.csv"
> BrainBody <- read.table(file = u.bb, header = TRUE, sep = ",",
+ stringsAsFactors = FALSE)
> head(BrainBody)
body brain name
1 1.35 8.1 Mountain beaver
2 465.00 423.0 Cow
3 36.33 119.5 Grey wolf
4 27.66 115.0 Goat
5 1.04 5.5 Guinea pig
6 11700.00 50.0 Dipliodocus
The arguments used in this call to read.table() include:
- file = u.bb, which tells R the location of the file. In this case the string https://www.finley-lab.com/files/data/BrainAndBody.csv giving the location is rather long, so it was first assigned to the object u.bb.
- header = TRUE, which tells R that the first line of the file gives the names of the variables.
- sep = ",", which tells R that a comma separates the fields in the file.
- stringsAsFactors = FALSE, which tells R not to convert character vectors to factors.
The function read.csv() is the same as read.table() except that its default separator is a comma, whereas the default separator for read.table() is whitespace.
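As a quick check of this equivalence, we can read the same comma-separated text both ways using the text argument (the data values here are made up):

```r
# read.csv() is read.table() with header = TRUE and sep = "," as defaults,
# so these two calls produce identical data frames.
csv.text <- "x,y\n1,a\n2,b"
d1 <- read.csv(text = csv.text, stringsAsFactors = FALSE)
d2 <- read.table(text = csv.text, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
identical(d1, d2)
```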
The file BrainAndBody.tsv contains the same data, except a tab is used in place of a comma to separate fields. The only change needed to read in these data is in the sep argument (and of course the file argument, since the data are stored in a different file):
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.tsv"
> BrainBody2 <- read.table(file = u.bb, header = TRUE, sep = "\t",
+ stringsAsFactors = FALSE)
> head(BrainBody2)
body brain name
1 1.35 8.1 Mountain beaver
2 465.00 423.0 Cow
3 36.33 119.5 Grey wolf
4 27.66 115.0 Goat
5 1.04 5.5 Guinea pig
6 11700.00 50.0 Dipliodocus
File extensions, e.g., .csv or .tsv, are naming conventions only and are there to remind us how the columns are separated. In other words, they have no influence on R's file read functions.
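To see this, we can write comma-separated values to a file whose name ends in .txt and read it back with read.csv() (the file contents here are made up):

```r
# The extension is only a convention; read.csv() cares about the content
# of the file, not its name.
f <- tempfile(fileext = ".txt")
writeLines(c("x,y", "1,a", "2,b"), f)
d <- read.csv(f, stringsAsFactors = FALSE)
d
```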
A third file, BrainAndBody.txt, contains the same data, but also contains a few lines of explanatory text above the names of the variables. It also uses whitespace rather than a comma or a tab as a separator. Here are the first several lines of the file.
This file contains data
on brain and body
weights of several terrestrial animals
"body" "brain" "name"
1.35 8.1 "Mountain beaver"
465 423 "Cow"
36.33 119.5 "Grey wolf"
27.66 115 "Goat"
1.04 5.5 "Guinea pig"
11700 50 "Dipliodocus"
2547 4603 "Asian elephant"
Notice that in this file the values of name are put inside quotation marks. This is necessary because otherwise R would (reasonably) assume the first data line contained the values of four variables (1.35, 8.1, Mountain, and beaver), while in reality only three values are intended, with Mountain beaver being the third.
To read in this file we need to tell R to skip the explanatory lines at the top of the file and to use whitespace as the separator. The skip argument handles the first, and the sep argument the second. First let's see what happens if we don't use the skip argument.
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
> BrainBody3 <- read.table(u.bb, header = TRUE, sep = " ",
+ stringsAsFactors = FALSE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 5 elements
R assumed the first line of the file contained the variable names, since header = TRUE was specified, and counted four: This, file, contains, and data. So in the first line of actual data, R expected four columns containing data plus possibly a fifth column containing row names for the data set, and complained that "line 1 did not have 5 elements." The error message is somewhat mysterious, since it starts with "Error in scan." This happens because read.table() actually uses a more basic R function called scan() to do the work.
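A direct call to scan() illustrates the lower-level behavior: it reads a stream of values of a single type, and read.table() assembles its columns on top of it (the values here are made up):

```r
# scan() reads whitespace-separated values of a single type; quiet = TRUE
# suppresses the "Read 3 items" message.
vals <- scan(text = "1.5 2.5 3.5", what = numeric(), quiet = TRUE)
vals
```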
Here’s how to read in the file correctly.
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
> BrainBody3 <- read.table(u.bb, header = TRUE, sep = " ",
+ stringsAsFactors = FALSE, skip = 4)
> BrainBody3[1:10,]
body brain name
1 1.35 8.1 Mountain beaver
2 465.00 423.0 Cow
3 36.33 119.5 Grey wolf
4 27.66 115.0 Goat
5 1.04 5.5 Guinea pig
6 11700.00 50.0 Dipliodocus
7 2547.00 4603.0 Asian elephant
8 187.10 419.0 Donkey
9 521.00 655.0 Horse
10 10.00 115.0 Potar monkey
6.2 Reading Data with Missing Observations
Missing data are represented in many ways. Sometimes a missing data point is just that, i.e., the place where it should be in the file is blank. Other times specific numbers such as \(-9999\) or specific symbols are used. The read.table() function has an argument na.strings that allows the user to specify how missing data are indicated in the source file.
The site http://www.wunderground.com/history/ makes weather data available for locations around the world, with dates going back to 1945. The file WeatherKLAN2014.csv contains weather data for Lansing, Michigan for the year 2014. Here are the first few lines of that file:
EST,Max TemperatureF,Min TemperatureF, Events
1/1/14,14,9,Snow
1/2/14,13,-3,Snow
1/3/14,13,-11,Snow
1/4/14,31,13,Snow
1/5/14,29,16,Fog-Snow
1/6/14,16,-12,Fog-Snow
1/7/14,2,-13,Snow
1/8/14,17,-1,Snow
1/9/14,21,2,Snow
1/10/14,39,21,Fog-Rain-Snow
1/11/14,41,32,Fog-Rain
1/12/14,39,31,
Look at the last line, and notice that instead of an Event such as Snow or Fog-Snow there is nothing after the comma. This observation is missing, but rather than using an explicit code such as NA, the site just leaves that entry blank. To read these data into R we supply the argument na.strings = "", which tells R that the file indicates missing data by leaving the appropriate entry blank.
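A small self-contained illustration (inline text with made-up values): na.strings can map a blank field, or any other code such as -9999, to NA.

```r
# Both the empty string and the code -9999 are converted to NA on input.
d <- read.csv(text = "x,y\n1,\n-9999,5",
              na.strings = c("", "-9999"))
d
```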
> u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014.csv"
> WeatherKLAN2014 <- read.csv(u.weather, header = TRUE,
+ stringsAsFactors = FALSE, na.strings = "")
> WeatherKLAN2014[1:15,]
EST Max.TemperatureF Min.TemperatureF
1 1/1/14 14 9
2 1/2/14 13 -3
3 1/3/14 13 -11
4 1/4/14 31 13
5 1/5/14 29 16
6 1/6/14 16 -12
7 1/7/14 2 -13
8 1/8/14 17 -1
9 1/9/14 21 2
10 1/10/14 39 21
11 1/11/14 41 32
12 1/12/14 39 31
13 1/13/14 44 34
14 1/14/14 37 26
15 1/15/14 27 18
Events
1 Snow
2 Snow
3 Snow
4 Snow
5 Fog-Snow
6 Fog-Snow
7 Snow
8 Snow
9 Snow
10 Fog-Rain-Snow
11 Fog-Rain
12 <NA>
13 Rain
14 Rain-Snow
15 Snow
6.3 Summarizing Data Frames
Some common data tasks include variable summaries such as means or standard deviations, transforming an existing variable, and creating new variables. As with many tasks, there are several ways to accomplish each of these.
6.3.1 Column (and Row) Summaries
The file WeatherKLAN2014Full.csv contains a more complete set of weather variables than WeatherKLAN2014.csv, from the same source, http://www.wunderground.com/history.
> u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
> WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE,
+ stringsAsFactors = FALSE,
+ na.strings = "")
> names(WeatherKLAN2014Full)
[1] "EST"
[2] "Max.TemperatureF"
[3] "Mean.TemperatureF"
[4] "Min.TemperatureF"
[5] "Max.Dew.PointF"
[6] "MeanDew.PointF"
[7] "Min.DewpointF"
[8] "Max.Humidity"
[9] "Mean.Humidity"
[10] "Min.Humidity"
[11] "Max.Sea.Level.PressureIn"
[12] "Mean.Sea.Level.PressureIn"
[13] "Min.Sea.Level.PressureIn"
[14] "Max.VisibilityMiles"
[15] "Mean.VisibilityMiles"
[16] "Min.VisibilityMiles"
[17] "Max.Wind.SpeedMPH"
[18] "Mean.Wind.SpeedMPH"
[19] "Max.Gust.SpeedMPH"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
How can we compute the mean for each variable? One possibility is to do this a variable at a time:
> mean(WeatherKLAN2014Full$Mean.TemperatureF)
[1] 45.78
> mean(WeatherKLAN2014Full$Min.TemperatureF)
[1] 36.25
> mean(WeatherKLAN2014Full$Max.TemperatureF)
[1] 54.84
This is pretty inefficient. Fortunately there is a colMeans() function that computes the mean of each column (or a specified set of columns) in a data frame. Some columns in the current data frame are not numeric, and obviously we don't want to ask R to compute means for those columns. We use str() to investigate.
> str(WeatherKLAN2014Full)
'data.frame': 365 obs. of 23 variables:
$ EST : chr "2014-1-1" "2014-1-2" "2014-1-3" "2014-1-4" ...
$ Max.TemperatureF : int 14 13 13 31 29 16 2 17 21 39 ...
$ Mean.TemperatureF : int 12 5 1 22 23 2 -5 8 12 30 ...
$ Min.TemperatureF : int 9 -3 -11 13 16 -12 -13 -1 2 21 ...
$ Max.Dew.PointF : int 9 7 2 27 27 11 -6 7 18 37 ...
$ MeanDew.PointF : int 4 4 -5 18 21 -4 -13 1 8 28 ...
$ Min.DewpointF : int 0 -8 -14 3 11 -18 -18 -6 0 19 ...
$ Max.Humidity : int 88 76 83 92 92 80 78 88 88 100 ...
$ Mean.Humidity : int 76 70 68 73 86 73 72 78 75 92 ...
$ Min.Humidity : int 63 63 53 53 80 65 65 67 62 84 ...
$ Max.Sea.Level.PressureIn : num 30.4 30.4 30.5 30.1 30 ...
$ Mean.Sea.Level.PressureIn: num 30.3 30.2 30.4 30 29.9 ...
$ Min.Sea.Level.PressureIn : num 30.2 30.1 30.1 29.9 29.7 ...
$ Max.VisibilityMiles : int 10 9 10 10 4 10 10 10 10 9 ...
$ Mean.VisibilityMiles : int 4 4 10 6 1 2 6 10 7 3 ...
$ Min.VisibilityMiles : int 1 0 5 1 0 0 1 8 2 0 ...
$ Max.Wind.SpeedMPH : int 17 22 23 28 22 31 25 15 14 17 ...
$ Mean.Wind.SpeedMPH : int 9 13 10 15 11 18 15 7 6 10 ...
$ Max.Gust.SpeedMPH : int 22 30 32 36 30 40 31 18 17 22 ...
$ PrecipitationIn : chr "0.08" "0.01" "0.00" "0.12" ...
$ CloudCover : int 8 7 1 5 8 8 6 5 7 8 ...
$ Events : chr "Snow" "Snow" "Snow" "Snow" ...
$ WindDirDegrees : int 43 24 205 203 9 262 220 236 147 160 ...
It isn't surprising that EST and Events are not numeric, but it is surprising that PrecipitationIn, which measures precipitation in inches, is character rather than numeric. Let's investigate further.
> WeatherKLAN2014Full$PrecipitationIn[1:50]
[1] "0.08" "0.01" "0.00" "0.12" "0.78" "0.07" "T"
[8] "T" "0.01" "0.39" "0.16" "0.00" "0.00" "0.01"
[15] "T" "0.08" "T" "T" "T" "0.01" "0.00"
[22] "0.05" "T" "T" "0.07" "0.23" "0.04" "T"
[29] "T" "0.03" "T" "0.37" "T" "0.00" "T"
[36] "0.27" "0.01" "T" "0.04" "0.03" "T" "0.00"
[43] "0.00" "T" "T" "0.00" "0.02" "0.15" "0.08"
[50] "0.01"
Now it's clearer. The original data file included T in the precipitation column to represent a "trace" of precipitation, i.e., an amount greater than \(0\) but less than \(0.01\) inches. One possibility would be to set all these values to "0" and then convert the column to numeric. For now we will just leave the PrecipitationIn column out of the columns for which we request the mean.
> colMeans(WeatherKLAN2014Full[, c(2:19, 21, 23)])
Max.TemperatureF Mean.TemperatureF
54.838 45.781
Min.TemperatureF Max.Dew.PointF
36.255 41.800
MeanDew.PointF Min.DewpointF
36.395 30.156
Max.Humidity Mean.Humidity
88.082 70.392
Min.Humidity Max.Sea.Level.PressureIn
52.200 30.130
Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
30.015 29.904
Max.VisibilityMiles Mean.VisibilityMiles
9.896 8.249
Min.VisibilityMiles Max.Wind.SpeedMPH
4.825 19.101
Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
8.679 NA
CloudCover WindDirDegrees
4.367 205.000
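The conversion mentioned above, replacing the trace code T with "0" before converting to numeric, might be sketched as follows (the sample values are made up):

```r
# Replace the "T" (trace) code with "0", then convert the character
# vector to numeric.
precip <- c("0.08", "T", "0.00", "0.12", "T")
precip[precip == "T"] <- "0"
precip <- as.numeric(precip)
precip
```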
6.3.2 The apply() Function
R also has functions rowMeans(), colSums(), and rowSums(). But what if we want to compute the median or standard deviation of columns of data, or some other summary statistic? For this the apply() function can be used. This function applies a user-chosen function to either the rows or columns of a data frame. The arguments are:
- X: the data frame of interest
- MARGIN: specifying either rows (MARGIN = 1) or columns (MARGIN = 2)
- FUN: the function to be applied.
For example, to compute the standard deviation of each numeric column:
> apply(X = WeatherKLAN2014Full[, c(2:19, 21, 23)], MARGIN = 2, FUN = sd)
Max.TemperatureF Mean.TemperatureF
22.2130 20.9729
Min.TemperatureF Max.Dew.PointF
20.2597 19.5167
MeanDew.PointF Min.DewpointF
20.0311 20.8511
Max.Humidity Mean.Humidity
8.1910 9.3660
Min.Humidity Max.Sea.Level.PressureIn
13.9462 0.2032
Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
0.2159 0.2360
Max.VisibilityMiles Mean.VisibilityMiles
0.5790 2.1059
Min.VisibilityMiles Max.Wind.SpeedMPH
3.8168 6.4831
Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
3.8863 NA
CloudCover WindDirDegrees
2.7798 90.0673
As with any R function, the arguments don't need to be named as long as they are specified in the correct order, so
> apply(WeatherKLAN2014Full[, c(2:19, 21, 23)], 2, sd)
has the same result.
6.3.3 Practice Problems
- Notice the output value of NA for the column Max.Gust.SpeedMPH. Why does this happen? Figure out a way to make the apply() function return a numeric value for this column.
- The apply() family of functions is extremely important in R, so it gets two Practice Problems :) Use the apply() function to compute the median values for all numeric columns in the iris data set.
6.3.4 Saving Typing Using with()
Consider calculating the mean of the maximum temperature values for those days where the cloud cover is less than 4 and when the maximum humidity is over 85. We can do this using subsetting.
> mean(WeatherKLAN2014Full$Max.TemperatureF[
+ WeatherKLAN2014Full$CloudCover < 4 &
+ WeatherKLAN2014Full$Max.Humidity > 85])
[1] 69.39
While this works, it requires a lot of typing, since each time we refer to a variable in the data set we must preface its name with WeatherKLAN2014Full$. The with() function tells R that we are working with a particular data frame, so we don't need to keep typing the data frame's name.
> with(WeatherKLAN2014Full,
+      mean(Max.TemperatureF[CloudCover < 4 & Max.Humidity > 85]))
[1] 69.39
6.4 Transforming a Data Frame
Variables are often added to, removed from, changed in, or rearranged within a data frame. The subsetting features of R make this reasonably easy. We will investigate this in the context of the gapminder data frame. If the gapminder package is not yet installed, use install.packages("gapminder") to install it locally.
> library(gapminder)
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
6.4.1 Adding Variables
The data frame contains per capita GDP and population, and it might be interesting to create a variable that gives the total GDP by multiplying these two variables. (If we were interested in an accurate value for the total GDP we would probably be better off getting this information directly, since the per capita GDP values in the data frame are likely rounded substantially.)
> gapminder$TotalGDP <- gapminder$gdpPercap * gapminder$pop
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 7 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
Analogous to the with() function, there is a function within() that can simplify the syntax. Whereas with() does not change the data frame, within() can. Note, below I first remove the altered gapminder data frame using rm() and then bring a clean copy back in by reloading the gapminder package.
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
> gapminder <- within(gapminder, TotalGDP <- gdpPercap * pop)
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 7 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
A nice feature of within() is its ability to add more than one variable at a time to a data frame. In this case the two or more expressions creating new variables must be enclosed in braces.
> gapminder <- within(gapminder, {TotalGDP <- gdpPercap * pop
+ lifeExpMonths <- lifeExp * 12})
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap : num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
$ lifeExpMonths: num 346 364 384 408 433 ...
6.4.2 Removing Variables
After reflection we may realize the new variables we added to the gapminder data frame are not useful and should be removed.
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap : num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
$ lifeExpMonths: num 346 364 384 408 433 ...
> gapminder <- gapminder[1:6]
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
The same result could be obtained via gapminder <- gapminder[, 1:6]. The first method uses the fact that a data frame is also a list, and uses list subsetting methods. It is slightly preferable, since even if only one variable is retained, the object will still be a data frame, while the second method can return a vector in that case. Note this difference in the resulting x variable below (again, this behavior can be frustrating at times if it is not anticipated).
> a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"),
+                 z = c(1, 1.5, 2), stringsAsFactors = FALSE)
> a
x y z
1 1 dog 1.0
2 2 cat 1.5
3 3 pig 2.0
> a["x"]
x
1 1
2 2
3 3
> a
x y z
1 1 dog 1.0
2 2 cat 1.5
3 3 pig 2.0
> a[, "x"]
[1] 1 2 3
One can also use a negative sign in front of the variable number(s). For example, a[-(2:3)] would drop the second and third columns of a. Some care is needed when removing variables using the negative sign. An alternative approach is to set the variables you'd like to remove to NULL. For example, a[c("y","z")] <- NULL and a[,2:3] <- NULL produce the same result as above.
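A minimal sketch of the NULL approach on a small toy data frame:

```r
# Assigning NULL removes the named columns; the result is still a
# data frame containing the remaining column(s).
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"), z = c(1, 1.5, 2),
                stringsAsFactors = FALSE)
a[c("y", "z")] <- NULL
str(a)
```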
6.4.3 Practice Problem
What happens if you write a[-2:3] instead of a[-(2:3)]? Why are the parentheses important here?
6.4.4 Transforming Variables
Consider the gapminder data again. Possibly we don’t want to add a new variable that gives life expectancy in months, but rather want to modify the existing variable to measure life expectancy in months. Here are two ways to accomplish this.
[1] 28.80 30.33 32.00 34.02 36.09
[1] 345.6 364.0 384.0 408.2 433.1
[1] 28.80 30.33 32.00 34.02 36.09
[1] 345.6 364.0 384.0 408.2 433.1
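The code that produced the output above is not shown here; a minimal sketch of the two approaches, using a toy data frame in place of gapminder, is:

```r
# Toy stand-in for gapminder (made-up values).
d <- data.frame(lifeExp = c(28.80, 30.33, 32.00))

# Way 1: overwrite the column directly with $.
d1 <- d
d1$lifeExp <- d1$lifeExp * 12

# Way 2: within() evaluates the expression inside the data frame
# and returns the modified copy.
d2 <- within(d, lifeExp <- lifeExp * 12)
```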
6.5 Rearranging Variables
Consider the full weather data set again.
> u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
> WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE,
+ stringsAsFactors = FALSE,
+ na.strings = "")
> names(WeatherKLAN2014Full)
[1] "EST"
[2] "Max.TemperatureF"
[3] "Mean.TemperatureF"
[4] "Min.TemperatureF"
[5] "Max.Dew.PointF"
[6] "MeanDew.PointF"
[7] "Min.DewpointF"
[8] "Max.Humidity"
[9] "Mean.Humidity"
[10] "Min.Humidity"
[11] "Max.Sea.Level.PressureIn"
[12] "Mean.Sea.Level.PressureIn"
[13] "Min.Sea.Level.PressureIn"
[14] "Max.VisibilityMiles"
[15] "Mean.VisibilityMiles"
[16] "Min.VisibilityMiles"
[17] "Max.Wind.SpeedMPH"
[18] "Mean.Wind.SpeedMPH"
[19] "Max.Gust.SpeedMPH"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
If we want the wind speed variables to come right after the date, we can again use subsetting.
> WeatherKLAN2014Full <- WeatherKLAN2014Full[c(1, 17, 18,
+ 19, 2:16, 20:23)]
> names(WeatherKLAN2014Full)
[1] "EST"
[2] "Max.Wind.SpeedMPH"
[3] "Mean.Wind.SpeedMPH"
[4] "Max.Gust.SpeedMPH"
[5] "Max.TemperatureF"
[6] "Mean.TemperatureF"
[7] "Min.TemperatureF"
[8] "Max.Dew.PointF"
[9] "MeanDew.PointF"
[10] "Min.DewpointF"
[11] "Max.Humidity"
[12] "Mean.Humidity"
[13] "Min.Humidity"
[14] "Max.Sea.Level.PressureIn"
[15] "Mean.Sea.Level.PressureIn"
[16] "Min.Sea.Level.PressureIn"
[17] "Max.VisibilityMiles"
[18] "Mean.VisibilityMiles"
[19] "Min.VisibilityMiles"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
6.6 Reshaping Data
A data set can be represented in several different formats. Consider a (fictitious) data set on incomes of three people during three different years. Here is one representation of the data:
name income1990 income2000 income2010
1 John Smith 29784 39210 41213
2 Jane Doe 56789 89321 109321
3 Albert Jones 2341 34567 56781
Here is another representation of the same data:
name year income
1 John Smith income1990 29784
2 Jane Doe income1990 56789
3 Albert Jones income1990 2341
4 John Smith income2000 39210
5 Jane Doe income2000 89321
6 Albert Jones income2000 34567
7 John Smith income2010 41213
8 Jane Doe income2010 109321
9 Albert Jones income2010 56781
For hopefully obvious reasons, the first representation is called a wide representation of the data, and the second is called a long representation. Each has its merits. The first representation is probably easier for people to read, while the second is often the form needed for analysis by statistical software such as R. There are of course other representations. For example the rows and columns could be interchanged to create a different wide representation, or the long representation, which currently groups data by year, could group by name instead.
Whatever the relative merits of wide and long representations of data, transforming data from wide to long or from long to wide is often required. As with many tasks, there are several ways to accomplish this in R. We will focus on the tidyr package, written by Hadley Wickham, which performs these transformations and more.
6.6.1 tidyr
The R package tidyr has functions for converting data between formats. To illustrate its use, we examine a simple data set that explores the relationship between religion and income in the United States. The data come from a Pew survey, and are used in the tidyr documentation to illustrate transforming data from wide to long format.
> u.rel <- "https://www.finley-lab.com/files/data/religion2.csv"
> religion <- read.csv(u.rel, header = TRUE, stringsAsFactors = FALSE)
> head(religion)
religion under10k btw10and20k btw20and30k
1 Agnostic 27 34 60
2 Atheist 12 27 37
3 Buddhist 27 21 30
4 Catholic 418 617 732
5 DoNotKnowOrRefused 15 14 15
6 EvangelicalProt 575 869 1064
btw30and40k btw40and50k btw50and75k btw75and100k
1 81 76 137 122
2 52 35 70 73
3 34 33 58 62
4 670 638 1116 949
5 11 10 35 21
6 982 881 1486 949
btw100and150k over150k DoNotKnowOrRefused
1 109 84 96
2 59 74 76
3 39 53 54
4 792 633 1489
5 17 18 116
6 723 414 1529
As given, the columns include religion and income level, and there are counts for each of the combinations of religion and income level. For example, there are 27 people who are Agnostic and whose income is less than 10 thousand dollars, and there are 617 people who are Catholic and whose income is between 10 and 20 thousand dollars.
The gather() function can transform data from wide to long format.
> library(tidyr)
> religionLong <- gather(data = religion, key = IncomeLevel,
+ value = Frequency, 2:11)
> head(religionLong)
religion IncomeLevel Frequency
1 Agnostic under10k 27
2 Atheist under10k 12
3 Buddhist under10k 27
4 Catholic under10k 418
5 DoNotKnowOrRefused under10k 15
6 EvangelicalProt under10k 575
religion IncomeLevel Frequency
175 Muslim DoNotKnowOrRefused 22
176 Orthodox DoNotKnowOrRefused 73
177 OtherChristian DoNotKnowOrRefused 18
178 OtherFaiths DoNotKnowOrRefused 71
179 OtherWorldReligions DoNotKnowOrRefused 8
180 Unaffiliated DoNotKnowOrRefused 597
To use gather() we specified the data frame (data = religion), the name we want to give to the column created from the income levels (key = IncomeLevel), the name we want to give to the column containing the frequency values (value = Frequency), and the columns to gather (2:11).
Columns to be gathered can also be specified by name, and we can specify which columns to omit by putting a negative sign in front of the name(s). So the following creates an equivalent data frame:
> religionLong <- gather(data = religion, key = IncomeLevel,
+ value = Frequency, -religion)
> head(religionLong)
religion IncomeLevel Frequency
1 Agnostic under10k 27
2 Atheist under10k 12
3 Buddhist under10k 27
4 Catholic under10k 418
5 DoNotKnowOrRefused under10k 15
6 EvangelicalProt under10k 575
The spread() function is the inverse of gather(), transforming data from long back to wide format.
> religionWide <- spread(data = religionLong, key = IncomeLevel,
+ value = Frequency)
> head(religionWide)
religion btw100and150k btw10and20k
1 Agnostic 109 34
2 Atheist 59 27
3 Buddhist 39 21
4 Catholic 792 617
5 DoNotKnowOrRefused 17 14
6 EvangelicalProt 723 869
btw20and30k btw30and40k btw40and50k btw50and75k
1 60 81 76 137
2 37 52 35 70
3 30 34 33 58
4 732 670 638 1116
5 15 11 10 35
6 1064 982 881 1486
btw75and100k DoNotKnowOrRefused over150k under10k
1 122 96 84 27
2 73 76 74 12
3 62 54 53 27
4 949 1489 633 418
5 21 116 18 15
6 949 1529 414 575
Here we specify the data frame (religionLong), the key column (IncomeLevel) whose values become the new column names, and the column of values (Frequency) to be spread among the newly created columns. As can be seen, this particular call to spread() recovers the original data frame.
tidyr provides two other useful functions to separate and unite variables based on a delimiter. Consider again the yearlyIncomeLong table. Say we want to split the name variable into first and last name. This can be done using the separate() function.
> firstLast <- separate(data = yearlyIncomeLong, col = name,
+ into = c("first", "last"), sep="\\s")
> print(firstLast)
first last year income
1 John Smith income1990 29784
2 Jane Doe income1990 56789
3 Albert Jones income1990 2341
4 John Smith income2000 39210
5 Jane Doe income2000 89321
6 Albert Jones income2000 34567
7 John Smith income2010 41213
8 Jane Doe income2010 109321
9 Albert Jones income2010 56781
Now say you're not happy with that and you want to combine the name columns again, but this time separate the first and last name with an underscore. This is done using the unite() function.
> unite(data = firstLast, col = name, first, last, sep = "_")
name year income
1 John_Smith income1990 29784
2 Jane_Doe income1990 56789
3 Albert_Jones income1990 2341
4 John_Smith income2000 39210
5 Jane_Doe income2000 89321
6 Albert_Jones income2000 34567
7 John_Smith income2010 41213
8 Jane_Doe income2010 109321
9 Albert_Jones income2010 56781
6.6.2 Practice Problem
The separate() function from the tidyr package is especially useful when working with real data, as multiple pieces of information are often combined into one column of a data set. For example, consider the following data frame
birds recordingInfo
1 10 LA-2017-01-01
2 38 DF-2011-03-02
3 29 OG-2078-05-11
4 88 YA-2000-11-18
5 42 LA-2019-03-17
6 177 OG-2016-10-10
7 200 YA-2001-03-22
with a column recordingInfo that contains the site where the data were collected, as well as the year, month, and day the data were recorded. The data are coded as follows: site-year-month-day. Write a line of code that extracts the desired data from the data.birds data frame into separate columns named site, year, month, and day.
6.7 Manipulating Data with dplyr
Much of the effort (a figure of 80% is sometimes suggested) in data analysis is spent cleaning the data and getting it ready for analysis. Having effective tools for this task can save substantial time and effort. The R package dplyr
written by Hadley Wickham is designed, in Hadley’s words, to be “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Casting data analysis tasks in terms of “grammar” should be familiar from our work with the ggplot2
package, which was also authored by Hadley. Functions provided by dplyr
do in fact capture key data analysis actions (i.e., verbs). These functions include
- mutate() adds new variables that are functions of existing variables
- select() picks variables based on their names
- filter() picks cases based on their values
- summarize() reduces multiple values down to a single summary
- arrange() changes the ordering of the rows.
These all combine naturally with a group_by() function that allows you to perform any operation grouped by the values of one or more variables. All the tasks done using dplyr can be accomplished using tools already covered in this text; however, dplyr's functions provide a potentially more efficient and convenient framework. RStudio provides a convenient data wrangling cheat sheet that covers many aspects of the tidyr and dplyr packages.
This somewhat long section on dplyr
adapts the nice introduction by Jenny Bryan, available at http://stat545-ubc.github.io/block010_dplyr-end-single-table.html.
6.7.1 Improved Data Frames
The dplyr
package provides a couple of functions that offer improvements on data frames. First, tibble
creates a tibble from a series of vectors34. A tibble has two advantages over a data frame: when printed, a tibble shows only the first ten rows and the columns that fit on the page, along with some additional information about the table's dimensions, the data types of the variables, and the non-printed columns; and, while subsetting a data frame can sometimes return a vector rather than a data frame (if only one row or column results from the subset), a tibble does not have this behavior. Second, we can use the as_tibble()
function to convert an existing data frame or list into a tibble. Here is an example using the religionWide
data frame.35
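The output below corresponds to commands along these lines (a sketch reconstructed from the printed results; the name religionTibble is illustrative):

> head(religionWide)
> religionWide[, 1]
> religionTibble <- as_tibble(religionWide)
> head(religionTibble)
> religionTibble[, 1]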
religion btw100and150k btw10and20k
1 Agnostic 109 34
2 Atheist 59 27
3 Buddhist 39 21
4 Catholic 792 617
5 DoNotKnowOrRefused 17 14
6 EvangelicalProt 723 869
btw20and30k btw30and40k btw40and50k btw50and75k
1 60 81 76 137
2 37 52 35 70
3 30 34 33 58
4 732 670 638 1116
5 15 11 10 35
6 1064 982 881 1486
btw75and100k DoNotKnowOrRefused over150k under10k
1 122 96 84 27
2 73 76 74 12
3 62 54 53 27
4 949 1489 633 418
5 21 116 18 15
6 949 1529 414 575
[1] "Agnostic" "Atheist"
[3] "Buddhist" "Catholic"
[5] "DoNotKnowOrRefused" "EvangelicalProt"
[7] "Hindu" "HistoricallyBlackProt"
[9] "JehovahsWitness" "Jewish"
[11] "MainlineProt" "Mormon"
[13] "Muslim" "Orthodox"
[15] "OtherChristian" "OtherFaiths"
[17] "OtherWorldReligions" "Unaffiliated"
# A tibble: 6 x 11
religion btw100and150k btw10and20k btw20and30k
<chr> <int> <int> <int>
1 Agnostic 109 34 60
2 Atheist 59 27 37
3 Buddhist 39 21 30
4 Catholic 792 617 732
5 DoNotKn… 17 14 15
6 Evangel… 723 869 1064
# … with 7 more variables: btw30and40k <int>,
# btw40and50k <int>, btw50and75k <int>,
# btw75and100k <int>, DoNotKnowOrRefused <int>,
# over150k <int>, under10k <int>
# A tibble: 18 x 1
religion
<chr>
1 Agnostic
2 Atheist
3 Buddhist
4 Catholic
5 DoNotKnowOrRefused
6 EvangelicalProt
7 Hindu
8 HistoricallyBlackProt
9 JehovahsWitness
10 Jewish
11 MainlineProt
12 Mormon
13 Muslim
14 Orthodox
15 OtherChristian
16 OtherFaiths
17 OtherWorldReligions
18 Unaffiliated
As seen above, once the data frame is reduced to one dimension by subsetting to a single column, it is no longer a data frame; it has been simplified to a vector. This might not seem like a big deal; however, it can be very frustrating and can break your code when you expect an object to behave like a data frame and it doesn't because it is now a vector. In contrast, once we convert religionWide
to a tibble
via the as_tibble()
function, the object remains a data frame even when subset down to one dimension (there is no automatic simplification). Converting data frames using as_tibble()
is not required for using dplyr
but is convenient. Also, it is important to note that tibble
is simply a wrapper around a data frame that provides some additional behaviors. The newly formed tibble
object will still behave like a data frame (because it technically still is a data frame) but will have some added niceties (some of which are illustrated below).
6.7.2 Filtering Data by Row
Recall the gapminder
data. These data are available in tab-separated format in gapminder.tsv
, and can be read in using read.delim()
(or the related read functions described previously). The read.delim()
function defaults to header = TRUE
so this doesn’t need to be specified explicitly. In this section we will be working with the gapminder
data often, so we will use a short name for the data frame to save typing.
> u.gm <- "https://www.finley-lab.com/files/data/gapminder.tsv"
> gm <- read.delim(u.gm)
> gm <- as_tibble(gm)
> str(gm)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
# A tibble: 6 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghanistan 1952 8.43e6 Asia 28.8 779.
2 Afghanistan 1957 9.24e6 Asia 30.3 821.
3 Afghanistan 1962 1.03e7 Asia 32.0 853.
4 Afghanistan 1967 1.15e7 Asia 34.0 836.
5 Afghanistan 1972 1.31e7 Asia 36.1 740.
6 Afghanistan 1977 1.49e7 Asia 38.4 786.
Filtering helps us to examine subsets of the data such as data from a particular country, from several specified countries, from certain years, from countries with certain populations, etc. Some examples:
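For instance, the output below corresponds to calls along these lines (a sketch; the population threshold of 300 million is inferred from the printed results):

> filter(gm, country == "Brazil")
> filter(gm, country %in% c("Brazil", "Mexico"))
> filter(gm, country %in% c("Brazil", "Mexico"), year %in% c(1952, 1972))
> filter(gm, pop > 300000000)
> filter(gm, year == 2007, pop > 300000000)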
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1957 65551171 Americas 53.3 2487.
3 Brazil 1962 76039390 Americas 55.7 3337.
4 Brazil 1967 88049823 Americas 57.6 3430.
5 Brazil 1972 100840058 Americas 59.5 4986.
6 Brazil 1977 114313951 Americas 61.5 6660.
7 Brazil 1982 128962939 Americas 63.3 7031.
8 Brazil 1987 142938076 Americas 65.2 7807.
9 Brazil 1992 155975974 Americas 67.1 6950.
10 Brazil 1997 168546719 Americas 69.4 7958.
11 Brazil 2002 179914212 Americas 71.0 8131.
12 Brazil 2007 190010647 Americas 72.4 9066.
# A tibble: 24 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1957 65551171 Americas 53.3 2487.
3 Brazil 1962 76039390 Americas 55.7 3337.
4 Brazil 1967 88049823 Americas 57.6 3430.
5 Brazil 1972 100840058 Americas 59.5 4986.
6 Brazil 1977 114313951 Americas 61.5 6660.
7 Brazil 1982 128962939 Americas 63.3 7031.
8 Brazil 1987 142938076 Americas 65.2 7807.
9 Brazil 1992 155975974 Americas 67.1 6950.
10 Brazil 1997 168546719 Americas 69.4 7958.
# … with 14 more rows
# A tibble: 4 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1972 100840058 Americas 59.5 4986.
3 Mexico 1952 30144317 Americas 50.8 3478.
4 Mexico 1972 55984294 Americas 62.4 6809.
# A tibble: 25 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 China 1952 5.56e8 Asia 44 400.
2 China 1957 6.37e8 Asia 50.5 576.
3 China 1962 6.66e8 Asia 44.5 488.
4 China 1967 7.55e8 Asia 58.4 613.
5 China 1972 8.62e8 Asia 63.1 677.
6 China 1977 9.43e8 Asia 64.0 741.
7 China 1982 1.00e9 Asia 65.5 962.
8 China 1987 1.08e9 Asia 67.3 1379.
9 China 1992 1.16e9 Asia 68.7 1656.
10 China 1997 1.23e9 Asia 70.4 2289.
# … with 15 more rows
# A tibble: 3 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 China 2007 1.32e9 Asia 73.0 4959.
2 India 2007 1.11e9 Asia 64.7 2452.
3 United St… 2007 3.01e8 Americas 78.2 42952.
Notice the full results are not printed. For example, when we asked for the data for Brazil and Mexico, only the first ten rows were printed. This is an effect of using the as_tibble()
function. Of course if we wanted to analyze the results (as we will below) the full set of data would be available.
6.7.3 Selecting Variables by Column
Continuing with the gapminder
data, another common task is to restrict attention to some subset of variables in the data set. The select()
function does this.
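The output below corresponds to calls along these lines (a sketch reconstructed from the printed results):

> select(gm, country, year, lifeExp)
> select(gm, 2:4)
> select(gm, -c(year, pop, continent))
> select(gm, starts_with("co"))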
# A tibble: 1,704 x 3
country year lifeExp
<fct> <int> <dbl>
1 Afghanistan 1952 28.8
2 Afghanistan 1957 30.3
3 Afghanistan 1962 32.0
4 Afghanistan 1967 34.0
5 Afghanistan 1972 36.1
6 Afghanistan 1977 38.4
7 Afghanistan 1982 39.9
8 Afghanistan 1987 40.8
9 Afghanistan 1992 41.7
10 Afghanistan 1997 41.8
# … with 1,694 more rows
# A tibble: 1,704 x 3
year pop continent
<int> <dbl> <fct>
1 1952 8425333 Asia
2 1957 9240934 Asia
3 1962 10267083 Asia
4 1967 11537966 Asia
5 1972 13079460 Asia
6 1977 14880372 Asia
7 1982 12881816 Asia
8 1987 13867957 Asia
9 1992 16317921 Asia
10 1997 22227415 Asia
# … with 1,694 more rows
# A tibble: 1,704 x 3
country lifeExp gdpPercap
<fct> <dbl> <dbl>
1 Afghanistan 28.8 779.
2 Afghanistan 30.3 821.
3 Afghanistan 32.0 853.
4 Afghanistan 34.0 836.
5 Afghanistan 36.1 740.
6 Afghanistan 38.4 786.
7 Afghanistan 39.9 978.
8 Afghanistan 40.8 852.
9 Afghanistan 41.7 649.
10 Afghanistan 41.8 635.
# … with 1,694 more rows
# A tibble: 1,704 x 2
country continent
<fct> <fct>
1 Afghanistan Asia
2 Afghanistan Asia
3 Afghanistan Asia
4 Afghanistan Asia
5 Afghanistan Asia
6 Afghanistan Asia
7 Afghanistan Asia
8 Afghanistan Asia
9 Afghanistan Asia
10 Afghanistan Asia
# … with 1,694 more rows
Notice a few things. Variables can be selected by name or column number. As usual, a negative sign tells R to leave something out. And there are special functions such as starts_with
that provide ways to match part of a variable’s name.
6.7.4 Practice Problem
Use the contains()
function to select only the columns whose names contain a c.
6.7.5 Pipes
Consider selecting the country, year, and population for countries in Asia or Europe. One possibility is to nest a filter()
function inside a select()
function.
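A nested call along these lines (a sketch) produces the output below:

> select(filter(gm, continent %in% c("Asia", "Europe")), country, year, pop)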
# A tibble: 756 x 3
country year pop
<fct> <int> <dbl>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
9 Afghanistan 1992 16317921
10 Afghanistan 1997 22227415
# … with 746 more rows
Even a two-step process like this becomes hard to follow in this nested form, and often we will want to perform more than two operations. There is a nice feature in dplyr
that allows us to “feed” results of one function into the first argument of a subsequent function. Another way of saying this is that we are “piping” the results into another function. The %>%
operator does the piping. Here we again restrict attention to country, year, and population for countries in Asia or Europe36.
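A piped version along these lines (a sketch) produces the output below:

> gm %>%
+   filter(continent %in% c("Asia", "Europe")) %>%
+   select(country, year, pop)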
# A tibble: 756 x 3
country year pop
<fct> <int> <dbl>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
9 Afghanistan 1992 16317921
10 Afghanistan 1997 22227415
# … with 746 more rows
It can help to think of %>%
as representing the word “then”. The above can be read as, “Start with the data frame gm
then filter it to select data from the continents Asia and Europe then select the variables country, year, and population from these data”.
The pipe operator %>%
is not restricted to functions in dplyr
. In fact the pipe operator itself was introduced in another package called magrittr
, but is included in dplyr
as a convenience.
6.7.6 Arranging Data by Row
By default the gapminder
data are arranged by country and then by year.
# A tibble: 15 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghanist… 1952 8.43e6 Asia 28.8 779.
2 Afghanist… 1957 9.24e6 Asia 30.3 821.
3 Afghanist… 1962 1.03e7 Asia 32.0 853.
4 Afghanist… 1967 1.15e7 Asia 34.0 836.
5 Afghanist… 1972 1.31e7 Asia 36.1 740.
6 Afghanist… 1977 1.49e7 Asia 38.4 786.
7 Afghanist… 1982 1.29e7 Asia 39.9 978.
8 Afghanist… 1987 1.39e7 Asia 40.8 852.
9 Afghanist… 1992 1.63e7 Asia 41.7 649.
10 Afghanist… 1997 2.22e7 Asia 41.8 635.
11 Afghanist… 2002 2.53e7 Asia 42.1 727.
12 Afghanist… 2007 3.19e7 Asia 43.8 975.
13 Albania 1952 1.28e6 Europe 55.2 1601.
14 Albania 1957 1.48e6 Europe 59.3 1942.
15 Albania 1962 1.73e6 Europe 64.8 2313.
Possibly we would want the data arranged by year and then by country. The arrange()
function makes this easy. We will again use pipes.
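A call along these lines (a sketch) produces the output below:

> gm %>%
+   arrange(year, country)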
# A tibble: 1,704 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghanist… 1952 8.43e6 Asia 28.8 779.
2 Albania 1952 1.28e6 Europe 55.2 1601.
3 Algeria 1952 9.28e6 Africa 43.1 2449.
4 Angola 1952 4.23e6 Africa 30.0 3521.
5 Argentina 1952 1.79e7 Americas 62.5 5911.
6 Australia 1952 8.69e6 Oceania 69.1 10040.
7 Austria 1952 6.93e6 Europe 66.8 6137.
8 Bahrain 1952 1.20e5 Asia 50.9 9867.
9 Bangladesh 1952 4.69e7 Asia 37.5 684.
10 Belgium 1952 8.73e6 Europe 68 8343.
# … with 1,694 more rows
How about the data for Rwanda, arranged in order of life expectancy?
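A call along these lines (a sketch) produces the output below:

> gm %>%
+   filter(country == "Rwanda") %>%
+   arrange(lifeExp)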
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Rwanda 1992 7290203 Africa 23.6 737.
2 Rwanda 1997 7212583 Africa 36.1 590.
3 Rwanda 1952 2534927 Africa 40 493.
4 Rwanda 1957 2822082 Africa 41.5 540.
5 Rwanda 1962 3051242 Africa 43 597.
6 Rwanda 2002 7852401 Africa 43.4 786.
7 Rwanda 1987 6349365 Africa 44.0 848.
8 Rwanda 1967 3451079 Africa 44.1 511.
9 Rwanda 1972 3992121 Africa 44.6 591.
10 Rwanda 1977 4657072 Africa 45 670.
11 Rwanda 1982 5507565 Africa 46.2 882.
12 Rwanda 2007 8860588 Africa 46.2 863.
Possibly we want these data to be in decreasing (descending) order. Here, desc()
is one of many dplyr
helper functions.
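For example (a sketch of the call that produced the output below):

> gm %>%
+   filter(country == "Rwanda") %>%
+   arrange(desc(lifeExp))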
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Rwanda 2007 8860588 Africa 46.2 863.
2 Rwanda 1982 5507565 Africa 46.2 882.
3 Rwanda 1977 4657072 Africa 45 670.
4 Rwanda 1972 3992121 Africa 44.6 591.
5 Rwanda 1967 3451079 Africa 44.1 511.
6 Rwanda 1987 6349365 Africa 44.0 848.
7 Rwanda 2002 7852401 Africa 43.4 786.
8 Rwanda 1962 3051242 Africa 43 597.
9 Rwanda 1957 2822082 Africa 41.5 540.
10 Rwanda 1952 2534927 Africa 40 493.
11 Rwanda 1997 7212583 Africa 36.1 590.
12 Rwanda 1992 7290203 Africa 23.6 737.
Possibly we want to include only the year and life expectancy, to make the message more stark.
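For example (a sketch of the call that produced the output below):

> gm %>%
+   filter(country == "Rwanda") %>%
+   select(year, lifeExp) %>%
+   arrange(desc(lifeExp))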
# A tibble: 12 x 2
year lifeExp
<int> <dbl>
1 2007 46.2
2 1982 46.2
3 1977 45
4 1972 44.6
5 1967 44.1
6 1987 44.0
7 2002 43.4
8 1962 43
9 1957 41.5
10 1952 40
11 1997 36.1
12 1992 23.6
For analyzing data in R, the order shouldn’t matter. But for presentation to human eyes, the order is important.
6.7.7 Practice Problem
It is worth your while to get comfortable with using pipes. Here is some hard-to-read code. Convert it into easier-to-read code by using pipes.
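A nested call along these lines (a reconstruction from the output below; the original chunk is not shown) is what needs untangling:

> arrange(select(filter(gm, country == "Afghanistan"), year, lifeExp), desc(year))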
# A tibble: 12 x 2
year lifeExp
<int> <dbl>
1 2007 43.8
2 2002 42.1
3 1997 41.8
4 1992 41.7
5 1987 40.8
6 1982 39.9
7 1977 38.4
8 1972 36.1
9 1967 34.0
10 1962 32.0
11 1957 30.3
12 1952 28.8
6.7.8 Renaming Variables
The dplyr
package has a rename
function that makes renaming variables in a data frame quite easy.
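For example, commands along these lines produce the output below (a sketch; note the result is assigned back to gm, since later chunks refer to the population column):

> gm <- rename(gm, population = pop)
> head(gm)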
# A tibble: 6 x 6
country year population continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghani… 1952 8425333 Asia 28.8 779.
2 Afghani… 1957 9240934 Asia 30.3 821.
3 Afghani… 1962 10267083 Asia 32.0 853.
4 Afghani… 1967 11537966 Asia 34.0 836.
5 Afghani… 1972 13079460 Asia 36.1 740.
6 Afghani… 1977 14880372 Asia 38.4 786.
6.7.9 Data Summaries and Grouping
The summarize()
function computes summary statistics, using user-provided functions, for one or more columns of data in a data frame.
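The two (identical) outputs below correspond to calling summarize() directly and via a pipe (a sketch):

> summarize(gm, meanpop = mean(population), medpop = median(population))
> gm %>%
+   summarize(meanpop = mean(population), medpop = median(population))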
# A tibble: 1 x 2
meanpop medpop
<dbl> <dbl>
1 29601212. 7023596.
# A tibble: 1 x 2
meanpop medpop
<dbl> <dbl>
1 29601212. 7023596.
Often we want summaries for specific components of the data. For example, we might want the median life expectancy for each continent separately. One option is subsetting:
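For example (a sketch; one call per continent, in the order of the output below):

> median(gm$lifeExp[gm$continent == "Africa"])
> median(gm$lifeExp[gm$continent == "Asia"])
> median(gm$lifeExp[gm$continent == "Europe"])
> median(gm$lifeExp[gm$continent == "Americas"])
> median(gm$lifeExp[gm$continent == "Oceania"])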
[1] 47.79
[1] 61.79
[1] 72.24
[1] 67.05
[1] 73.66
The group_by()
function makes this easier, and makes the output more useful.
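A call along these lines (a sketch) produces the output below:

> gm %>%
+   group_by(continent) %>%
+   summarize(medLifeExp = median(lifeExp))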
# A tibble: 5 x 2
continent medLifeExp
<fct> <dbl>
1 Africa 47.8
2 Americas 67.0
3 Asia 61.8
4 Europe 72.2
5 Oceania 73.7
Or if we want the results ordered by the median life expectancy:
> gm %>%
+ group_by(continent) %>%
+ summarize(medLifeExp = median(lifeExp)) %>%
+ arrange(medLifeExp)
# A tibble: 5 x 2
continent medLifeExp
<fct> <dbl>
1 Africa 47.8
2 Asia 61.8
3 Americas 67.0
4 Europe 72.2
5 Oceania 73.7
As another example, we calculate the number of observations we have per continent (using the n()
helper function), and then, within each continent, how many distinct countries are represented (using n_distinct()
).
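Calls along these lines (a sketch) produce the two outputs below:

> gm %>%
+   group_by(continent) %>%
+   summarize(numObs = n())
> gm %>%
+   group_by(continent) %>%
+   summarize(n_obs = n(), n_countries = n_distinct(country))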
# A tibble: 5 x 2
continent numObs
<fct> <int>
1 Africa 624
2 Americas 300
3 Asia 396
4 Europe 360
5 Oceania 24
# A tibble: 5 x 3
continent n_obs n_countries
<fct> <int> <int>
1 Africa 624 52
2 Americas 300 25
3 Asia 396 33
4 Europe 360 30
5 Oceania 24 2
Here is a somewhat more involved example that calculates the minimum and maximum life expectancies for countries in Africa by year.
> gm %>%
+ filter(continent == "Africa") %>%
+ group_by(year) %>%
+ summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))
# A tibble: 12 x 3
year min_lifeExp max_lifeExp
<int> <dbl> <dbl>
1 1952 30 52.7
2 1957 31.6 58.1
3 1962 32.8 60.2
4 1967 34.1 61.6
5 1972 35.4 64.3
6 1977 36.8 67.1
7 1982 38.4 69.9
8 1987 39.9 71.9
9 1992 23.6 73.6
10 1997 36.1 74.8
11 2002 39.2 75.7
12 2007 39.6 76.4
This is interesting, but the results don't include the countries that achieved the minimum and maximum life expectancies. Here is one way to include them. We will start with the minimum life expectancy. Note that the rank of the minimum value will be 1.
> gm %>%
+ select(country, continent, year, lifeExp) %>%
+ group_by(year) %>%
+ arrange(year) %>%
+ filter(rank(lifeExp) == 1)
# A tibble: 12 x 4
# Groups: year [12]
country continent year lifeExp
<fct> <fct> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Afghanistan Asia 1957 30.3
3 Afghanistan Asia 1962 32.0
4 Afghanistan Asia 1967 34.0
5 Sierra Leone Africa 1972 35.4
6 Cambodia Asia 1977 31.2
7 Sierra Leone Africa 1982 38.4
8 Angola Africa 1987 39.9
9 Rwanda Africa 1992 23.6
10 Rwanda Africa 1997 36.1
11 Zambia Africa 2002 39.2
12 Swaziland Africa 2007 39.6
Next we add the maximum life expectancy. Here we need to better understand the desc()
function, which transforms a vector into a numeric vector whose ascending sort order corresponds to the original values in descending order (for numeric input it simply negates the values). Here are some examples.
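The three outputs below correspond to calls along these lines (a sketch inferred from the printed results):

> desc(1:5)
> desc(c(2, 3, 1, 5, 6, -4))
> desc(c(1, 3, 2, 5, 4))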
[1] -1 -2 -3 -4 -5
[1] -2 -3 -1 -5 -6 4
[1] -1 -3 -2 -5 -4
We now use this to extract the maximum life expectancy. Recall that |
represents “or”. Also by default only the first few rows of a tibble
object will be printed. To see all the rows we pipe the output to print(n = 24)
to ask for all 24 rows to be printed.
> gm %>%
+ select(country, continent, year, lifeExp) %>%
+ group_by(year) %>%
+ arrange(year) %>%
+ filter(rank(lifeExp) == 1 | rank(desc(lifeExp)) == 1) %>%
+ print(n=24)
# A tibble: 24 x 4
# Groups: year [12]
country continent year lifeExp
<fct> <fct> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Norway Europe 1952 72.7
3 Afghanistan Asia 1957 30.3
4 Iceland Europe 1957 73.5
5 Afghanistan Asia 1962 32.0
6 Iceland Europe 1962 73.7
7 Afghanistan Asia 1967 34.0
8 Sweden Europe 1967 74.2
9 Sierra Leone Africa 1972 35.4
10 Sweden Europe 1972 74.7
11 Cambodia Asia 1977 31.2
12 Iceland Europe 1977 76.1
13 Japan Asia 1982 77.1
14 Sierra Leone Africa 1982 38.4
15 Angola Africa 1987 39.9
16 Japan Asia 1987 78.7
17 Japan Asia 1992 79.4
18 Rwanda Africa 1992 23.6
19 Japan Asia 1997 80.7
20 Rwanda Africa 1997 36.1
21 Japan Asia 2002 82
22 Zambia Africa 2002 39.2
23 Japan Asia 2007 82.6
24 Swaziland Africa 2007 39.6
6.7.10 Creating New Variables
The $
notation provides a simple way to create new variables in a data frame. The mutate()
function provides another, sometimes cleaner way to do this. We will use mutate()
along with the lag()
function to investigate changes in life expectancy over five years for the gapminder
data. We’ll do this in a few steps. First, we create a variable that measures the change in life expectancy and remove the population and GDP variables that are not of interest. We have to be careful to first group by country, since we want to calculate the change in life expectancy by country.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap))
# A tibble: 1,704 x 5
# Groups: country [142]
country year continent lifeExp changeLifeExp
<fct> <int> <fct> <dbl> <dbl>
1 Afghanistan 1952 Asia 28.8 NA
2 Afghanistan 1957 Asia 30.3 1.53
3 Afghanistan 1962 Asia 32.0 1.66
4 Afghanistan 1967 Asia 34.0 2.02
5 Afghanistan 1972 Asia 36.1 2.07
6 Afghanistan 1977 Asia 38.4 2.35
7 Afghanistan 1982 Asia 39.9 1.42
8 Afghanistan 1987 Asia 40.8 0.968
9 Afghanistan 1992 Asia 41.7 0.852
10 Afghanistan 1997 Asia 41.8 0.0890
# … with 1,694 more rows
Next, summarize by computing the largest drop in life expectancy.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap)) %>%
+ summarize(largestDropLifeExp = min(changeLifeExp))
# A tibble: 142 x 2
country largestDropLifeExp
<fct> <dbl>
1 Afghanistan NA
2 Albania NA
3 Algeria NA
4 Angola NA
5 Argentina NA
6 Australia NA
7 Austria NA
8 Bahrain NA
9 Bangladesh NA
10 Belgium NA
# … with 132 more rows
Oops. We forgot that since we don’t have data from before 1952, the first drop will be NA
. Let’s try again.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap)) %>%
+ summarize(largestDropLifeExp = min(changeLifeExp, na.rm = TRUE))
# A tibble: 142 x 2
country largestDropLifeExp
<fct> <dbl>
1 Afghanistan 0.0890
2 Albania -0.419
3 Algeria 1.31
4 Angola -0.036
5 Argentina 0.492
6 Australia 0.170
7 Austria 0.490
8 Bahrain 0.84
9 Bangladesh 1.67
10 Belgium 0.5
# … with 132 more rows
That’s not quite what we wanted. We could arrange the results by the life expectancy drop, but it would be good to have both the continent and year printed out also. So we’ll take a slightly different approach, by arranging the results in increasing order.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap)) %>%
+ arrange(changeLifeExp)
# A tibble: 1,704 x 5
# Groups: country [142]
country year continent lifeExp changeLifeExp
<fct> <int> <fct> <dbl> <dbl>
1 Rwanda 1992 Africa 23.6 -20.4
2 Zimbabwe 1997 Africa 46.8 -13.6
3 Lesotho 2002 Africa 44.6 -11.0
4 Swaziland 2002 Africa 43.9 -10.4
5 Botswana 1997 Africa 52.6 -10.2
6 Cambodia 1977 Asia 31.2 -9.10
7 Namibia 2002 Africa 51.5 -7.43
8 South Africa 2002 Africa 53.4 -6.87
9 Zimbabwe 2002 Africa 40.0 -6.82
10 China 1962 Asia 44.5 -6.05
# … with 1,694 more rows
That’s still not quite right. Because the data are grouped by country, R did the ordering within group. If we want to see the largest drops overall, we need to remove the grouping.
> gm %>% group_by(country) %>% mutate(changeLifeExp = lifeExp -
+ lag(lifeExp, order_by = year)) %>% select(-c(population,
+ gdpPercap)) %>% ungroup() %>% arrange(changeLifeExp) %>%
+ print(n = 20)
# A tibble: 1,704 x 5
country year continent lifeExp changeLifeExp
<fct> <int> <fct> <dbl> <dbl>
1 Rwanda 1992 Africa 23.6 -20.4
2 Zimbabwe 1997 Africa 46.8 -13.6
3 Lesotho 2002 Africa 44.6 -11.0
4 Swaziland 2002 Africa 43.9 -10.4
5 Botswana 1997 Africa 52.6 -10.2
6 Cambodia 1977 Asia 31.2 -9.10
7 Namibia 2002 Africa 51.5 -7.43
8 South Africa 2002 Africa 53.4 -6.87
9 Zimbabwe 2002 Africa 40.0 -6.82
10 China 1962 Asia 44.5 -6.05
11 Botswana 2002 Africa 46.6 -5.92
12 Zambia 1997 Africa 40.2 -5.86
13 Iraq 1992 Asia 59.5 -5.58
14 Liberia 1992 Africa 40.8 -5.23
15 Cambodia 1972 Asia 40.3 -5.10
16 Kenya 1997 Africa 54.4 -4.88
17 Somalia 1992 Africa 39.7 -4.84
18 Zambia 1992 Africa 46.1 -4.72
19 Swaziland 2007 Africa 39.6 -4.26
20 Uganda 1997 Africa 44.6 -4.25
# … with 1,684 more rows
6.7.11 Practice Problem
As you progress in your data science career, we are sure you will find dplyr
to be one of the most useful packages for your initial data exploration. Here is one more Practice Problem to get more comfortable with the syntax.
Recall the iris
data set we have worked with multiple times. We want to look at the ratio of Sepal.Length
to Petal.Length
in the three different species. Write a series of dplyr
statements that groups the data by species, creates a new column called s.p.ratio
that is the Sepal.Length
divided by Petal.Length
, then computes the mean of this column for each species in a column called mean.ratio
. Display the data in descending order of mean.ratio
.
6.8 Exercises
Exercise 7 Learning objectives: introduce with()
, tapply()
, and cut()
functions; summarize data using the table()
function with logical subsetting; practice using factor data types.
Exercise 8 Learning objectives: work with messy data; import data from an external spreadsheet; practice using functions in tidyr
and graphing functions.
Exercise 9 Learning objectives: work with several key dplyr functions; manipulate data frames (actually tibbles); summarize and visualize data from large data files.
References
Spector, Phil. 2008. Data Manipulation with R. Use R! New York: Springer.
These data come from the MASS R library.↩
Reminds me of The Trouble with Tribbles↩
The text printed immediately after library(dplyr) means the stats and base packages, which are automatically loaded when you start R, have functions with the same names as functions in dplyr. So, for example, if you call the filter() or lag() functions, R will use dplyr's versions. Use the :: operator to explicitly identify which package's function you want to use, e.g., if you want stats's lag() then call stats::lag().↩
Notice the indentation used in the code. This is not necessary, as the code could be all on one line, but I often find it easier to read in this more organized format.↩