Chapter 6 Working with Data
Bringing data into R, exporting data from R in a form readable by other software, cleaning and reshaping data, and other data manipulation tasks are an important and often overlooked component of data science. The book Spector (2008), while a few years old, is still an excellent reference for data-related issues. The manual available online at http://catalog.lib.msu.edu/record=b7254984~S39a is an up-to-date (and free) reference on importing a wide variety of datasets into R and on exporting data in various forms.
6.1 Reading Data into R
Data come in a dizzying variety of forms. They might be in a proprietary format such as an .xlsx Excel file, a .sav SPSS file, or a .mtw Minitab file. They might be structured using a relational model comprising many tables that need to be connected via key-value pairs. They might be in a data-interchange format such as JSON (JavaScript Object Notation), or a markup language such as XML (Extensible Markup Language), perhaps with a specialized standard such as EML (Ecological Metadata Language) for describing ecological information. Both XML and EML are common metadata formats (i.e., data that provide information about other data). Fortunately, many datasets are (or can be) saved as plain text files, and most software can both read and write such files, so our initial focus will be on reading plain text files into R and saving data from R in plain text format. RStudio provides a handy data import cheat sheet for many of the read functions detailed in this section.
The foreign R package provides functions to read data saved in some of the proprietary formats directly into R, which is sometimes unavoidable, but if possible it is better to save data from the other package as plain text and then read that plain text file into R. Methods for reading web-based data sets into R are discussed in Chapter 10.
The function read.table() and its offshoots, such as read.csv(), are used to read rectangular data from a text file. For example, the file BrainAndBody.csv contains data on the brain weight, body weight, and name of some terrestrial animals. Here are the first few lines of that file:
body,brain,name
1.35,8.1,Mountain beaver
465,423,Cow
36.33,119.5,Grey wolf
27.66,115,Goat
1.04,5.5,Guinea pig
As is evident, the first line of the file contains the names of the three variables, separated (delimited) by commas. Each subsequent line contains the body weight, brain weight, and name of a specific terrestrial animal.
This file is accessible at the URL https://www.finley-lab.com/files/data/BrainAndBody.csv. The read.table() function is used to read these data into an R data frame.
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.csv"
> BrainBody <- read.table(file = u.bb, header = TRUE, sep = ",",
+ stringsAsFactors = FALSE)
> head(BrainBody)
body brain name
1 1.35 8.1 Mountain beaver
2 465.00 423.0 Cow
3 36.33 119.5 Grey wolf
4 27.66 115.0 Goat
5 1.04 5.5 Guinea pig
6 11700.00 50.0 Dipliodocus
The arguments used in this call to read.table() include:
- file = u.bb, which tells R the location of the file. In this case the string https://www.finley-lab.com/files/data/BrainAndBody.csv giving the location is rather long, so it was first assigned to the object u.bb.
- header = TRUE, which tells R that the first line of the file gives the names of the variables.
- sep = ",", which tells R that a comma separates the fields in the file.
- stringsAsFactors = FALSE, which tells R not to convert character vectors to factors.
The function read.csv() is the same as read.table() except that its default separator is a comma, whereas the default separator for read.table() is whitespace.
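As a quick check of this equivalence, we can read the same comma-separated text both ways using the text argument (the data values here are made up):

```r
# read.csv() is read.table() with header = TRUE and sep = "," as defaults,
# so these two calls produce identical data frames.
csv.text <- "x,y\n1,a\n2,b"
d1 <- read.csv(text = csv.text, stringsAsFactors = FALSE)
d2 <- read.table(text = csv.text, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
identical(d1, d2)
```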
The file BrainAndBody.tsv contains the same data, except a tab is used in place of a comma to separate fields. The only change needed to read in these data is in the sep argument (and of course the file argument, since the data are stored in a different file):
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.tsv"
> BrainBody2 <- read.table(file = u.bb, header = TRUE, sep = "\t",
+ stringsAsFactors = FALSE)
> head(BrainBody2)
body brain name
1 1.35 8.1 Mountain beaver
2 465.00 423.0 Cow
3 36.33 119.5 Grey wolf
4 27.66 115.0 Goat
5 1.04 5.5 Guinea pig
6 11700.00 50.0 Dipliodocus
File extensions, e.g., .csv or .tsv, are naming conventions only and are there to remind us how the columns are separated. In other words, they have no influence on R's file read functions.
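To see this, we can write comma-separated values to a file whose name ends in .txt and read it back with read.csv() (the file contents here are made up):

```r
# The extension is only a convention; read.csv() cares about the content
# of the file, not its name.
f <- tempfile(fileext = ".txt")
writeLines(c("x,y", "1,a", "2,b"), f)
d <- read.csv(f, stringsAsFactors = FALSE)
d
```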
A third file, BrainAndBody.txt, contains the same data, but also contains a few lines of explanatory text above the names of the variables. It also uses whitespace rather than a comma or a tab as a separator. Here are the first several lines of the file.
This file contains data
on brain and body
weights of several terrestrial animals
"body" "brain" "name"
1.35 8.1 "Mountain beaver"
465 423 "Cow"
36.33 119.5 "Grey wolf"
27.66 115 "Goat"
1.04 5.5 "Guinea pig"
11700 50 "Dipliodocus"
2547 4603 "Asian elephant"
Notice that in this file the values of name are put inside quotation marks. This is necessary because otherwise R would (reasonably) assume the first data line contained the values of four variables (1.35, 8.1, Mountain, and beaver), while in reality only three values are intended, with Mountain beaver being the third.
To read in this file we need to tell R to skip the explanatory lines at the top of the file and to use whitespace as the separator. The skip argument handles the first, and the sep argument the second. First let's see what happens if we don't use the skip argument.
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
> BrainBody3 <- read.table(u.bb, header = TRUE, sep = " ",
+ stringsAsFactors = FALSE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 5 elements
R assumed the first line of the file contained the variable names, since header = TRUE was specified, and counted four: This, file, contains, and data. So in the first line of actual data, R expected four columns containing data plus possibly a fifth column containing row names for the data set, and complained that "line 1 did not have 5 elements." The error message is somewhat mysterious, since it starts with "Error in scan." This happens because read.table() actually uses a more basic R function called scan() to do the work.
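A direct call to scan() illustrates the lower-level behavior: it reads a stream of values of a single type, and read.table() assembles its columns on top of it (the values here are made up):

```r
# scan() reads whitespace-separated values of a single type; quiet = TRUE
# suppresses the "Read 3 items" message.
vals <- scan(text = "1.5 2.5 3.5", what = numeric(), quiet = TRUE)
vals
```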
Here’s how to read in the file correctly.
> u.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
> BrainBody3 <- read.table(u.bb, header = TRUE, sep = " ",
+ stringsAsFactors = FALSE, skip = 4)
> BrainBody3[1:10,]
body brain name
1 1.35 8.1 Mountain beaver
2 465.00 423.0 Cow
3 36.33 119.5 Grey wolf
4 27.66 115.0 Goat
5 1.04 5.5 Guinea pig
6 11700.00 50.0 Dipliodocus
7 2547.00 4603.0 Asian elephant
8 187.10 419.0 Donkey
9 521.00 655.0 Horse
10 10.00 115.0 Potar monkey
6.2 Reading Data with Missing Observations
Missing data are represented in many ways. Sometimes a missing data point is just that, i.e., the place where it should be in the file is blank. Other times specific numbers such as \(-9999\) or specific symbols are used. The read.table() function has an argument na.strings that allows the user to specify how missing data are indicated in the source file.
The site http://www.wunderground.com/history/ makes weather data available for locations around the world, with dates going back to 1945. The file WeatherKLAN2014.csv contains weather data for Lansing, Michigan for the year 2014. Here are the first few lines of that file:
EST,Max TemperatureF,Min TemperatureF, Events
1/1/14,14,9,Snow
1/2/14,13,-3,Snow
1/3/14,13,-11,Snow
1/4/14,31,13,Snow
1/5/14,29,16,Fog-Snow
1/6/14,16,-12,Fog-Snow
1/7/14,2,-13,Snow
1/8/14,17,-1,Snow
1/9/14,21,2,Snow
1/10/14,39,21,Fog-Rain-Snow
1/11/14,41,32,Fog-Rain
1/12/14,39,31,
Look at the last line, and notice that instead of an Event such as Snow or Fog-Snow there is nothing after the comma. This observation is missing, but rather than using an explicit code such as NA, the site just leaves that entry blank. To read these data into R we supply the argument na.strings = "", which tells R that the file indicates missing data by leaving the appropriate entry blank.
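A small self-contained illustration (inline text with made-up values): na.strings can map a blank field, or any other code such as -9999, to NA.

```r
# Both the empty string and the code -9999 are converted to NA on input.
d <- read.csv(text = "x,y\n1,\n-9999,5",
              na.strings = c("", "-9999"))
d
```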
> u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014.csv"
> WeatherKLAN2014 <- read.csv(u.weather, header = TRUE,
+ stringsAsFactors = FALSE, na.strings = "")
> WeatherKLAN2014[1:15,]
EST Max.TemperatureF Min.TemperatureF
1 1/1/14 14 9
2 1/2/14 13 -3
3 1/3/14 13 -11
4 1/4/14 31 13
5 1/5/14 29 16
6 1/6/14 16 -12
7 1/7/14 2 -13
8 1/8/14 17 -1
9 1/9/14 21 2
10 1/10/14 39 21
11 1/11/14 41 32
12 1/12/14 39 31
13 1/13/14 44 34
14 1/14/14 37 26
15 1/15/14 27 18
Events
1 Snow
2 Snow
3 Snow
4 Snow
5 Fog-Snow
6 Fog-Snow
7 Snow
8 Snow
9 Snow
10 Fog-Rain-Snow
11 Fog-Rain
12 <NA>
13 Rain
14 Rain-Snow
15 Snow
6.3 Summarizing Data Frames
Some common data tasks include variable summaries such as means or standard deviations, transforming an existing variable, and creating new variables. As with many tasks, there are several ways to accomplish each of these.
6.3.1 Column (and Row) Summaries
The file WeatherKLAN2014Full.csv contains a more complete set of weather variables than WeatherKLAN2014.csv, from the same source, http://www.wunderground.com/history.
> u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
> WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE,
+ stringsAsFactors = FALSE,
+ na.strings = "")
> names(WeatherKLAN2014Full)
[1] "EST"
[2] "Max.TemperatureF"
[3] "Mean.TemperatureF"
[4] "Min.TemperatureF"
[5] "Max.Dew.PointF"
[6] "MeanDew.PointF"
[7] "Min.DewpointF"
[8] "Max.Humidity"
[9] "Mean.Humidity"
[10] "Min.Humidity"
[11] "Max.Sea.Level.PressureIn"
[12] "Mean.Sea.Level.PressureIn"
[13] "Min.Sea.Level.PressureIn"
[14] "Max.VisibilityMiles"
[15] "Mean.VisibilityMiles"
[16] "Min.VisibilityMiles"
[17] "Max.Wind.SpeedMPH"
[18] "Mean.Wind.SpeedMPH"
[19] "Max.Gust.SpeedMPH"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
How can we compute the mean for each variable? One possibility is to do this a variable at a time:
> mean(WeatherKLAN2014Full$Mean.TemperatureF)
[1] 45.78
> mean(WeatherKLAN2014Full$Min.TemperatureF)
[1] 36.25
> mean(WeatherKLAN2014Full$Max.TemperatureF)
[1] 54.84
This is pretty inefficient. Fortunately there is a colMeans() function that computes the mean of each column (or a specified set of columns) in a data frame. Some columns in the current data frame are not numeric, and obviously we don't want to ask R to compute means for those columns. We use str() to investigate.
> str(WeatherKLAN2014Full)
'data.frame': 365 obs. of 23 variables:
$ EST : chr "2014-1-1" "2014-1-2" "2014-1-3" "2014-1-4" ...
$ Max.TemperatureF : int 14 13 13 31 29 16 2 17 21 39 ...
$ Mean.TemperatureF : int 12 5 1 22 23 2 -5 8 12 30 ...
$ Min.TemperatureF : int 9 -3 -11 13 16 -12 -13 -1 2 21 ...
$ Max.Dew.PointF : int 9 7 2 27 27 11 -6 7 18 37 ...
$ MeanDew.PointF : int 4 4 -5 18 21 -4 -13 1 8 28 ...
$ Min.DewpointF : int 0 -8 -14 3 11 -18 -18 -6 0 19 ...
$ Max.Humidity : int 88 76 83 92 92 80 78 88 88 100 ...
$ Mean.Humidity : int 76 70 68 73 86 73 72 78 75 92 ...
$ Min.Humidity : int 63 63 53 53 80 65 65 67 62 84 ...
$ Max.Sea.Level.PressureIn : num 30.4 30.4 30.5 30.1 30 ...
$ Mean.Sea.Level.PressureIn: num 30.3 30.2 30.4 30 29.9 ...
$ Min.Sea.Level.PressureIn : num 30.2 30.1 30.1 29.9 29.7 ...
$ Max.VisibilityMiles : int 10 9 10 10 4 10 10 10 10 9 ...
$ Mean.VisibilityMiles : int 4 4 10 6 1 2 6 10 7 3 ...
$ Min.VisibilityMiles : int 1 0 5 1 0 0 1 8 2 0 ...
$ Max.Wind.SpeedMPH : int 17 22 23 28 22 31 25 15 14 17 ...
$ Mean.Wind.SpeedMPH : int 9 13 10 15 11 18 15 7 6 10 ...
$ Max.Gust.SpeedMPH : int 22 30 32 36 30 40 31 18 17 22 ...
$ PrecipitationIn : chr "0.08" "0.01" "0.00" "0.12" ...
$ CloudCover : int 8 7 1 5 8 8 6 5 7 8 ...
$ Events : chr "Snow" "Snow" "Snow" "Snow" ...
$ WindDirDegrees : int 43 24 205 203 9 262 220 236 147 160 ...
It isn't surprising that EST and Events are not numeric, but it is surprising that PrecipitationIn, which measures precipitation in inches, is character rather than numeric. Let's investigate further.
> WeatherKLAN2014Full$PrecipitationIn[1:50]
[1] "0.08" "0.01" "0.00" "0.12" "0.78" "0.07" "T"
[8] "T" "0.01" "0.39" "0.16" "0.00" "0.00" "0.01"
[15] "T" "0.08" "T" "T" "T" "0.01" "0.00"
[22] "0.05" "T" "T" "0.07" "0.23" "0.04" "T"
[29] "T" "0.03" "T" "0.37" "T" "0.00" "T"
[36] "0.27" "0.01" "T" "0.04" "0.03" "T" "0.00"
[43] "0.00" "T" "T" "0.00" "0.02" "0.15" "0.08"
[50] "0.01"
Now it's clearer. The original data file included T in the precipitation column to represent a "trace" of precipitation, i.e., an amount greater than \(0\) but less than \(0.01\) inches. One possibility would be to set all these values to "0" and then convert the column to numeric. For now we will just leave the PrecipitationIn column out of the columns for which we request the mean.
> colMeans(WeatherKLAN2014Full[, c(2:19, 21, 23)])
Max.TemperatureF Mean.TemperatureF
54.838 45.781
Min.TemperatureF Max.Dew.PointF
36.255 41.800
MeanDew.PointF Min.DewpointF
36.395 30.156
Max.Humidity Mean.Humidity
88.082 70.392
Min.Humidity Max.Sea.Level.PressureIn
52.200 30.130
Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
30.015 29.904
Max.VisibilityMiles Mean.VisibilityMiles
9.896 8.249
Min.VisibilityMiles Max.Wind.SpeedMPH
4.825 19.101
Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
8.679 NA
CloudCover WindDirDegrees
4.367 205.000
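The conversion mentioned above, replacing the trace code T with "0" before converting to numeric, might be sketched as follows (the sample values are made up):

```r
# Replace the "T" (trace) code with "0", then convert the character
# vector to numeric.
precip <- c("0.08", "T", "0.00", "0.12", "T")
precip[precip == "T"] <- "0"
precip <- as.numeric(precip)
precip
```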
6.3.2 The apply() Function
R also has functions rowMeans(), colSums(), and rowSums(). But what if we want to compute the median or standard deviation of columns of data, or some other summary statistic? For this the apply() function can be used. This function applies a user-chosen function to either the rows or columns of a data frame. The arguments are:
- X: the data frame of interest
- MARGIN: specifying either rows (MARGIN = 1) or columns (MARGIN = 2)
- FUN: the function to be applied.
For example, to compute the standard deviation of each numeric column:
> apply(X = WeatherKLAN2014Full[, c(2:19, 21, 23)], MARGIN = 2, FUN = sd)
Max.TemperatureF Mean.TemperatureF
22.2130 20.9729
Min.TemperatureF Max.Dew.PointF
20.2597 19.5167
MeanDew.PointF Min.DewpointF
20.0311 20.8511
Max.Humidity Mean.Humidity
8.1910 9.3660
Min.Humidity Max.Sea.Level.PressureIn
13.9462 0.2032
Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
0.2159 0.2360
Max.VisibilityMiles Mean.VisibilityMiles
0.5790 2.1059
Min.VisibilityMiles Max.Wind.SpeedMPH
3.8168 6.4831
Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
3.8863 NA
CloudCover WindDirDegrees
2.7798 90.0673
As with any R function, the arguments don't need to be named as long as they are specified in the correct order, so
> apply(WeatherKLAN2014Full[, c(2:19, 21, 23)], 2, sd)
has the same result.
6.3.3 Practice Problems
- Notice the output value of NA for the column Max.Gust.SpeedMPH. Why does this happen? Figure out a way to make the apply() function return a numeric value for this column.
- The apply() family of functions is extremely important in R, so it gets two Practice Problems :) Use the apply() function to compute the median values for all numeric columns in the iris data set.
6.3.4 Saving Typing Using with()
Consider calculating the mean of the maximum temperature values for those days where the cloud cover is less than 4 and when the maximum humidity is over 85. We can do this using subsetting.
> mean(WeatherKLAN2014Full$Max.TemperatureF[
+ WeatherKLAN2014Full$CloudCover < 4 &
+ WeatherKLAN2014Full$Max.Humidity > 85])
[1] 69.39
While this works, it requires a lot of typing, since each time we refer to a variable in the data set we must preface its name with WeatherKLAN2014Full$. The with() function tells R that we are working with a particular data frame, so we don't need to keep typing the data frame's name.
> with(WeatherKLAN2014Full,
+      mean(Max.TemperatureF[CloudCover < 4 & Max.Humidity > 85]))
[1] 69.39
6.4 Transforming a Data Frame
Variables are often added to, removed from, changed in, or rearranged within a data frame. The subsetting features of R make this reasonably easy. We will investigate this in the context of the gapminder data frame. If the gapminder package is not yet installed, use install.packages("gapminder") to install it locally.
> library(gapminder)
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
6.4.1 Adding Variables
The data frame contains per capita GDP and population, and it might be interesting to create a variable that gives the total GDP by multiplying these two variables. (If we were interested in an accurate value for the total GDP we would probably be better off getting this information directly, since the per capita GDP values in the data frame are likely rounded substantially.)
> gapminder$TotalGDP <- gapminder$gdpPercap * gapminder$pop
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 7 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
Analogous to the with() function, there is a function within() that can simplify the syntax. Whereas with() does not change the data frame, within() can. Note, below I first remove the altered gapminder data frame using rm() and then bring a clean copy back in by reloading the gapminder package.
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
> gapminder <- within(gapminder, TotalGDP <- gdpPercap * pop)
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 7 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
A nice feature of within() is its ability to add more than one variable at a time to a data frame. In this case the two or more expressions creating new variables must be enclosed in braces.
> gapminder <- within(gapminder, {TotalGDP <- gdpPercap * pop
+ lifeExpMonths <- lifeExp * 12})
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap : num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
$ lifeExpMonths: num 346 364 384 408 433 ...
6.4.2 Removing Variables
After reflection we may realize the new variables we added to the gapminder data frame are not useful and should be removed.
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap : num 779 821 853 836 740 ...
$ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
$ lifeExpMonths: num 346 364 384 408 433 ...
> gapminder <- gapminder[1:6]
> str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
The same result could be obtained via gapminder <- gapminder[, 1:6]. The first method uses the fact that a data frame is also a list, and uses list subsetting methods. It is slightly preferable, since even if only one variable is retained, the object will still be a data frame, while the second method can return a vector in that case. Note this difference in the resulting x variable below (again, this behavior can be frustrating at times if it is not anticipated).
> a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"),
+                 z = c(1, 1.5, 2), stringsAsFactors = FALSE)
> a
x y z
1 1 dog 1.0
2 2 cat 1.5
3 3 pig 2.0
> a["x"]
x
1 1
2 2
3 3
> a
x y z
1 1 dog 1.0
2 2 cat 1.5
3 3 pig 2.0
> a[, "x"]
[1] 1 2 3
One can also use a negative sign in front of the variable number(s). For example, a[-(2:3)] would drop the second and third columns of a. Some care is needed when removing variables using the negative sign. An alternative approach is to set the variables you'd like to remove to NULL. For example, a[c("y","z")] <- NULL and a[,2:3] <- NULL produce the same result as above.
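A minimal sketch of the NULL approach on a small toy data frame:

```r
# Assigning NULL removes the named columns; the result is still a
# data frame containing the remaining column(s).
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"), z = c(1, 1.5, 2),
                stringsAsFactors = FALSE)
a[c("y", "z")] <- NULL
str(a)
```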
6.4.3 Practice Problem
What happens if you write a[-2:3] instead of a[-(2:3)]? Why are the parentheses important here?
6.4.4 Transforming Variables
Consider the gapminder data again. Possibly we don’t want to add a new variable that gives life expectancy in months, but rather want to modify the existing variable to measure life expectancy in months. Here are two ways to accomplish this.
[1] 28.80 30.33 32.00 34.02 36.09
[1] 345.6 364.0 384.0 408.2 433.1
[1] 28.80 30.33 32.00 34.02 36.09
[1] 345.6 364.0 384.0 408.2 433.1
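The code that produced the output above is not shown here; a minimal sketch of the two approaches, using a toy data frame in place of gapminder, is:

```r
# Toy stand-in for gapminder (made-up values).
d <- data.frame(lifeExp = c(28.80, 30.33, 32.00))

# Way 1: overwrite the column directly with $.
d1 <- d
d1$lifeExp <- d1$lifeExp * 12

# Way 2: within() evaluates the expression inside the data frame
# and returns the modified copy.
d2 <- within(d, lifeExp <- lifeExp * 12)
```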
6.5 Rearranging Variables
Consider the full weather data set again.
> u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
> WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE,
+ stringsAsFactors = FALSE,
+ na.strings = "")
> names(WeatherKLAN2014Full)
[1] "EST"
[2] "Max.TemperatureF"
[3] "Mean.TemperatureF"
[4] "Min.TemperatureF"
[5] "Max.Dew.PointF"
[6] "MeanDew.PointF"
[7] "Min.DewpointF"
[8] "Max.Humidity"
[9] "Mean.Humidity"
[10] "Min.Humidity"
[11] "Max.Sea.Level.PressureIn"
[12] "Mean.Sea.Level.PressureIn"
[13] "Min.Sea.Level.PressureIn"
[14] "Max.VisibilityMiles"
[15] "Mean.VisibilityMiles"
[16] "Min.VisibilityMiles"
[17] "Max.Wind.SpeedMPH"
[18] "Mean.Wind.SpeedMPH"
[19] "Max.Gust.SpeedMPH"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
If we want the wind speed variables to come right after the date, we can again use subsetting.
> WeatherKLAN2014Full <- WeatherKLAN2014Full[c(1, 17, 18,
+ 19, 2:16, 20:23)]
> names(WeatherKLAN2014Full)
[1] "EST"
[2] "Max.Wind.SpeedMPH"
[3] "Mean.Wind.SpeedMPH"
[4] "Max.Gust.SpeedMPH"
[5] "Max.TemperatureF"
[6] "Mean.TemperatureF"
[7] "Min.TemperatureF"
[8] "Max.Dew.PointF"
[9] "MeanDew.PointF"
[10] "Min.DewpointF"
[11] "Max.Humidity"
[12] "Mean.Humidity"
[13] "Min.Humidity"
[14] "Max.Sea.Level.PressureIn"
[15] "Mean.Sea.Level.PressureIn"
[16] "Min.Sea.Level.PressureIn"
[17] "Max.VisibilityMiles"
[18] "Mean.VisibilityMiles"
[19] "Min.VisibilityMiles"
[20] "PrecipitationIn"
[21] "CloudCover"
[22] "Events"
[23] "WindDirDegrees"
6.6 Reshaping Data
A data set can be represented in several different formats. Consider a (fictitious) data set on incomes of three people during three different years. Here is one representation of the data:
name income1990 income2000 income2010
1 John Smith 29784 39210 41213
2 Jane Doe 56789 89321 109321
3 Albert Jones 2341 34567 56781
Here is another representation of the same data:
name year income
1 John Smith income1990 29784
2 Jane Doe income1990 56789
3 Albert Jones income1990 2341
4 John Smith income2000 39210
5 Jane Doe income2000 89321
6 Albert Jones income2000 34567
7 John Smith income2010 41213
8 Jane Doe income2010 109321
9 Albert Jones income2010 56781
For hopefully obvious reasons, the first representation is called a wide representation of the data, and the second is called a long representation. Each has its merits. The first representation is probably easier for people to read, while the second is often the form needed for analysis by statistical software such as R. There are of course other representations. For example the rows and columns could be interchanged to create a different wide representation, or the long representation, which currently groups data by year, could group by name instead.
Whatever the relative merits of wide and long representations of data, transforming data from wide to long or from long to wide is often required. As with many tasks, there are several ways to accomplish this in R. We will focus on the tidyr package, written by Hadley Wickham, which performs these transformations and more.
6.6.1 tidyr
The R package tidyr has functions for converting data between formats. To illustrate its use, we examine a simple data set that explores the relationship between religion and income in the United States. The data come from a Pew survey, and are used in the tidyr documentation to illustrate transforming data from wide to long format.
> u.rel <- "https://www.finley-lab.com/files/data/religion2.csv"
> religion <- read.csv(u.rel, header = TRUE, stringsAsFactors = FALSE)
> head(religion)
religion under10k btw10and20k btw20and30k
1 Agnostic 27 34 60
2 Atheist 12 27 37
3 Buddhist 27 21 30
4 Catholic 418 617 732
5 DoNotKnowOrRefused 15 14 15
6 EvangelicalProt 575 869 1064
btw30and40k btw40and50k btw50and75k btw75and100k
1 81 76 137 122
2 52 35 70 73
3 34 33 58 62
4 670 638 1116 949
5 11 10 35 21
6 982 881 1486 949
btw100and150k over150k DoNotKnowOrRefused
1 109 84 96
2 59 74 76
3 39 53 54
4 792 633 1489
5 17 18 116
6 723 414 1529
As given, the columns include religion and income level, and there are counts for each of the combinations of religion and income level. For example, there are 27 people who are Agnostic and whose income is less than 10 thousand dollars, and there are 617 people who are Catholic and whose income is between 10 and 20 thousand dollars.
The gather() function can transform data from wide to long format.
> library(tidyr)
> religionLong <- gather(data = religion, key = IncomeLevel,
+ value = Frequency, 2:11)
> head(religionLong)
religion IncomeLevel Frequency
1 Agnostic under10k 27
2 Atheist under10k 12
3 Buddhist under10k 27
4 Catholic under10k 418
5 DoNotKnowOrRefused under10k 15
6 EvangelicalProt under10k 575
religion IncomeLevel Frequency
175 Muslim DoNotKnowOrRefused 22
176 Orthodox DoNotKnowOrRefused 73
177 OtherChristian DoNotKnowOrRefused 18
178 OtherFaiths DoNotKnowOrRefused 71
179 OtherWorldReligions DoNotKnowOrRefused 8
180 Unaffiliated DoNotKnowOrRefused 597
To use gather() we specified the data frame (data = religion), the name we want to give to the column created from the income levels (key = IncomeLevel), the name we want to give to the column containing the frequency values (value = Frequency), and the columns to gather (2:11).
Columns to be gathered can also be specified by name, and we can specify which columns to omit by putting a negative sign in front of the name(s). So the following creates an equivalent data frame:
> religionLong <- gather(data = religion, key = IncomeLevel,
+ value = Frequency, -religion)
> head(religionLong)
religion IncomeLevel Frequency
1 Agnostic under10k 27
2 Atheist under10k 12
3 Buddhist under10k 27
4 Catholic under10k 418
5 DoNotKnowOrRefused under10k 15
6 EvangelicalProt under10k 575
The spread() function is the inverse of gather(), transforming data from long back to wide format.
> religionWide <- spread(data = religionLong, key = IncomeLevel,
+ value = Frequency)
> head(religionWide)
religion btw100and150k btw10and20k
1 Agnostic 109 34
2 Atheist 59 27
3 Buddhist 39 21
4 Catholic 792 617
5 DoNotKnowOrRefused 17 14
6 EvangelicalProt 723 869
btw20and30k btw30and40k btw40and50k btw50and75k
1 60 81 76 137
2 37 52 35 70
3 30 34 33 58
4 732 670 638 1116
5 15 11 10 35
6 1064 982 881 1486
btw75and100k DoNotKnowOrRefused over150k under10k
1 122 96 84 27
2 73 76 74 12
3 62 54 53 27
4 949 1489 633 418
5 21 116 18 15
6 949 1529 414 575
Here we specify the data frame (religionLong), the key column (IncomeLevel) whose values become the new column names, and the column of values (Frequency) to be spread among the newly created columns. As can be seen, this particular call to spread() recovers the original data frame.
tidyr provides two other useful functions to separate and unite variables based on a delimiter. Consider again the yearlyIncomeLong table. Say we want to split the name variable into first and last name. This can be done using the separate() function.
> firstLast <- separate(data = yearlyIncomeLong, col = name,
+ into = c("first", "last"), sep="\\s")
> print(firstLast)
first last year income
1 John Smith income1990 29784
2 Jane Doe income1990 56789
3 Albert Jones income1990 2341
4 John Smith income2000 39210
5 Jane Doe income2000 89321
6 Albert Jones income2000 34567
7 John Smith income2010 41213
8 Jane Doe income2010 109321
9 Albert Jones income2010 56781
Now say you're not happy with that and you want to combine the name columns again, but this time separate the first and last name with an underscore. This is done using the unite() function.
> unite(data = firstLast, col = name, first, last, sep = "_")
name year income
1 John_Smith income1990 29784
2 Jane_Doe income1990 56789
3 Albert_Jones income1990 2341
4 John_Smith income2000 39210
5 Jane_Doe income2000 89321
6 Albert_Jones income2000 34567
7 John_Smith income2010 41213
8 Jane_Doe income2010 109321
9 Albert_Jones income2010 56781
6.6.2 Practice Problem
The separate() function from the tidyr package is especially useful when working with real data, as multiple pieces of information are often combined into one column of a data set. For example, consider the following data frame
birds recordingInfo
1 10 LA-2017-01-01
2 38 DF-2011-03-02
3 29 OG-2078-05-11
4 88 YA-2000-11-18
5 42 LA-2019-03-17
6 177 OG-2016-10-10
7 200 YA-2001-03-22
with a column recordingInfo that contains the site where the data were collected, as well as the year, month, and day the data were recorded. The data are coded as follows: site-year-month-day. Write a line of code that extracts the desired data from the data.birds data frame into separate columns named site, year, month, and day.
6.7 Manipulating Data with dplyr
Much of the effort (a figure of 80% is sometimes suggested) in data analysis is spent cleaning the data and getting it ready for analysis. Having effective tools for this task can save substantial time and effort. The R package dplyr
written by Hadley Wickham is designed, in Hadley’s words, to be “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Casting data analysis tasks in terms of “grammar” should be familiar from our work with the ggplot2
package, which was also authored by Hadley. Functions provided by dplyr
do in fact capture key data analysis actions (i.e., verbs). These functions include
- mutate() adds new variables that are functions of existing variables
- select() picks variables based on their names
- filter() picks cases based on their values
- summarize() reduces multiple values down to a single summary
- arrange() changes the ordering of the rows.
These all combine naturally with a group_by() function that allows you to perform any operation grouped by the values of one or more variables. All the tasks done using dplyr can be accomplished using tools already covered in this text; however, dplyr's functions provide a potentially more efficient and convenient framework. RStudio provides a convenient data wrangling cheat sheet that covers many aspects of the tidyr and dplyr packages.
This somewhat long section on dplyr
adapts the nice introduction by Jenny Bryan, available at http://stat545-ubc.github.io/block010_dplyr-end-single-table.html.
6.7.1 Improved Data Frames
The dplyr
package provides a couple of functions that offer improvements on data frames. First, tibble
creates a tibble from a series of vectors34. A tibble has two advantages over a data frame: when printed, a tibble shows only the first ten rows and the columns that fit on the page, along with some additional information about the table's dimensions, the data types of the variables, and the non-printed columns; and, while subsetting a data frame can sometimes return a vector rather than a data frame (if only one row or column results from the subset), a tibble does not have this behavior. Second, we can use the as_tibble()
function to convert an existing data frame or list into a tibble. Here is an example using the religionWide
data frame.35
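The output below corresponds to commands along these lines (a sketch reconstructed from the printed results; the name religionTibble is illustrative):

> head(religionWide)
> religionWide[, 1]
> religionTibble <- as_tibble(religionWide)
> head(religionTibble)
> religionTibble[, 1]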
religion btw100and150k btw10and20k
1 Agnostic 109 34
2 Atheist 59 27
3 Buddhist 39 21
4 Catholic 792 617
5 DoNotKnowOrRefused 17 14
6 EvangelicalProt 723 869
btw20and30k btw30and40k btw40and50k btw50and75k
1 60 81 76 137
2 37 52 35 70
3 30 34 33 58
4 732 670 638 1116
5 15 11 10 35
6 1064 982 881 1486
btw75and100k DoNotKnowOrRefused over150k under10k
1 122 96 84 27
2 73 76 74 12
3 62 54 53 27
4 949 1489 633 418
5 21 116 18 15
6 949 1529 414 575
[1] "Agnostic" "Atheist"
[3] "Buddhist" "Catholic"
[5] "DoNotKnowOrRefused" "EvangelicalProt"
[7] "Hindu" "HistoricallyBlackProt"
[9] "JehovahsWitness" "Jewish"
[11] "MainlineProt" "Mormon"
[13] "Muslim" "Orthodox"
[15] "OtherChristian" "OtherFaiths"
[17] "OtherWorldReligions" "Unaffiliated"
# A tibble: 6 x 11
religion btw100and150k btw10and20k btw20and30k
<chr> <int> <int> <int>
1 Agnostic 109 34 60
2 Atheist 59 27 37
3 Buddhist 39 21 30
4 Catholic 792 617 732
5 DoNotKn… 17 14 15
6 Evangel… 723 869 1064
# … with 7 more variables: btw30and40k <int>,
# btw40and50k <int>, btw50and75k <int>,
# btw75and100k <int>, DoNotKnowOrRefused <int>,
# over150k <int>, under10k <int>
# A tibble: 18 x 1
religion
<chr>
1 Agnostic
2 Atheist
3 Buddhist
4 Catholic
5 DoNotKnowOrRefused
6 EvangelicalProt
7 Hindu
8 HistoricallyBlackProt
9 JehovahsWitness
10 Jewish
11 MainlineProt
12 Mormon
13 Muslim
14 Orthodox
15 OtherChristian
16 OtherFaiths
17 OtherWorldReligions
18 Unaffiliated
As seen above, once the data frame is reduced to one dimension by subsetting to a single column, it is no longer a data frame; it has been simplified to a vector. This might not seem like a big deal; however, it can be very frustrating and can break your code when you expect an object to behave like a data frame and it doesn't because it is now a vector. In contrast, once we convert religionWide
to a tibble
via the as_tibble()
function, the object remains a data frame even when subset down to one dimension (there is no automatic simplification). Converting data frames using as_tibble()
is not required for using dplyr
but is convenient. Also, it is important to note that tibble
is simply a wrapper around a data frame that provides some additional behaviors. The newly formed tibble
object will still behave like a data frame (because it technically still is a data frame) but will have some added niceties (some of which are illustrated below).
6.7.2 Filtering Data by Row
Recall the gapminder
data. These data are available in tab-separated format in gapminder.tsv
, and can be read in using read.delim()
(or the related read functions described previously). The read.delim()
function defaults to header = TRUE
so this doesn’t need to be specified explicitly. In this section we will be working with the gapminder
data often, so we will use a short name for the data frame to save typing.
> u.gm <- "https://www.finley-lab.com/files/data/gapminder.tsv"
> gm <- read.delim(u.gm)
> gm <- as_tibble(gm)
> str(gm)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
# A tibble: 6 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghanistan 1952 8.43e6 Asia 28.8 779.
2 Afghanistan 1957 9.24e6 Asia 30.3 821.
3 Afghanistan 1962 1.03e7 Asia 32.0 853.
4 Afghanistan 1967 1.15e7 Asia 34.0 836.
5 Afghanistan 1972 1.31e7 Asia 36.1 740.
6 Afghanistan 1977 1.49e7 Asia 38.4 786.
Filtering helps us to examine subsets of the data such as data from a particular country, from several specified countries, from certain years, from countries with certain populations, etc. Some examples:
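For instance, the output below corresponds to calls along these lines (a sketch; the population threshold of 300 million is inferred from the printed results):

> filter(gm, country == "Brazil")
> filter(gm, country %in% c("Brazil", "Mexico"))
> filter(gm, country %in% c("Brazil", "Mexico"), year %in% c(1952, 1972))
> filter(gm, pop > 300000000)
> filter(gm, year == 2007, pop > 300000000)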
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1957 65551171 Americas 53.3 2487.
3 Brazil 1962 76039390 Americas 55.7 3337.
4 Brazil 1967 88049823 Americas 57.6 3430.
5 Brazil 1972 100840058 Americas 59.5 4986.
6 Brazil 1977 114313951 Americas 61.5 6660.
7 Brazil 1982 128962939 Americas 63.3 7031.
8 Brazil 1987 142938076 Americas 65.2 7807.
9 Brazil 1992 155975974 Americas 67.1 6950.
10 Brazil 1997 168546719 Americas 69.4 7958.
11 Brazil 2002 179914212 Americas 71.0 8131.
12 Brazil 2007 190010647 Americas 72.4 9066.
# A tibble: 24 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1957 65551171 Americas 53.3 2487.
3 Brazil 1962 76039390 Americas 55.7 3337.
4 Brazil 1967 88049823 Americas 57.6 3430.
5 Brazil 1972 100840058 Americas 59.5 4986.
6 Brazil 1977 114313951 Americas 61.5 6660.
7 Brazil 1982 128962939 Americas 63.3 7031.
8 Brazil 1987 142938076 Americas 65.2 7807.
9 Brazil 1992 155975974 Americas 67.1 6950.
10 Brazil 1997 168546719 Americas 69.4 7958.
# … with 14 more rows
# A tibble: 4 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Brazil 1952 56602560 Americas 50.9 2109.
2 Brazil 1972 100840058 Americas 59.5 4986.
3 Mexico 1952 30144317 Americas 50.8 3478.
4 Mexico 1972 55984294 Americas 62.4 6809.
# A tibble: 25 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 China 1952 5.56e8 Asia 44 400.
2 China 1957 6.37e8 Asia 50.5 576.
3 China 1962 6.66e8 Asia 44.5 488.
4 China 1967 7.55e8 Asia 58.4 613.
5 China 1972 8.62e8 Asia 63.1 677.
6 China 1977 9.43e8 Asia 64.0 741.
7 China 1982 1.00e9 Asia 65.5 962.
8 China 1987 1.08e9 Asia 67.3 1379.
9 China 1992 1.16e9 Asia 68.7 1656.
10 China 1997 1.23e9 Asia 70.4 2289.
# … with 15 more rows
# A tibble: 3 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 China 2007 1.32e9 Asia 73.0 4959.
2 India 2007 1.11e9 Asia 64.7 2452.
3 United St… 2007 3.01e8 Americas 78.2 42952.
Notice the full results are not printed. For example, when we asked for the data for Brazil and Mexico, only the first ten rows were printed. This is an effect of using the as_tibble()
function. Of course if we wanted to analyze the results (as we will below) the full set of data would be available.
6.7.3 Selecting Variables by Column
Continuing with the gapminder
data, another common task is to restrict attention to some subset of variables in the data set. The select()
function does this.
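The output below corresponds to calls along these lines (a sketch reconstructed from the printed results):

> select(gm, country, year, lifeExp)
> select(gm, 2:4)
> select(gm, -c(year, pop, continent))
> select(gm, starts_with("co"))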
# A tibble: 1,704 x 3
country year lifeExp
<fct> <int> <dbl>
1 Afghanistan 1952 28.8
2 Afghanistan 1957 30.3
3 Afghanistan 1962 32.0
4 Afghanistan 1967 34.0
5 Afghanistan 1972 36.1
6 Afghanistan 1977 38.4
7 Afghanistan 1982 39.9
8 Afghanistan 1987 40.8
9 Afghanistan 1992 41.7
10 Afghanistan 1997 41.8
# … with 1,694 more rows
# A tibble: 1,704 x 3
year pop continent
<int> <dbl> <fct>
1 1952 8425333 Asia
2 1957 9240934 Asia
3 1962 10267083 Asia
4 1967 11537966 Asia
5 1972 13079460 Asia
6 1977 14880372 Asia
7 1982 12881816 Asia
8 1987 13867957 Asia
9 1992 16317921 Asia
10 1997 22227415 Asia
# … with 1,694 more rows
# A tibble: 1,704 x 3
country lifeExp gdpPercap
<fct> <dbl> <dbl>
1 Afghanistan 28.8 779.
2 Afghanistan 30.3 821.
3 Afghanistan 32.0 853.
4 Afghanistan 34.0 836.
5 Afghanistan 36.1 740.
6 Afghanistan 38.4 786.
7 Afghanistan 39.9 978.
8 Afghanistan 40.8 852.
9 Afghanistan 41.7 649.
10 Afghanistan 41.8 635.
# … with 1,694 more rows
# A tibble: 1,704 x 2
country continent
<fct> <fct>
1 Afghanistan Asia
2 Afghanistan Asia
3 Afghanistan Asia
4 Afghanistan Asia
5 Afghanistan Asia
6 Afghanistan Asia
7 Afghanistan Asia
8 Afghanistan Asia
9 Afghanistan Asia
10 Afghanistan Asia
# … with 1,694 more rows
Notice a few things. Variables can be selected by name or column number. As usual, a negative sign tells R to leave something out. And there are special functions such as starts_with
that provide ways to match part of a variable’s name.
6.7.4 Practice Problem
Use the contains()
function to select only the columns whose names contain a c.
6.7.5 Pipes
Consider selecting the country, year, and population for countries in Asia or Europe. One possibility is to nest a filter()
function inside a select()
function.
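A nested call along these lines (a sketch) produces the output below:

> select(filter(gm, continent %in% c("Asia", "Europe")), country, year, pop)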
# A tibble: 756 x 3
country year pop
<fct> <int> <dbl>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
9 Afghanistan 1992 16317921
10 Afghanistan 1997 22227415
# … with 746 more rows
Even a two-step process like this becomes hard to follow in this nested form, and often we will want to perform more than two operations. There is a nice feature in dplyr
that allows us to “feed” results of one function into the first argument of a subsequent function. Another way of saying this is that we are “piping” the results into another function. The %>%
operator does the piping. Here we again restrict attention to country, year, and population for countries in Asia or Europe36.
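A piped version along these lines (a sketch) produces the output below:

> gm %>%
+   filter(continent %in% c("Asia", "Europe")) %>%
+   select(country, year, pop)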
# A tibble: 756 x 3
country year pop
<fct> <int> <dbl>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
9 Afghanistan 1992 16317921
10 Afghanistan 1997 22227415
# … with 746 more rows
It can help to think of %>%
as representing the word “then”. The above can be read as, “Start with the data frame gm
then filter it to select data from the continents Asia and Europe then select the variables country, year, and population from these data”.
The pipe operator %>%
is not restricted to functions in dplyr
. In fact the pipe operator itself was introduced in another package called magrittr
, but is included in dplyr
as a convenience.
6.7.6 Arranging Data by Row
By default the gapminder
data are arranged by country and then by year.
# A tibble: 15 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghanist… 1952 8.43e6 Asia 28.8 779.
2 Afghanist… 1957 9.24e6 Asia 30.3 821.
3 Afghanist… 1962 1.03e7 Asia 32.0 853.
4 Afghanist… 1967 1.15e7 Asia 34.0 836.
5 Afghanist… 1972 1.31e7 Asia 36.1 740.
6 Afghanist… 1977 1.49e7 Asia 38.4 786.
7 Afghanist… 1982 1.29e7 Asia 39.9 978.
8 Afghanist… 1987 1.39e7 Asia 40.8 852.
9 Afghanist… 1992 1.63e7 Asia 41.7 649.
10 Afghanist… 1997 2.22e7 Asia 41.8 635.
11 Afghanist… 2002 2.53e7 Asia 42.1 727.
12 Afghanist… 2007 3.19e7 Asia 43.8 975.
13 Albania 1952 1.28e6 Europe 55.2 1601.
14 Albania 1957 1.48e6 Europe 59.3 1942.
15 Albania 1962 1.73e6 Europe 64.8 2313.
Possibly we would want the data arranged by year and then by country. The arrange()
function makes this easy. We will again use pipes.
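A call along these lines (a sketch) produces the output below:

> gm %>%
+   arrange(year, country)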
# A tibble: 1,704 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghanist… 1952 8.43e6 Asia 28.8 779.
2 Albania 1952 1.28e6 Europe 55.2 1601.
3 Algeria 1952 9.28e6 Africa 43.1 2449.
4 Angola 1952 4.23e6 Africa 30.0 3521.
5 Argentina 1952 1.79e7 Americas 62.5 5911.
6 Australia 1952 8.69e6 Oceania 69.1 10040.
7 Austria 1952 6.93e6 Europe 66.8 6137.
8 Bahrain 1952 1.20e5 Asia 50.9 9867.
9 Bangladesh 1952 4.69e7 Asia 37.5 684.
10 Belgium 1952 8.73e6 Europe 68 8343.
# … with 1,694 more rows
How about the data for Rwanda, arranged in order of life expectancy?
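A call along these lines (a sketch) produces the output below:

> gm %>%
+   filter(country == "Rwanda") %>%
+   arrange(lifeExp)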
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Rwanda 1992 7290203 Africa 23.6 737.
2 Rwanda 1997 7212583 Africa 36.1 590.
3 Rwanda 1952 2534927 Africa 40 493.
4 Rwanda 1957 2822082 Africa 41.5 540.
5 Rwanda 1962 3051242 Africa 43 597.
6 Rwanda 2002 7852401 Africa 43.4 786.
7 Rwanda 1987 6349365 Africa 44.0 848.
8 Rwanda 1967 3451079 Africa 44.1 511.
9 Rwanda 1972 3992121 Africa 44.6 591.
10 Rwanda 1977 4657072 Africa 45 670.
11 Rwanda 1982 5507565 Africa 46.2 882.
12 Rwanda 2007 8860588 Africa 46.2 863.
Possibly we want these data to be in decreasing (descending) order. Here, desc()
is one of many dplyr
helper functions.
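For example (a sketch of the call that produced the output below):

> gm %>%
+   filter(country == "Rwanda") %>%
+   arrange(desc(lifeExp))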
# A tibble: 12 x 6
country year pop continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Rwanda 2007 8860588 Africa 46.2 863.
2 Rwanda 1982 5507565 Africa 46.2 882.
3 Rwanda 1977 4657072 Africa 45 670.
4 Rwanda 1972 3992121 Africa 44.6 591.
5 Rwanda 1967 3451079 Africa 44.1 511.
6 Rwanda 1987 6349365 Africa 44.0 848.
7 Rwanda 2002 7852401 Africa 43.4 786.
8 Rwanda 1962 3051242 Africa 43 597.
9 Rwanda 1957 2822082 Africa 41.5 540.
10 Rwanda 1952 2534927 Africa 40 493.
11 Rwanda 1997 7212583 Africa 36.1 590.
12 Rwanda 1992 7290203 Africa 23.6 737.
Possibly we want to include only the year and life expectancy, to make the message more stark.
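For example (a sketch of the call that produced the output below):

> gm %>%
+   filter(country == "Rwanda") %>%
+   select(year, lifeExp) %>%
+   arrange(desc(lifeExp))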
# A tibble: 12 x 2
year lifeExp
<int> <dbl>
1 2007 46.2
2 1982 46.2
3 1977 45
4 1972 44.6
5 1967 44.1
6 1987 44.0
7 2002 43.4
8 1962 43
9 1957 41.5
10 1952 40
11 1997 36.1
12 1992 23.6
For analyzing data in R, the order shouldn’t matter. But for presentation to human eyes, the order is important.
6.7.7 Practice Problem
It is worth your while to get comfortable with using pipes. Here is some hard-to-read code. Convert it into easier-to-read code by using pipes.
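A nested call along these lines (a reconstruction from the output below; the original chunk is not shown) is what needs untangling:

> arrange(select(filter(gm, country == "Afghanistan"), year, lifeExp), desc(year))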
# A tibble: 12 x 2
year lifeExp
<int> <dbl>
1 2007 43.8
2 2002 42.1
3 1997 41.8
4 1992 41.7
5 1987 40.8
6 1982 39.9
7 1977 38.4
8 1972 36.1
9 1967 34.0
10 1962 32.0
11 1957 30.3
12 1952 28.8
6.7.8 Renaming Variables
The dplyr
package has a rename
function that makes renaming variables in a data frame quite easy.
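For example, commands along these lines produce the output below (a sketch; note the result is assigned back to gm, since later chunks refer to the population column):

> gm <- rename(gm, population = pop)
> head(gm)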
# A tibble: 6 x 6
country year population continent lifeExp gdpPercap
<fct> <int> <dbl> <fct> <dbl> <dbl>
1 Afghani… 1952 8425333 Asia 28.8 779.
2 Afghani… 1957 9240934 Asia 30.3 821.
3 Afghani… 1962 10267083 Asia 32.0 853.
4 Afghani… 1967 11537966 Asia 34.0 836.
5 Afghani… 1972 13079460 Asia 36.1 740.
6 Afghani… 1977 14880372 Asia 38.4 786.
6.7.9 Data Summaries and Grouping
The summarize()
function computes summary statistics, using user-provided functions, for one or more columns of data in a data frame.
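The two (identical) outputs below correspond to calling summarize() directly and via a pipe (a sketch):

> summarize(gm, meanpop = mean(population), medpop = median(population))
> gm %>%
+   summarize(meanpop = mean(population), medpop = median(population))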
# A tibble: 1 x 2
meanpop medpop
<dbl> <dbl>
1 29601212. 7023596.
# A tibble: 1 x 2
meanpop medpop
<dbl> <dbl>
1 29601212. 7023596.
Often we want summaries for specific components of the data. For example, we might want the median life expectancy for each continent separately. One option is subsetting:
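For example (a sketch; one call per continent, in the order of the output below):

> median(gm$lifeExp[gm$continent == "Africa"])
> median(gm$lifeExp[gm$continent == "Asia"])
> median(gm$lifeExp[gm$continent == "Europe"])
> median(gm$lifeExp[gm$continent == "Americas"])
> median(gm$lifeExp[gm$continent == "Oceania"])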
[1] 47.79
[1] 61.79
[1] 72.24
[1] 67.05
[1] 73.66
The group_by()
function makes this easier, and makes the output more useful.
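A call along these lines (a sketch) produces the output below:

> gm %>%
+   group_by(continent) %>%
+   summarize(medLifeExp = median(lifeExp))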
# A tibble: 5 x 2
continent medLifeExp
<fct> <dbl>
1 Africa 47.8
2 Americas 67.0
3 Asia 61.8
4 Europe 72.2
5 Oceania 73.7
Or if we want the results ordered by the median life expectancy:
> gm %>%
+ group_by(continent) %>%
+ summarize(medLifeExp = median(lifeExp)) %>%
+ arrange(medLifeExp)
# A tibble: 5 x 2
continent medLifeExp
<fct> <dbl>
1 Africa 47.8
2 Asia 61.8
3 Americas 67.0
4 Europe 72.2
5 Oceania 73.7
As another example, we calculate the number of observations we have per continent (using the n()
helper function), and then, within each continent, how many distinct countries are represented (using n_distinct()
).
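Calls along these lines (a sketch) produce the two outputs below:

> gm %>%
+   group_by(continent) %>%
+   summarize(numObs = n())
> gm %>%
+   group_by(continent) %>%
+   summarize(n_obs = n(), n_countries = n_distinct(country))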
# A tibble: 5 x 2
continent numObs
<fct> <int>
1 Africa 624
2 Americas 300
3 Asia 396
4 Europe 360
5 Oceania 24
# A tibble: 5 x 3
continent n_obs n_countries
<fct> <int> <int>
1 Africa 624 52
2 Americas 300 25
3 Asia 396 33
4 Europe 360 30
5 Oceania 24 2
Here is a somewhat more involved example that calculates the minimum and maximum life expectancies for countries in Africa by year.
> gm %>%
+ filter(continent == "Africa") %>%
+ group_by(year) %>%
+ summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))
# A tibble: 12 x 3
year min_lifeExp max_lifeExp
<int> <dbl> <dbl>
1 1952 30 52.7
2 1957 31.6 58.1
3 1962 32.8 60.2
4 1967 34.1 61.6
5 1972 35.4 64.3
6 1977 36.8 67.1
7 1982 38.4 69.9
8 1987 39.9 71.9
9 1992 23.6 73.6
10 1997 36.1 74.8
11 2002 39.2 75.7
12 2007 39.6 76.4
This is interesting, but the results don't include the countries that achieved the minimum and maximum life expectancies. Here is one way to include them. We will start with the minimum life expectancy. Note that the rank of the minimum value will be 1.
> gm %>%
+ select(country, continent, year, lifeExp) %>%
+ group_by(year) %>%
+ arrange(year) %>%
+ filter(rank(lifeExp) == 1)
# A tibble: 12 x 4
# Groups: year [12]
country continent year lifeExp
<fct> <fct> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Afghanistan Asia 1957 30.3
3 Afghanistan Asia 1962 32.0
4 Afghanistan Asia 1967 34.0
5 Sierra Leone Africa 1972 35.4
6 Cambodia Asia 1977 31.2
7 Sierra Leone Africa 1982 38.4
8 Angola Africa 1987 39.9
9 Rwanda Africa 1992 23.6
10 Rwanda Africa 1997 36.1
11 Zambia Africa 2002 39.2
12 Swaziland Africa 2007 39.6
Next we add the maximum life expectancy. Here we need to better understand the desc()
function, which transforms a vector into a numeric vector whose ascending sort order corresponds to the original values in descending order (for numeric input it simply negates the values). Here are some examples.
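The three outputs below correspond to calls along these lines (a sketch inferred from the printed results):

> desc(1:5)
> desc(c(2, 3, 1, 5, 6, -4))
> desc(c(1, 3, 2, 5, 4))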
[1] -1 -2 -3 -4 -5
[1] -2 -3 -1 -5 -6 4
[1] -1 -3 -2 -5 -4
We now use this to extract the maximum life expectancy. Recall that |
represents “or”. Also by default only the first few rows of a tibble
object will be printed. To see all the rows we pipe the output to print(n = 24)
to ask for all 24 rows to be printed.
> gm %>%
+ select(country, continent, year, lifeExp) %>%
+ group_by(year) %>%
+ arrange(year) %>%
+ filter(rank(lifeExp) == 1 | rank(desc(lifeExp)) == 1) %>%
+ print(n=24)
# A tibble: 24 x 4
# Groups: year [12]
country continent year lifeExp
<fct> <fct> <int> <dbl>
1 Afghanistan Asia 1952 28.8
2 Norway Europe 1952 72.7
3 Afghanistan Asia 1957 30.3
4 Iceland Europe 1957 73.5
5 Afghanistan Asia 1962 32.0
6 Iceland Europe 1962 73.7
7 Afghanistan Asia 1967 34.0
8 Sweden Europe 1967 74.2
9 Sierra Leone Africa 1972 35.4
10 Sweden Europe 1972 74.7
11 Cambodia Asia 1977 31.2
12 Iceland Europe 1977 76.1
13 Japan Asia 1982 77.1
14 Sierra Leone Africa 1982 38.4
15 Angola Africa 1987 39.9
16 Japan Asia 1987 78.7
17 Japan Asia 1992 79.4
18 Rwanda Africa 1992 23.6
19 Japan Asia 1997 80.7
20 Rwanda Africa 1997 36.1
21 Japan Asia 2002 82
22 Zambia Africa 2002 39.2
23 Japan Asia 2007 82.6
24 Swaziland Africa 2007 39.6
6.7.10 Creating New Variables
The $
notation provides a simple way to create new variables in a data frame. The mutate()
function provides another, sometimes cleaner way to do this. We will use mutate()
along with the lag()
function to investigate changes in life expectancy over five years for the gapminder
data. We’ll do this in a few steps. First, we create a variable that measures the change in life expectancy and remove the population and GDP variables that are not of interest. We have to be careful to first group by country, since we want to calculate the change in life expectancy by country.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap))
# A tibble: 1,704 x 5
# Groups: country [142]
country year continent lifeExp changeLifeExp
<fct> <int> <fct> <dbl> <dbl>
1 Afghanistan 1952 Asia 28.8 NA
2 Afghanistan 1957 Asia 30.3 1.53
3 Afghanistan 1962 Asia 32.0 1.66
4 Afghanistan 1967 Asia 34.0 2.02
5 Afghanistan 1972 Asia 36.1 2.07
6 Afghanistan 1977 Asia 38.4 2.35
7 Afghanistan 1982 Asia 39.9 1.42
8 Afghanistan 1987 Asia 40.8 0.968
9 Afghanistan 1992 Asia 41.7 0.852
10 Afghanistan 1997 Asia 41.8 0.0890
# … with 1,694 more rows
Next, summarize by computing the largest drop in life expectancy.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap)) %>%
+ summarize(largestDropLifeExp = min(changeLifeExp))
# A tibble: 142 x 2
country largestDropLifeExp
<fct> <dbl>
1 Afghanistan NA
2 Albania NA
3 Algeria NA
4 Angola NA
5 Argentina NA
6 Australia NA
7 Austria NA
8 Bahrain NA
9 Bangladesh NA
10 Belgium NA
# … with 132 more rows
Oops. We forgot that since we don’t have data from before 1952, the first drop will be NA
. Let’s try again.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap)) %>%
+ summarize(largestDropLifeExp = min(changeLifeExp, na.rm = TRUE))
# A tibble: 142 x 2
country largestDropLifeExp
<fct> <dbl>
1 Afghanistan 0.0890
2 Albania -0.419
3 Algeria 1.31
4 Angola -0.036
5 Argentina 0.492
6 Australia 0.170
7 Austria 0.490
8 Bahrain 0.84
9 Bangladesh 1.67
10 Belgium 0.5
# … with 132 more rows
That’s not quite what we wanted. We could arrange the results by the life expectancy drop, but it would be good to have both the continent and year printed out also. So we’ll take a slightly different approach, by arranging the results in increasing order.
> gm %>%
+ group_by(country) %>%
+ mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>%
+ select(-c(population, gdpPercap)) %>%
+ arrange(changeLifeExp)
# A tibble: 1,704 x 5
# Groups: country [142]
country year continent lifeExp changeLifeExp
<fct> <int> <fct> <dbl> <dbl>
1 Rwanda 1992 Africa 23.6 -20.4
2 Zimbabwe 1997 Africa 46.8 -13.6
3 Lesotho 2002 Africa 44.6 -11.0
4 Swaziland 2002 Africa 43.9 -10.4
5 Botswana 1997 Africa 52.6 -10.2
6 Cambodia 1977 Asia 31.2 -9.10
7 Namibia 2002 Africa 51.5 -7.43
8 South Africa 2002 Africa 53.4 -6.87
9 Zimbabwe 2002 Africa 40.0 -6.82
10 China 1962 Asia 44.5 -6.05
# … with 1,694 more rows
That’s still not quite right. Because the data are grouped by country, R did the ordering within group. If we want to see the largest drops overall, we need to remove the grouping.
> gm %>% group_by(country) %>% mutate(changeLifeExp = lifeExp -
+ lag(lifeExp, order_by = year)) %>% select(-c(population,
+ gdpPercap)) %>% ungroup() %>% arrange(changeLifeExp) %>%
+ print(n = 20)
# A tibble: 1,704 x 5
country year continent lifeExp changeLifeExp
<fct> <int> <fct> <dbl> <dbl>
1 Rwanda 1992 Africa 23.6 -20.4
2 Zimbabwe 1997 Africa 46.8 -13.6
3 Lesotho 2002 Africa 44.6 -11.0
4 Swaziland 2002 Africa 43.9 -10.4
5 Botswana 1997 Africa 52.6 -10.2
6 Cambodia 1977 Asia 31.2 -9.10
7 Namibia 2002 Africa 51.5 -7.43
8 South Africa 2002 Africa 53.4 -6.87
9 Zimbabwe 2002 Africa 40.0 -6.82
10 China 1962 Asia 44.5 -6.05
11 Botswana 2002 Africa 46.6 -5.92
12 Zambia 1997 Africa 40.2 -5.86
13 Iraq 1992 Asia 59.5 -5.58
14 Liberia 1992 Africa 40.8 -5.23
15 Cambodia 1972 Asia 40.3 -5.10
16 Kenya 1997 Africa 54.4 -4.88
17 Somalia 1992 Africa 39.7 -4.84
18 Zambia 1992 Africa 46.1 -4.72
19 Swaziland 2007 Africa 39.6 -4.26
20 Uganda 1997 Africa 44.6 -4.25
# … with 1,684 more rows
6.7.11 Practice Problem
As you progress in your data science career, we are sure you will find dplyr
to be one of the most useful packages for your initial data exploration. Here is one more Practice Problem to get more comfortable with the syntax.
Recall the iris
data set we have worked with multiple times. We want to look at the ratio of Sepal.Length
to Petal.Length
in the three different species. Write a series of dplyr
statements that groups the data by species, creates a new column called s.p.ratio
that is the Sepal.Length
divided by Petal.Length
, then computes the mean of this column for each species in a column called mean.ratio
. Display the data in descending order of mean.ratio
.
6.8 Exercises
Exercise 7 Learning objectives: introduce with()
, tapply()
, and cut()
functions; summarize data using the table()
function with logical subsetting; practice using factor data types.
Exercise 8 Learning objectives: work with messy data; import data from an external spreadsheet; practice using functions in tidyr
and graphing functions.
Exercise 9 Learning objectives: work with several key dplyr functions; manipulate data frames (actually tibbles); summarize and visualize data from large data files.
References
Spector, Phil. 2008. Data Manipulation with R. Use R! New York: Springer.
These data come from the MASS R library.↩
Reminds me of The Trouble with Tribbles↩
The text printed immediately after library(dplyr) means the stats and base packages, which are automatically loaded when you start R, have functions with the same names as functions in dplyr. So, for example, if you call the filter() or lag() functions, R will use dplyr's versions. Use the :: operator to explicitly identify which package's function you want to use, e.g., if you want stats's lag() then call stats::lag().↩
Notice the indentation used in the code. This is not necessary, as the code could be all on one line, but I often find it easier to read in this more organized format.↩