Chapter 4 Data Structures
A data structure is a format for organizing and storing data. The structure is designed so that data can be accessed and worked with in specific ways. Statistical software and programming languages have methods (or functions) designed to operate on different kinds of data structures.
This chapter’s focus is on data structures. To help initial understanding, the data in this chapter will be relatively modest in size and complexity. The ideas and methods, however, generalize to larger and more complex data sets.
The base data structures in R are vectors, matrices, arrays, data frames, and lists. The first three, vectors, matrices, and arrays, require all elements to be of the same type or homogeneous, e.g., all numeric or all character. Data frames and lists allow elements to be of different types or heterogeneous, e.g., some elements of a data frame may be numeric while other elements may be character. These base structures can also be organized by their dimensionality, i.e., 1-dimensional, 2-dimensional, or N-dimensional, as shown in Table 4.1.
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1 | Atomic vector | List |
2 | Matrix | Data frame |
N | Array | |
R has no scalar types, i.e., 0-dimensional. Individual numbers or strings are actually vectors of length one.
An efficient way to understand what comprises a given object is to use the `str()` function. `str()` is short for structure and prints a compact, human-readable description of any R data structure. For example, in the code below, we prove to ourselves that what we might think of as a scalar value is actually a vector of length one.
num 1
[1] TRUE
[1] 1
Here we assigned the scalar value one to `a`. The call `str(a)` prints `num 1`, which says `a` is numeric of length one. Then, just to be sure, we used the function `is.vector()` to test whether `a` is in fact a vector. Then, just for fun, we asked for the length of `a`, which again returns one. There is a set of similar logical tests for the other base data structures, e.g., `is.matrix()`, `is.array()`, `is.data.frame()`, and `is.list()`. These will all come in handy as we encounter different R objects.
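The output shown above is consistent with the following three calls. This is a minimal sketch that assumes the object is named `a`, as in the text.

```r
a <- 1          # what looks like a scalar value
str(a)          # compact description of the object: num 1
is.vector(a)    # TRUE: a is a vector
length(a)       # 1: a vector of length one
```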
4.1 Vectors
Think of a vector as a structure to represent one variable in a data set. For example, a vector might hold the weights, in pounds, of 7 people in a data set. Or another vector might hold the genders of those 7 people. The `c()` function in R is useful for creating (small) vectors and for modifying existing vectors. Think of `c` as standing for “combine”.
[1] 123 157 205 199 223 140 105
[1] "female" "female" "male" "female" "male"
[6] "male" "female"
Notice that elements of a vector are separated by commas when using the `c()` function to create a vector. Also notice that character values are placed inside quotation marks.
The `c()` function also can be used to add to an existing vector. For example, if an eighth person, a male weighing 194 pounds, was included in the data set, the existing vectors could be modified as follows.
[1] 123 157 205 199 223 140 105 194
[1] "female" "female" "male" "female" "male"
[6] "male" "female" "male"
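The code that produced the output in this section is not shown. A sketch that reproduces it, assuming the vectors are named `weight` and `gender` as in the rest of the chapter, might look like this:

```r
# Create the two vectors with c()
weight <- c(123, 157, 205, 199, 223, 140, 105)
gender <- c("female", "female", "male", "female", "male", "male", "female")

# Add an eighth person by combining the existing vector with a new value
weight <- c(weight, 194)
gender <- c(gender, "male")

weight
gender
```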
4.1.1 Types, Conversion, Coercion
Clearly it is important to distinguish between different types of vectors. For example, it makes sense to ask R to calculate the mean of the weights stored in `weight`, but it does not make sense to ask R to compute the mean of the genders stored in `gender`. Vectors in R may have one of six different “types”: character, double, integer, logical, complex, and raw. Only the first four of these will be of interest below, and the distinction between double and integer will not be of great import. To illustrate logical vectors, imagine that each of the eight people in the data set was asked whether he or she was taking blood pressure medication, and the responses were coded as `TRUE` if the person answered yes, and `FALSE` if the person answered no.
[1] "double"
[1] "character"
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
[1] "logical"
It may be surprising to see that the variable `weight` is of `double` type, even though its values all are integers. By default R creates a double type vector when numeric values are given via the `c()` function.
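As a sketch of how these types can be inspected, the following uses `typeof()` on the chapter's vectors. The last line is an addition for illustration: the `L` suffix is one way to ask for integer storage explicitly.

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

typeof(weight)          # "double", even though the values are whole numbers
typeof(gender)          # "character"
typeof(bp)              # "logical"
typeof(c(123L, 157L))   # "integer": the L suffix requests integer storage
```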
When it makes sense, it is possible to convert vectors to a different type. Consider the following examples.
[1] 123 157 205 199 223 140 105 194
[1] "integer"
[1] "123" "157" "205" "199" "223" "140" "105" "194"
[1] 0 1 0 0 1 0 1 0
Warning: NAs introduced by coercion
[1] NA NA NA NA NA NA NA NA
[1] 3
The integer version of `weight` doesn’t look any different, but it is stored differently, which can be important both for computational efficiency and for interfacing with other languages such as C++. As noted above, however, we will not worry about the distinction between integer and double types. Converting `weight` to character goes as expected: the character representations of the numbers replace the numbers themselves. Converting the logical vector `bp` to double is pretty straightforward too: `FALSE` is converted to zero, and `TRUE` is converted to one. Now think about converting the character vector `gender` to a numeric double vector. It’s not at all clear how to represent “female” and “male” as numbers. In fact, in this case R creates a numeric vector with each element set to `NA`, which is the representation of missing data, and issues a warning. Finally consider the code `sum(bp)`. Now `bp` is a logical vector, but when R sees that we are asking to sum this logical vector, it automatically converts it to a numerical vector and then adds the zeros and ones representing `FALSE` and `TRUE`.
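The conversions described above can be sketched with the `as.*()` family of functions. The names `weight.int` and `gender.num` are hypothetical and are used here only for illustration.

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

weight.int <- as.integer(weight)   # same values, stored as integers
typeof(weight.int)                 # "integer"
as.character(weight)               # "123" "157" ...
as.double(bp)                      # 0 1 0 0 1 0 1 0
gender.num <- as.double(gender)    # warning: NAs introduced by coercion
typeof(gender.num)                 # "double", with every element NA
sum(bp)                            # 3: logical values are counted as 0 and 1
```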
R also has functions to test whether a vector is of a particular type.
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
4.1.1.1 Coercion
Consider the following examples.
[1] 1 2 3 1
[1] "1" "2" "3" "dog"
[1] "TRUE" "FALSE" "cat"
[1] 123 158 205 199 224 140 106 194
Vectors in R can only contain elements of one type. If more than one type is included in a `c()` function, R silently coerces the vector to be of one type. The examples illustrate the hierarchy: if any element is a character, then the whole vector is character. If some elements are numeric (either integer or double) and other elements are logical, the whole vector is numeric. Note what happened when R was asked to add the numeric vector `weight` to the logical vector `bp`. The logical vector was silently coerced to be numeric, so that `FALSE` became zero and `TRUE` became one, and then the two numeric vectors were added.
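The exact calls behind the output are not shown. The following sketch produces the same four results, assuming `weight` and `bp` hold the values used earlier in the chapter.

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

c(1, 2, 3, TRUE)        # logical coerced to numeric: 1 2 3 1
c(1, 2, 3, "dog")       # numeric coerced to character
c(TRUE, FALSE, "cat")   # logical coerced to character
weight + bp             # bp coerced to 0/1 before the addition
```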
4.1.2 Accessing Specific Elements of Vectors
To access and possibly change specific elements of vectors, refer to the position of the element in square brackets. For example, `weight[4]` refers to the fourth element of the vector `weight`. Note that R starts the numbering of elements at 1, i.e., the first element of a vector `x` is `x[1]`.
[1] 123 157 205 199 223 140 105 194
[1] 223
[1] 123 157 205
[1] 8
[1] 194
[1] 123 157 205 199 223 140 105 194
[1] 123 157 202 199 223 140 105 194
Note that including nothing in the square brackets results in the whole vector being returned.
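A sketch of indexing calls that match the output above, assuming `weight` holds the eight values from Section 4.1. Note that the last assignment changes the third element from 205 to 202, which is the value used in the rest of the chapter.

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)

weight[5]                # the fifth element: 223
weight[1:3]              # the first three elements
length(weight)           # 8
weight[length(weight)]   # the last element: 194
weight[]                 # empty brackets return the whole vector
weight[3] <- 202         # replace the third element
weight
```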
Negative numbers in the square brackets tell R to omit the corresponding value. And a zero as a subscript returns nothing (more precisely, it returns a length zero vector of the appropriate type).
[1] 123 157 199 223 140 105 194
[1] 123 157 202 199 223 140 105
[1] 157 199 140 105 194
numeric(0)
[1] 157 123
Error in weight[c(-1, 2)]: only 0's may be mixed with negative subscripts
Note that mixing zero and other nonzero subscripts is allowed, but mixing negative and positive subscripts is not allowed.
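The following sketch reproduces the behavior of negative and zero subscripts. The failing call is wrapped in `try()` only so that the example runs to completion; the name `result` is hypothetical.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)

weight[-3]                # everything except the third element
weight[-length(weight)]   # everything except the last element
weight[-c(1, 3, 5)]       # omit the first, third, and fifth elements
weight[0]                 # numeric(0): a length zero vector
weight[c(0, 2, 1)]        # the zero is ignored: 157 123

# Mixing negative and positive subscripts is an error
result <- try(weight[c(-1, 2)], silent = TRUE)
inherits(result, "try-error")   # TRUE
```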
What about the (usual) case where we don’t know the positions of the elements we want? For example, we may want the weights of all females in the data. Later we will learn how to subset using logical indices, which is a very powerful way to access desired elements of a vector.
4.1.3 Practice Problem
A bad programming technique that often plagues beginners is a technique called hardcoding. Consider the following simple vector containing data on the number of tree species found at different sites.
Suppose we are interested in the second to last value of the data set. One way to do this is to first determine the length of the vector using the `length()` function, and then to subtract 1 from that value.
[1] 10
[1] 9
This is an example of hardcoding. But what if we attempt to use the same code on a second vector of tree species data that has a different number of sites?
[1] 6
[1] NA
That’s clearly not what we want. Fix this code so we can always extract the second to last value in the vector, regardless of the length of the vector.
4.2 Factors
Categorical variables such as `gender` can be represented as character vectors. In many cases this simple representation is sufficient. Consider, however, two other categorical variables, one representing age via categories `youth`, `young adult`, `middle age`, `senior`, and another representing income via categories `lower`, `middle`, and `upper`. Suppose that for the small health data set, all the people are either middle aged or senior citizens. If we just represented the variable via a character vector, there would be no way to know that there are two other categories, representing youth and young adults, which happen not to be present in the data set. And for the income variable, the character vector representation does not explicitly indicate that there is an ordering of the levels.
Factors in R provide a more sophisticated way to represent categorical variables. Factors explicitly contain all possible levels, and allow ordering of levels.
> age <- c("middle age", "senior", "middle age", "senior",
+ "senior", "senior", "senior", "middle age")
> income <- c("lower", "lower", "upper", "middle", "upper",
+ "lower", "lower", "middle")
> age
[1] "middle age" "senior" "middle age" "senior"
[5] "senior" "senior" "senior" "middle age"
[1] "lower" "lower" "upper" "middle" "upper"
[6] "lower" "lower" "middle"
[1] middle age senior middle age senior
[5] senior senior senior middle age
Levels: youth young adult middle age senior
[1] lower lower upper middle upper lower lower
[8] middle
Levels: lower < middle < upper
In the factor version of `age` the levels are explicitly listed, so it is clear that the two included levels are not all the possible levels. And in the factor version of `income`, the ordering is explicit.
In many cases the character vector representation of a categorical variable is sufficient and easier to work with. In this book, factors will not be used extensively. It is important to note that versions of R before 4.0.0 by default create a factor when character data are read in, so it is sometimes necessary to use the argument `stringsAsFactors = FALSE` to explicitly tell R not to do this (since R 4.0.0, `FALSE` is the default). This is shown later in the chapter when data frames are introduced.
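The calls that convert the character vectors to factors are not shown in the output above. A sketch consistent with the printed levels, using `factor()` with its `levels` and `ordered` arguments, might be:

```r
age <- c("middle age", "senior", "middle age", "senior",
         "senior", "senior", "senior", "middle age")
income <- c("lower", "lower", "upper", "middle", "upper",
            "lower", "lower", "middle")

# List every possible level, including those absent from the data
age <- factor(age, levels = c("youth", "young adult", "middle age", "senior"))

# ordered = TRUE records that lower < middle < upper
income <- factor(income, levels = c("lower", "middle", "upper"), ordered = TRUE)

levels(age)             # all four levels
table(age)              # unused levels appear with a count of zero
income[1] < income[3]   # TRUE: "lower" is below "upper"
```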
4.3 Names of Objects in R
There are a few hard and fast restrictions on the names of objects (such as vectors) in R. In addition to these restrictions, there are certain good practices, and many things to avoid as well.
From the help page for `make.names` in R, the name of an R object is “syntactically valid” if the name “consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number” and is not one of the “reserved words” in R such as `if`, `TRUE`, `function`, etc. For example, `c45t.le_dog` and `.ty56` are both syntactically valid (although not very good names) while `4cats` and `log#@gopher` are not.
A few important comments about naming objects follow:
- It is important to be aware that names of objects in R are case-sensitive, so `weight` and `Weight` do not refer to the same object.
[1] 123 157 202 199 223 140 105 194
Error in eval(expr, envir, enclos): object 'Weight' not found
- It is unwise to create an object with the same name as a built-in R object such as the function `c` or the function `mean`. In earlier versions of R this could be somewhat disastrous, but even in current versions, it is definitely not a good idea!
- As much as possible, choose names that are informative. When creating a variable you may initially remember that `x` contains heights and `y` contains genders, but after a few hours, a few days, or a few weeks, you probably will forget this. Better options are `Height` and `Gender`.
- As much as possible, be consistent in how you name objects. In particular, decide how to separate multi-word names. Some options include:
  - Using case to separate: `BloodPressure` or `bloodPressure`, for example
  - Using underscores to separate: `blood_pressure`, for example
  - Using a period to separate: `blood.pressure`, for example
4.4 Missing Data, Infinity, etc.
Most real-world data sets have variables where some observations are missing. In a longitudinal study participants may drop out. In a survey, participants may decide not to respond to certain questions. Statistical software should be able to represent missing data and to analyze data sets in which some data are missing.
In R, the value `NA` is used for a missing data value. Since missing values may occur in numeric, character, and other types of data, and since R requires that a vector contain only elements of one type, there are different types of `NA` values. Usually R determines the appropriate type of `NA` value automatically. It is worth noting that the default type for `NA` is logical, and that `NA` is NOT the same as the character string `"NA"`.
[1] "dog" "cat" NA "pig" NA "horse"
[1] FALSE FALSE TRUE FALSE TRUE FALSE
[1] "dog" "cat" NA "pig" NA "horse"
[7] "NA"
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE
[1] "logical"
How should missing data be treated in computations, such as finding the mean or standard deviation of a variable? One possibility is to return `NA`. Another is to remove the missing value(s) and then perform the computation.
[1] NA
[1] 2.75
As this example shows, the default behavior for the `mean()` function is to return `NA`. If removal of the missing values and then computing the mean is desired, the argument `na.rm` is set to `TRUE`. Different R functions have different default behaviors, and there are other possible actions. Consulting the help for a function provides the details.
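The behavior of `NA` described in this section can be sketched as follows. The object names `animals` and `x`, and the values in `x`, are assumptions chosen so that the results match the output shown; the original input is not given.

```r
animals <- c("dog", "cat", NA, "pig", NA, "horse")
is.na(animals)               # TRUE only for the two missing values

animals <- c(animals, "NA")  # the character string "NA" is not missing
is.na(animals)               # the last element is FALSE
typeof(NA)                   # "logical"

x <- c(1, 2, NA, 3, 5)       # hypothetical data with one missing value
mean(x)                      # NA: the default is to propagate missingness
mean(x, na.rm = TRUE)        # 2.75: mean of the four non-missing values
```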
4.4.1 Practice Problem
Collecting data is often a messy process resulting in multiple errors in the data. Consider the following small vector representing the weights of 10 adults in pounds.
As far as I know, it’s not possible for an adult to weigh 12 pounds, so that is most likely an error. Change this value to NA, and then find the standard deviation of the weights after removing the NA value.
4.4.2 Infinity and NaN
What happens if R code requests division by zero, or results in a number that is too large to be represented? Here are some examples.
[1] 0 1 2 3 4
[1] Inf 1.0000 0.5000 0.3333 0.2500
[1] NaN 1 1 1 1
[1] 1.024e+03 1.072e+301 Inf
`Inf` and `-Inf` represent infinity and negative infinity (and numbers which are too large in magnitude to be represented as floating point numbers). `NaN` represents the result of a calculation where the result is undefined, such as dividing zero by zero. All of these are common to a variety of programming languages, including R.
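A sketch of calculations that give the output above. The name `x` and the particular exponents are assumptions that happen to reproduce the printed values. The last two lines use `is.nan()` and `is.finite()`, which test for these special values.

```r
x <- 0:4
x
1 / x                   # division by zero gives Inf
x / x                   # zero divided by zero gives NaN
2^c(10, 1000, 10000)    # the last value overflows to Inf

is.nan(x / x)           # TRUE only for the first element
is.finite(1 / x)        # FALSE only for the first element
```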
4.5 Data Frames
Commonly, data is rectangular in form, with variables as columns and cases as rows. Continuing with the (contrived) data on weight, gender, and blood pressure medication, each of those variables would be a column of the data set, and each person’s measurements would be a row. In R, such data are represented as a data frame.
> healthData <- data.frame(Weight = weight, Gender=gender, bp.meds = bp,
+ stringsAsFactors=FALSE)
> healthData
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
[1] "Weight" "Gender" "bp.meds"
[1] "Weight" "Gender" "bp.meds"
Wt Gdr bp
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
[1] "1" "2" "3" "4" "5" "6" "7" "8"
The `data.frame` function can be used to create a data frame (although it’s more common to read a data frame into R from an external file, something that will be introduced later). The names of the variables in the data frame are given as arguments, as are the vectors of data that make up the variable’s values. The argument `stringsAsFactors=FALSE` asks R not to convert character vectors into factors, which R did by default before version 4.0.0, to the dismay of many users. Names of the columns (variables) can be extracted and set via either `names` or `colnames`. In the example, the variable names are changed to `Wt, Gdr, bp` and then changed back to the original `Weight, Gender, bp.meds` in this way. Rows can be named also. In this case, since specific row names were not provided, the default row names of `"1", "2"` etc. are used.
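The renaming steps described above are not shown as code. A sketch that is consistent with the output, assuming the vectors from earlier in the chapter:

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

healthData <- data.frame(Weight = weight, Gender = gender, bp.meds = bp,
                         stringsAsFactors = FALSE)

names(healthData)                           # extract the column names
colnames(healthData)                        # same result
names(healthData) <- c("Wt", "Gdr", "bp")   # set new column names
healthData
names(healthData) <- c("Weight", "Gender", "bp.meds")   # change them back
rownames(healthData)                        # default row names "1", "2", ...
```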
In the next example a built-in dataset called `mtcars` is made available by the `data` function, and then the first and last six rows are displayed using `head` and `tail`.
mpg cyl disp hp drat wt qsec
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
Datsun 710 22.8 4 108 93 3.85 2.320 18.61
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
Valiant 18.1 6 225 105 2.76 3.460 20.22
vs am gear carb
Mazda RX4 0 1 4 4
Mazda RX4 Wag 0 1 4 4
Datsun 710 1 1 4 1
Hornet 4 Drive 1 0 3 1
Hornet Sportabout 0 0 3 2
Valiant 1 0 3 1
mpg cyl disp hp drat wt qsec vs
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1
am gear carb
Porsche 914-2 1 5 2
Lotus Europa 1 5 2
Ford Pantera L 1 5 4
Ferrari Dino 1 5 6
Maserati Bora 1 5 8
Volvo 142E 1 4 2
Note that the `mtcars` data frame does have non-default row names which give the make and model of the cars.
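The calls that produce this output can be sketched as below. The last line is an addition that shows the row names directly.

```r
data(mtcars)            # make the built-in data set available
head(mtcars)            # first six rows
tail(mtcars)            # last six rows
rownames(mtcars)[1:3]   # row names hold the make and model
```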
4.5.1 Accessing Specific Elements of Data Frames
Data frames are two-dimensional, so to access a specific element (or elements) we need to specify both the row and column.
[1] 110
[1] 160 160 108
cyl disp
Mazda RX4 6 160
Mazda RX4 Wag 6 160
Datsun 710 4 108
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
Note that `mtcars[,1]` returns ALL elements in the first column. This agrees with the behavior for vectors, where leaving a subscript out of the square brackets tells R to return all values. In this case we are telling R to return all rows, and the first column.
For a data frame there is another way to access specific columns, using the `$` notation.
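A sketch of row and column indexing calls that match the four results shown above. The exact calls used in the original are an assumption, inferred from the output.

```r
mtcars[1, 4]       # row 1, column 4 (hp): 110
mtcars[1:3, 3]     # rows 1 to 3 of column 3 (disp): 160 160 108
mtcars[1:3, 2:3]   # rows 1 to 3 of columns 2 and 3, returned as a data frame
mtcars[, 1]        # all rows of column 1 (mpg)
```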
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8
[26] 4 4 4 8 6 8 4
Error in eval(expr, envir, enclos): object 'mpg' not found
Error in eval(expr, envir, enclos): object 'cyl' not found
[1] 123 157 202 199 223 140 105 194
Notice that typing the variable name, such as `mpg`, without the name of the data frame (and a dollar sign) as a prefix, does not work. This is sensible. There may be several data frames that have variables named `mpg`, and just typing `mpg` doesn’t provide enough information to know which is desired. But if there is a vector named `mpg` that is created outside a data frame, it will be retrieved when `mpg` is typed, which is why typing `weight` does work: `weight` was created outside of a data frame, although ultimately it was incorporated into the `healthData` data frame.
4.6 Lists
The third main data structure we will work with is a list. Technically a list is a vector, but one in which elements can be of different types. For example a list may have one element that is a vector, one element that is a data frame, and another element that is a function. Consider designing a function that fits a simple linear regression model to two quantitative variables. We might want that function to compute and return several things such as
- The fitted slope and intercept (a numeric vector with two components)
- The residuals (a numeric vector with \(n\) components, where \(n\) is the number of data points)
- Fitted values for the data (a numeric vector with \(n\) components, where \(n\) is the number of data points)
- The names of the dependent and independent variables (a character vector with two components)
In fact, R has a function, `lm`, which does this (and much more).
[1] "list"
[1] "coefficients" "residuals" "effects"
[4] "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels"
[10] "call" "terms" "model"
(Intercept) hp
30.09886 -0.06823
Mazda RX4 Mazda RX4 Wag
-1.59375 -1.59375
Datsun 710 Hornet 4 Drive
-0.95363 -1.19375
Hornet Sportabout Valiant
0.54109 -4.83489
Duster 360 Merc 240D
0.91707 -1.46871
Merc 230 Merc 280
-0.81717 -2.50678
Merc 280C Merc 450SE
-3.90678 -1.41777
Merc 450SL Merc 450SLC
-0.51777 -2.61777
Cadillac Fleetwood Lincoln Continental
-5.71206 -5.02978
Chrysler Imperial Fiat 128
0.29364 6.80421
Honda Civic Toyota Corolla
3.84901 8.23598
Toyota Corona Dodge Challenger
-1.98072 -4.36462
AMC Javelin Camaro Z28
-4.66462 -0.08293
Pontiac Firebird Fiat X1-9
1.04109 1.70421
Porsche 914-2 Lotus Europa
2.10991 8.01093
Ford Pantera L Ferrari Dino
3.71340 1.54109
Maserati Bora Volvo 142E
7.75761 -1.26198
The `lm` function returns a list (which in the code above has been assigned to the object `mpgHpLinMod`). One component of the list is the length 2 vector of coefficients, while another component is the length 32 vector of residuals. The code also illustrates that named components of a list can be accessed using the dollar sign notation, as with data frames.
The `list` function is used to create lists.
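The code behind this output is not shown. A sketch that yields the same coefficients, assuming the model regresses `mpg` on `hp` in the `mtcars` data frame:

```r
mpgHpLinMod <- lm(mpg ~ hp, data = mtcars)

mode(mpgHpLinMod)          # "list"
names(mpgHpLinMod)         # names of the list components
mpgHpLinMod$coefficients   # intercept and slope
mpgHpLinMod$residuals      # one residual per car
```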
> temporaryList <- list(first=weight, second=healthData,
+ pickle=list(a = 1:10, b=healthData))
> temporaryList
$first
[1] 123 157 202 199 223 140 105 194
$second
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
$pickle
$pickle$a
[1] 1 2 3 4 5 6 7 8 9 10
$pickle$b
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
Here, for illustration, I assembled a list to hold some of the R data structures we have been working with in this chapter. The first list element, named `first`, holds the `weight` vector we created in Section 4.1; the second list element, named `second`, holds the `healthData` data frame; and the third list element, named `pickle`, holds a list with elements named `a` and `b` that hold a vector of values 1 through 10 and another copy of the `healthData` data frame, respectively. As this example shows, a list can contain another list.
4.6.1 Accessing Specific Elements of Lists
We already have seen the dollar sign notation works for lists. In addition, the square bracket subsetting notation can be used. There is an added, somewhat subtle wrinkle—using either single or double square brackets.
[1] 123 157 202 199 223 140 105 194
[1] "numeric"
[1] 123 157 202 199 223 140 105 194
[1] "numeric"
$first
[1] 123 157 202 199 223 140 105 194
[1] "list"
Note the dollar sign and double bracket notation return a numeric vector, while the single bracket notation returns a list. Notice also the difference in results below.
$first
[1] 123 157 202 199 223 140 105 194
$second
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
[1] 157
The single bracket form returns the first and second elements of the list, while the double bracket form returns the second element in the first element of the list. Generally, do not put a vector of indices or names in a double bracket; you will likely get unexpected results. See, for example, the results below.
Error in temporaryList[[c(1, 2, 3)]]: recursive indexing failed at level 2
So, in summary, there are two main differences between using the single bracket `[]` and the double bracket `[[]]`. First, the single bracket returns a list that holds the object(s) at the given indices or names, whereas the double bracket returns the actual object held at the index or name placed in the innermost bracket. Second, a single bracket can be used to access a range of list elements, whereas a double bracket can only access a single element.
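The difference between the two bracket forms can be sketched with a simplified stand-in for `temporaryList`. The list below is an assumption made to keep the example short; it keeps the same element names but smaller contents.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)

# Simplified stand-in for the chapter's temporaryList
temporaryList <- list(first = weight,
                      second = data.frame(Weight = weight),
                      pickle = list(a = 1:10))

temporaryList$first               # the numeric vector itself
mode(temporaryList[["first"]])    # "numeric": double brackets return the object
mode(temporaryList["first"])      # "list": single brackets return a list

temporaryList[c(1, 2)]            # a list holding the first two elements
temporaryList[[c(1, 2)]]          # recursive: second value of the first element, 157
```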
4.7 Subsetting with Logical Vectors
Consider the `healthData` data frame. How can we access only those weights which are more than 200? How can we access the genders of those whose weights are more than 200? How can we compute the mean weight of males and the mean weight of females? Or consider the `mtcars` data frame. How can we obtain the miles per gallon for all six cylinder cars? Both of these data sets are small enough that it would not be too onerous to extract the values by hand. But for larger or more complex data sets, this would be very difficult or impossible to do in a reasonable amount of time, and would likely result in errors.
R has a powerful method for solving these sorts of problems using a variant of the subsetting methods that we already have learned. When given a logical vector in square brackets, R will return the values corresponding to `TRUE`.
To begin, focus on the `weight` and `gender` vectors created in Section 4.1.
The R code `weight > 200` returns a `TRUE` for each value of `weight` which is more than 200, and a `FALSE` for each value of `weight` which is less than or equal to 200. Similarly `gender == "female"` returns `TRUE` or `FALSE` depending on whether an element of `gender` is equal to `female`.
[1] 123 157 202 199 223 140 105 194
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
[1] "male" "male"
[1] 202 223
[1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
[1] 123 157 199 105
Consider the lines of R code one by one.
- `weight` instructs R to display the values in the vector `weight`.
- `weight > 200` instructs R to check whether each value in `weight` is greater than 200, and to return `TRUE` if so, and `FALSE` otherwise.
- The next line, `gender[weight > 200]`, does two things. First, inside the square brackets, it does the same thing as the second line, namely, returning `TRUE` or `FALSE` depending on whether a value of `weight` is or is not greater than 200. Second, each element of `gender` is matched with the corresponding `TRUE` or `FALSE` value, and is returned if and only if the corresponding value is `TRUE`. For example, the first value of `gender` is `gender[1]`. Since the first `TRUE` or `FALSE` value is `FALSE`, the first value of `gender` is not returned. Only the third and fifth values of `gender`, both of which happen to be `male`, are returned. Briefly, this line returns the genders of those people whose weight is over 200 pounds.
- The fourth line of code, `weight[weight > 200]`, again begins by returning `TRUE` or `FALSE` depending on whether elements of `weight` are larger than 200. Then those elements of `weight` corresponding to `TRUE` values are returned. So this line returns the weights of those people whose weights are more than 200 pounds.
- The fifth line returns `TRUE` or `FALSE` depending on whether elements of `gender` are equal to `female` or not.
- The sixth line returns the weights of those whose gender is `female`.
There are six comparison operators in R: `>`, `<`, `>=`, `<=`, `==`, and `!=`. Note that to test for equality a “double equals sign” is used, while `!=` tests for inequality.
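The six lines walked through above can be collected into a runnable sketch. The final two lines are an addition: they answer the earlier question about mean weight by gender by combining logical subsetting with `mean()`.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")

weight
weight > 200                 # logical vector, TRUE where weight exceeds 200
gender[weight > 200]         # genders of those over 200 pounds
weight[weight > 200]         # weights over 200 pounds
gender == "female"           # TRUE for each female
weight[gender == "female"]   # weights of the females

mean(weight[gender == "female"])   # 146
mean(weight[gender == "male"])     # 189.75
```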
4.7.1 Modifying or Creating Objects via Subsetting
The results of subsetting can be assigned to a new (or existing) R object, and subsetting on the left side of an assignment is a common way to modify an existing R object.
[1] 123 157 202 199 223 140 105 194
[1] 123 157 199 140 105 194
[1] 1 2 3 4 5 6 7 8 9 10
[1] 0 0 0 0 5 6 7 8 9 10
[1] -3 -2 -1 0 1 2 3 4 5 6 7 8 9
[1] NA NA NA 0 1 2 3 4 5 6 7 8 9
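The input for this output is not shown. One sketch that reproduces it is below; the names `lightweight`, `x`, and `y` are assumptions.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
weight
lightweight <- weight[weight < 200]   # assign a subset to a new object
lightweight

x <- 1:10
x
x[x < 5] <- 0      # subsetting on the left side modifies x in place
x

y <- -3:9
y
y[y < 0] <- NA     # replace the negative values with NA
y
```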
4.7.2 Logical Subsetting and Data Frames
First consider the small and simple `healthData` data frame.
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
3 202 male FALSE
4 199 female FALSE
5 223 male TRUE
6 140 male FALSE
7 105 female TRUE
8 194 male FALSE
[1] 202 223 140 194
Weight Gender bp.meds
1 123 female FALSE
2 157 female TRUE
4 199 female FALSE
7 105 female TRUE
Gender bp.meds
3 male FALSE
4 female FALSE
5 male TRUE
8 male FALSE
The first example is really just subsetting a vector, since the `$` notation creates vectors. The next two examples return subsets of the whole data frame. Note that the logical vector subsets the rows of the data frame, choosing those rows where the gender is female or the weight is more than 190. Note also that the specification for the columns (after the comma) is left blank in the first case, telling R to return all the columns. In the second case the second and third columns are requested explicitly.
Next consider the much larger and more complex `WorldBank` data frame. Recall that the `str` function displays the “structure” of an R object. Here is a look at the structure of several R objects.
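A sketch of three subsetting calls that match the output above. The calls are inferred from the results and the surrounding description, so treat them as an assumption.

```r
healthData <- data.frame(
  Weight = c(123, 157, 202, 199, 223, 140, 105, 194),
  Gender = c("female", "female", "male", "female", "male", "male", "female", "male"),
  bp.meds = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Subset a vector: weights of the males
healthData$Weight[healthData$Gender == "male"]

# Subset rows, keep all columns: the females
healthData[healthData$Gender == "female", ]

# Subset rows and columns: columns 2 and 3 for weights over 190
healthData[healthData$Weight > 190, 2:3]
```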
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
List of 3
$ first : num [1:8] 123 157 202 199 223 140 105 194
$ second:'data.frame': 8 obs. of 3 variables:
..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
..$ Gender : chr [1:8] "female" "female" "male" "female" ...
..$ bp.meds: logi [1:8] FALSE TRUE FALSE FALSE TRUE FALSE ...
$ pickle:List of 2
..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ b:'data.frame': 8 obs. of 3 variables:
.. ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
.. ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
.. ..$ bp.meds: logi [1:8] FALSE TRUE FALSE FALSE TRUE FALSE ...
'data.frame': 11880 obs. of 15 variables:
$ iso2c : chr "AD" "AD" "AD" "AD" ...
$ country : chr "Andorra" "Andorra" "Andorra" "Andorra" ...
$ year : int 1978 1979 1977 2007 1976 2011 2012 2008 1980 1972 ...
$ fertility.rate : num NA NA NA 1.18 NA NA NA 1.25 NA NA ...
$ life.expectancy : num NA NA NA NA NA NA NA NA NA NA ...
$ population : num 33746 34819 32769 81292 31781 ...
$ GDP.per.capita.Current.USD : num 9128 11820 7751 39923 7152 ...
$ X15.to.25.yr.female.literacy: num NA NA NA NA NA NA NA NA NA NA ...
$ iso3c : chr "AND" "AND" "AND" "AND" ...
$ region : chr "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" ...
$ capital : chr "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" ...
$ longitude : num 1.52 1.52 1.52 1.52 1.52 ...
$ latitude : num 42.5 42.5 42.5 42.5 42.5 ...
$ income : chr "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" ...
$ lending : chr "Not classified" "Not classified" "Not classified" "Not classified" ...
First we see that mtcars is a data frame which has 32 observations (rows) on each of 11 variables (columns). The names of the variables are given, along with their type (in this case, all numeric), and the first few values of each variable are given.
Second we see that temporaryList is a list with three components. Each of the components is described separately, with the first few values again given.
Third we examine the structure of WorldBank. It is a data frame with 11880 observations on each of 15 variables. Some of these are character variables, some are numeric, and one (year) is integer. Looking at the first few values we see that some variables have missing values.
Consider creating a data frame which only has the observations from one year, say 1971. That’s relatively easy. Just choose rows for which year is equal to 1971.
[1] 216 15
The dim function returns the dimensions of a data frame, i.e., the number of rows and the number of columns. From dim we see that there are 216 cases from 1971.
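The row selection described above can be sketched on a small stand-in data frame. The data frame wb and its values are hypothetical and are not the WorldBank data; only the subsetting pattern is meant to carry over.

```r
# Hypothetical stand-in for WorldBank: a few made-up rows spanning two years
wb <- data.frame(
  country = c("A", "B", "C", "A", "B"),
  year = c(1971L, 1971L, 1971L, 1972L, 1972L),
  fertility.rate = c(NA, 6.5, 3.1, 2.9, NA)
)

# Keep rows whose year equals 1971; leaving the column slot empty keeps all columns
wb1971 <- wb[wb$year == 1971, ]

dim(wb1971)     # number of rows, then number of columns
dim(wb1971)[1]  # just the number of rows; nrow(wb1971) is equivalent
```

The logical vector wb$year == 1971 has one element per row, and R keeps exactly the rows where it is TRUE.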
Next, how can we create a data frame which only contains data from 1971, and also only contains cases for which there are no missing values in the fertility rate variable? R has a built-in function is.na which returns TRUE if the observation is missing and returns FALSE otherwise. And !is.na returns the negation, i.e., it returns FALSE if the observation is missing and TRUE if the observation is not missing.
[1] NA 6.512 7.671 3.517 4.933 3.118 7.264 3.104
[9] NA 2.200 2.961 2.788 4.479 2.260 2.775 2.949
[17] 6.942 2.210 6.657 2.100 6.293 7.329 6.786 NA
[25] 5.771
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[9] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[17] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[25] TRUE
[1] 193 15
From dim we see that there are 193 cases from 1971 with non-missing fertility rate data.
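One way to combine the two conditions is sketched below on a hypothetical stand-in data frame (wb, with made-up values), since the commands behind the output above are not shown here. The names keep and wb1971fert are illustrative only.

```r
# Hypothetical stand-in for WorldBank (made-up values)
wb <- data.frame(
  country = c("A", "B", "C", "A", "B"),
  year = c(1971L, 1971L, 1971L, 1972L, 1972L),
  fertility.rate = c(NA, 6.5, 3.1, 2.9, NA)
)

is.na(wb$fertility.rate)   # TRUE where the fertility rate is missing
!is.na(wb$fertility.rate)  # TRUE where the fertility rate is present

# Both conditions must hold: the year is 1971 and the fertility rate is not missing
keep <- wb$year == 1971 & !is.na(wb$fertility.rate)
wb1971fert <- wb[keep, ]
dim(wb1971fert)
```

The same result could be reached in two steps, first selecting the year and then dropping rows with a missing fertility rate.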
Return attention now to the original WorldBank data frame with data not only from 1971. How can we extract only those cases (rows) which have NO missing data? Consider the following simple example:
> temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA),
+ V2 = c(NA, 1, 4, 5, NA),
+ V3 = c(1, 2, 3, 5, 7))
> temporaryDataFrame
V1 V2 V3
1 1 NA 1
2 2 1 2
3 3 4 3
4 4 5 5
5 NA NA 7
> is.na(temporaryDataFrame)
V1 V2 V3
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,] TRUE TRUE FALSE
> rowSums(is.na(temporaryDataFrame))
[1] 1 0 0 0 2
First notice that is.na will test each element of a data frame for missingness. Also recall that if R is asked to sum a logical vector, it will first convert the logical vector to numeric and then compute the sum, which effectively counts the number of elements in the logical vector which are TRUE. The rowSums function computes the sum of each row. So rowSums(is.na(temporaryDataFrame)) returns a vector with as many elements as there are rows in the data frame. If an element is zero, the corresponding row has no missing values. If an element is greater than zero, the value is the number of variables which are missing in that row. This gives a simple method to return all the cases which have no missing data.
[1] 11880 15
[1] 564 15
Out of the 11880 rows in the original data frame, only 564 have no missing observations!
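The row-sum method can be sketched on the small temporaryDataFrame example from the text. The names nMissing and completeRows are illustrative, and the comparison with complete.cases() is an addition rather than something the text relies on.

```r
# The small example data frame from the text
temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA),
                                 V2 = c(NA, 1, 4, 5, NA),
                                 V3 = c(1, 2, 3, 5, 7))

# Count the missing values in each row; zero means the row is complete
nMissing <- rowSums(is.na(temporaryDataFrame))

# Keep only the rows with no missing values (rows 2, 3, and 4 here)
completeRows <- temporaryDataFrame[nMissing == 0, ]
completeRows

# Base R's complete.cases() gives the same logical selection
identical(complete.cases(temporaryDataFrame), unname(nMissing == 0))
```

Applying the same pattern to a larger data frame only requires swapping in its name.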
4.8 Patterned Data
Sometimes it is useful to generate all the integers from 1 through 20, to generate a sequence of 100 points equally spaced between 0 and 1, etc. The R functions seq() and rep() as well as the “colon operator” : help to generate such sequences.
The colon operator generates a sequence of values with increments of \(1\) or \(-1\).
[1] 1 2 3 4 5 6 7 8 9 10
[1] -5 -4 -3 -2 -1 0 1 2 3
[1] 10 9 8 7 6 5 4
[1] 3.142 4.142 5.142 6.142
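The commands behind the four results above are not shown. Calls like the following would produce sequences of that form and are offered as a plausible reconstruction, not as the original code.

```r
1:10   # the integers 1 through 10
-5:3   # -5 up to 3; unary minus binds tighter than the colon, so this is (-5):3
10:4   # counts down by 1 because the start exceeds the end
pi:7   # starts at pi and adds 1 while the result does not exceed 7 (four values)
```

Note that the end value is not always reached: pi:7 stops at pi + 3 because pi + 4 would exceed 7.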
The seq() function generates either a sequence of pre-specified length or a sequence with pre-specified increments.
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
[1] 1.000 1.333 1.667 2.000 2.333 2.667 3.000 3.333
[9] 3.667 4.000 4.333 4.667 5.000
[1] 3.0000 2.5556 2.1111 1.6667 1.2222 0.7778
[7] 0.3333 -0.1111 -0.5556 -1.0000
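The two ways of calling seq() can be sketched as follows. The original commands are not shown, so the argument values here are inferred from the output above and should be read as an assumption.

```r
# Pre-specified increment: step by 0.1 from 0 to 1 (11 values)
seq(from = 0, to = 1, by = 0.1)

# Pre-specified length: 13 equally spaced values from 1 to 5 (increment 1/3)
seq(from = 1, to = 5, length.out = 13)

# A decreasing sequence: 10 equally spaced values from 3 down to -1 (increment -4/9)
seq(from = 3, to = -1, length.out = 10)
```

With length.out, R computes the increment itself as (to - from) / (length.out - 1).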
The rep() function replicates the values in a given vector.
[1] 1 2 4 1 2 4 1 2 4
[1] 1 2 4 1 2 4 1 2 4
[1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "c" "c" "c"
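The calls below would produce results of the form shown above. They are a reconstruction under the assumption that the usual times and length.out arguments were used; the vector name x is illustrative.

```r
x <- c(1, 2, 4)

rep(x, length.out = 9)  # recycle x until the result has 9 elements
rep(x, times = 3)       # repeat the whole vector 3 times

# A vector of counts repeats each element its own number of times
rep(c("a", "b", "c"), times = c(3, 2, 7))

# For contrast, each repeats every element before moving to the next
rep(x, each = 2)
```

The first two calls give the same result here only because 9 is an exact multiple of the length of x.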
4.8.1 Practice Problem
Often when using R you will want to simulate data from a specific probability distribution (e.g., normal/Gaussian, binomial, Poisson). R has a vast suite of functions for working with statistical distributions. To generate values from a statistical distribution, the function has a name beginning with an “r” followed by some abbreviation of the probability distribution. For example, to simulate from the three distributions mentioned above, we can use the functions rnorm(), rbinom(), and rpois().
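A minimal sketch of the three generators follows. The seed, sample sizes, and parameter values are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed, used only so the draws are reproducible

normDraws  <- rnorm(5, mean = 0, sd = 1)        # 5 draws from a normal distribution
binomDraws <- rbinom(5, size = 10, prob = 0.5)  # 5 draws from a binomial with 10 trials
poisDraws  <- rpois(5, lambda = 3)              # 5 draws from a Poisson with mean 3

normDraws
binomDraws
poisDraws
```

Note that rnorm() is parameterized by the standard deviation sd rather than the variance.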
Use the rnorm() function to generate 10,000 values from the standard normal distribution (the normal distribution with mean = 0 and variance = 1). Consult the help page for rnorm() if you need to. Save these values to a vector named sim.vals. Then use the hist() function to draw a histogram of the simulated data. Does the data look like it follows a normal distribution?
4.9 Exercises
Exercise 3 Learning objectives: create, subset, and manipulate vector contents and attributes; summarize vector data using R table() and other functions; generate basic graphics using vector data.
Exercise 4 Learning objectives: use functions to describe data frame characteristics; summarize and generate basic graphics for variables held in data frames; apply the subset function with logical operators; illustrate how NA, NaN, Inf, and other special values occur; recognize the implications of using floating point arithmetic with logical operators.
Exercise 5 Learning objectives: practice with lists, data frames, and associated functions; summarize variables held in lists and data frames; work with R’s linear regression lm() function output; review logical subsetting of vectors for partitioning and assigning of new values; generate and visualize data from mathematical functions.
Technically the objects described in this section are “atomic” vectors (all elements of the same type), since lists, to be described below, also are actually vectors. This will not be an important issue, and the shorter term vector will be used for atomic vectors below.↩
Missing data will be discussed in more detail later in the chapter.↩
The mode function returns the type or storage mode of an object.↩
Try this example using only single brackets\(\ldots\) it will return a list holding elements first, second, and pickle.↩