Chapter 4 Data Structures

A data structure is a format for organizing and storing data. The structure is designed so that data can be accessed and worked with in specific ways. Statistical software and programming languages have methods (or functions) designed to operate on different kinds of data structures.

This chapter’s focus is on data structures. To help initial understanding, the data in this chapter will be relatively modest in size and complexity. The ideas and methods, however, generalize to larger and more complex data sets.

The base data structures in R are vectors, matrices, arrays, data frames, and lists. The first three, vectors, matrices, and arrays, require all elements to be of the same type or homogeneous, e.g., all numeric or all character. Data frames and lists allow elements to be of different types or heterogeneous, e.g., some elements of a data frame may be numeric while other elements may be character. These base structures can also be organized by their dimensionality, i.e., 1-dimensional, 2-dimensional, or N-dimensional, as shown in Table 4.1.

TABLE 4.1: Dimension and type content of base data structures in R
Dimension   Homogeneous     Heterogeneous
1           Atomic vector   List
2           Matrix          Data frame
N           Array

R has no scalar (0-dimensional) types. Individual numbers or strings are actually vectors of length one.

An efficient way to understand what comprises a given object is to use the str() function. str() is short for structure and prints a compact, human-readable description of any R data structure. For example, in the code below, we prove to ourselves that what we might think of as a scalar value is actually a vector of length one.
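The listing for this example is not reproduced here. A minimal sketch that would produce the output below, assuming the value is stored in an object named a (as the discussion that follows indicates), is:

```r
a <- 1         # what we might think of as a scalar
str(a)         # compact description of the object
is.vector(a)   # test whether a is a vector
length(a)      # number of elements in a
```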

 num 1
[1] TRUE
[1] 1

Here we assigned the value one to a. The call str(a) prints num 1, which says a is numeric of length one. Then, just to be sure, we used the function is.vector() to test whether a is in fact a vector. Finally, just for fun, we asked for the length of a, which again returns one. There is a set of similar logical tests for the other base data structures, e.g., is.matrix(), is.array(), is.data.frame(), and is.list(). These will all come in handy as we encounter different R objects.

4.1 Vectors

Think of a vector as a structure to represent one variable in a data set. For example, a vector might hold the weights, in pounds, of 7 people in a data set. Another vector might hold the genders of those 7 people. The c() function in R is useful for creating (small) vectors and for modifying existing vectors. Think of c as standing for “combine”.
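As a sketch of how the two vectors printed below could be created, assuming they are named weight and gender (the names used later in the chapter):

```r
# Weights, in pounds, of seven people
weight <- c(123, 157, 205, 199, 223, 140, 105)
# Genders of the same seven people; character values go in quotation marks
gender <- c("female", "female", "male", "female", "male", "male", "female")
weight
gender
```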

[1] 123 157 205 199 223 140 105
[1] "female" "female" "male"   "female" "male"  
[6] "male"   "female"

Notice that elements of a vector are separated by commas when using the c() function to create a vector. Also notice that character values are placed inside quotation marks.

The c() function also can be used to add to an existing vector. For example, if an eighth male person was included in the data set, and his weight was 194 pounds, the existing vectors could be modified as follows.
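A minimal sketch of this modification, assuming the vectors are named weight and gender as elsewhere in the chapter:

```r
weight <- c(123, 157, 205, 199, 223, 140, 105)
gender <- c("female", "female", "male", "female", "male", "male", "female")

# Append the eighth person's values to the existing vectors
weight <- c(weight, 194)
gender <- c(gender, "male")
weight
gender
```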

[1] 123 157 205 199 223 140 105 194
[1] "female" "female" "male"   "female" "male"  
[6] "male"   "female" "male"  

4.1.1 Types, Conversion, Coercion

Clearly it is important to distinguish between different types of vectors. For example, it makes sense to ask R to calculate the mean of the weights stored in weight, but does not make sense to ask R to compute the mean of the genders stored in gender. Vectors in R may have one of six different “types”: character, double, integer, logical, complex, and raw. Only the first four of these will be of interest below, and the distinction between double and integer will not be of great import. To illustrate logical vectors, imagine that each of the eight people in the data set was asked whether he or she was taking blood pressure medication, and the responses were coded as TRUE if the person answered yes, and FALSE if the person answered no.
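The calls behind the output below are not shown. One plausible reconstruction uses typeof() to report each vector's type, and assumes the medication responses are stored in a vector named bp (the name used in the discussion that follows):

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")

typeof(weight)
typeof(gender)

# TRUE if the person takes blood pressure medication, FALSE otherwise
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
bp
typeof(bp)
```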

[1] "double"
[1] "character"
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
[1] "logical"

It may be surprising to see that the variable weight is of double type, even though its values all are integers. By default R creates a double type vector when numeric values are given via the c() function.

When it makes sense, it is possible to convert vectors to a different type. Consider the following examples.
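A sketch of conversions that would give the results below, using the as.*() family of functions. The object name weight.int is an assumption made for illustration.

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

weight.int <- as.integer(weight)  # hypothetical name for the integer copy
weight.int
typeof(weight.int)
as.character(weight)              # numbers become their character representations
as.numeric(bp)                    # FALSE becomes 0, TRUE becomes 1
as.numeric(gender)                # no sensible conversion: NAs, with a warning
sum(bp)                           # bp is converted to numeric before summing
```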

[1] 123 157 205 199 223 140 105 194
[1] "integer"
[1] "123" "157" "205" "199" "223" "140" "105" "194"
[1] 0 1 0 0 1 0 1 0
Warning: NAs introduced by coercion
[1] NA NA NA NA NA NA NA NA
[1] 3

The integer version of weight doesn’t look any different, but it is stored differently, which can be important both for computational efficiency and for interfacing with other languages such as C++. As noted above, however, we will not worry about the distinction between integer and double types. Converting weight to character goes as expected: the character representations of the numbers replace the numbers themselves. Converting the logical vector bp to double is straightforward too: FALSE is converted to zero, and TRUE is converted to one. Now think about converting the character vector gender to a numeric double vector. It’s not at all clear how to represent “female” and “male” as numbers. In this case R creates a numeric vector with each element set to NA, which is the representation of missing data, and issues a warning. Finally consider the code sum(bp). Here bp is a logical vector, but when R sees that we are asking to sum it, R automatically converts it to a numeric vector and then adds the zeros and ones representing FALSE and TRUE.

R also has functions to test whether a vector is of a particular type.
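The exact calls behind the output below are not shown. One set of type tests consistent with it (an assumption, using the is.*() family) is:

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

is.double(weight)      # weight is stored as double
is.character(weight)   # weight is not character
is.character(gender)   # gender is character
is.logical(bp)         # bp is logical
```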

[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE

4.1.1.1 Coercion

Consider the following examples.
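The expressions themselves are missing here. A sketch that reproduces the four results below, assuming weight and bp as defined earlier, is:

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

c(1, 2, 3, TRUE)        # logical coerced to numeric
c(1, 2, 3, "dog")       # numeric coerced to character
c(TRUE, FALSE, "cat")   # logical coerced to character
weight + bp             # bp coerced to 0/1 before the addition
```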

[1] 1 2 3 1
[1] "1"   "2"   "3"   "dog"
[1] "TRUE"  "FALSE" "cat"  
[1] 123 158 205 199 224 140 106 194

Vectors in R can only contain elements of one type. If more than one type is included in a c() function, R silently coerces the vector to be of one type. The examples illustrate the hierarchy—if any element is a character, then the whole vector is character. If some elements are numeric (either integer or double) and other elements are logical, the whole vector is numeric. Note what happened when R was asked to add the numeric vector weight to the logical vector bp. The logical vector was silently coerced to be numeric, so that FALSE became zero and TRUE became one, and then the two numeric vectors were added.

4.1.2 Accessing Specific Elements of Vectors

To access and possibly change specific elements of vectors, refer to the position of the element in square brackets. For example, weight[4] refers to the fourth element of the vector weight. Note that R starts the numbering of elements at 1, i.e., the first element of a vector x is x[1].
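A sketch of indexing calls consistent with the output below. The final assignment, which changes the third weight from 205 to 202, is inferred from the printed values.

```r
weight <- c(123, 157, 205, 199, 223, 140, 105, 194)

weight                    # the whole vector
weight[5]                 # the fifth element
weight[1:3]               # the first three elements
length(weight)            # number of elements
weight[length(weight)]    # the last element
weight[]                  # empty brackets: the whole vector again
weight[3] <- 202          # replace the third element
weight
```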

[1] 123 157 205 199 223 140 105 194
[1] 223
[1] 123 157 205
[1] 8
[1] 194
[1] 123 157 205 199 223 140 105 194
[1] 123 157 202 199 223 140 105 194

Note that including nothing in the square brackets results in the whole vector being returned.

Negative numbers in the square brackets tell R to omit the corresponding value. And a zero as a subscript returns nothing (more precisely, it returns a length zero vector of the appropriate type).
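A sketch of negative and zero subscripts that matches the output below, assuming weight already has 202 as its third value. The last call in the original raises an error, so it appears here only as a comment.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)

weight[-3]                # everything except the third element
weight[-length(weight)]   # everything except the last element
weight[-c(1, 3, 5)]       # omit the first, third, and fifth elements
weight[0]                 # a length-zero numeric vector
weight[c(0, 2, 1)]        # the zero is ignored: second element, then first
# weight[c(-1, 2)]        # error: negative and positive subscripts mixed
```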

[1] 123 157 199 223 140 105 194
[1] 123 157 202 199 223 140 105
[1] 157 199 140 105 194
numeric(0)
[1] 157 123
Error in weight[c(-1, 2)]: only 0's may be mixed with negative subscripts

Note that mixing zero and other nonzero subscripts is allowed, but mixing negative and positive subscripts is not allowed.

What about the (usual) case where we don’t know the positions of the elements we want? For example, we may want the weights of all females in the data. Later we will learn how to subset using logical indices, which is a very powerful way to access desired elements of a vector.

4.1.3 Practice Problem

A bad programming practice that often plagues beginners is hardcoding. Consider the following simple vector containing data on the number of tree species found at different sites.

Suppose we are interested in the second to last value of the data set. One way to get it is to first determine the length of the vector using the length() function, then subtract 1 from that value and use the result as the index.

[1] 10
[1] 9

This is an example of hardcoding. But what if we attempt to use the same code on a second vector of tree species data that has a different number of sites?

[1] 6
[1] NA

That’s clearly not what we want. Fix this code so we can always extract the second to last value in the vector, regardless of the length of the vector.

4.2 Factors

Categorical variables such as gender can be represented as character vectors. In many cases this simple representation is sufficient. Consider, however, two other categorical variables, one representing age via categories youth, young adult, middle age, senior, and another representing income via categories lower, middle, and upper. Suppose that for the small health data set, all the people are either middle aged or senior citizens. If we just represented the variable via a character vector, there would be no way to know that there are two other categories, representing youth and young adults, which happen not to be present in the data set. And for the income variable, the character vector representation does not explicitly indicate that there is an ordering of the levels.

Factors in R provide a more sophisticated way to represent categorical variables. Factors explicitly contain all possible levels, and allow ordering of levels.
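A sketch of how the character vectors and factors printed below could be built with factor(), assuming the objects are named age and income. The levels argument lists every possible category, and ordered = TRUE records the ordering.

```r
age <- c("middle age", "senior", "middle age", "senior",
         "senior", "senior", "senior", "middle age")
income <- c("lower", "lower", "upper", "middle",
            "upper", "lower", "lower", "middle")
age
income

# List all possible levels, including those absent from the data
age <- factor(age, levels = c("youth", "young adult", "middle age", "senior"))
# An ordered factor: lower < middle < upper
income <- factor(income, levels = c("lower", "middle", "upper"), ordered = TRUE)
age
income
```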

[1] "middle age" "senior"     "middle age" "senior"    
[5] "senior"     "senior"     "senior"     "middle age"
[1] "lower"  "lower"  "upper"  "middle" "upper" 
[6] "lower"  "lower"  "middle"
[1] middle age senior     middle age senior    
[5] senior     senior     senior     middle age
Levels: youth young adult middle age senior
[1] lower  lower  upper  middle upper  lower  lower 
[8] middle
Levels: lower < middle < upper

In the factor version of age the levels are explicitly listed, so it is clear that the two included levels are not all the possible levels. And in the factor version of income, the ordering is explicit.

In many cases the character vector representation of a categorical variable is sufficient and easier to work with. In this book, factors will not be used extensively. It is important to note that versions of R before 4.0.0 by default create a factor when character data are read in, so with those versions it is sometimes necessary to use the argument stringsAsFactors = FALSE to explicitly tell R not to do this. This is shown later in the chapter when data frames are introduced.

4.3 Names of Objects in R

There are few hard and fast restrictions on the names of objects (such as vectors) in R. In addition to these restrictions, there are certain good practices, and many things to avoid as well.

From the help page for make.names in R, the name of an R object is “syntactically valid” if the name “consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number” and is not one of the “reserved words” in R such as if, TRUE, function, etc. For example, c45t.le_dog and .ty56 are both syntactically valid (although not very good names) while 4cats and log#@gopher are not.

A few important comments about naming objects follow:

  1. It is important to be aware that names of objects in R are case-sensitive, so weight and Weight do not refer to the same object.
[1] 123 157 202 199 223 140 105 194
Error in eval(expr, envir, enclos): object 'Weight' not found
  2. It is unwise to create an object with the same name as a built in R object such as the function c or the function mean. In earlier versions of R this could be somewhat disastrous, but even in current versions, it is definitely not a good idea!
  3. As much as possible, choose names that are informative. When creating a variable you may initially remember that x contains heights and y contains genders, but after a few hours, a few days, or a few weeks, you probably will forget this. Better options are Height and Gender.
  4. As much as possible, be consistent in how you name objects. In particular, decide how to separate multi-word names. Some options include:
    • Using case to separate: BloodPressure or bloodPressure for example
    • Using underscores to separate: blood_pressure for example
    • Using a period to separate: blood.pressure for example

4.4 Missing Data, Infinity, etc.

Most real-world data sets have variables where some observations are missing. In a longitudinal study participants may drop out. In a survey, participants may decide not to respond to certain questions. Statistical software should be able to represent missing data and to analyze data sets in which some data are missing.

In R, the value NA is used for a missing data value. Since missing values may occur in numeric, character, and other types of data, and since R requires that a vector contain only elements of one type, there are different types of NA values. Usually R determines the appropriate type of NA value automatically. It is worth noting that the default type for NA is logical, and that NA is NOT the same as the character string "NA".
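A sketch consistent with the output below. The vector name animals is an assumption made for illustration. Note how the missing value NA and the string "NA" are treated differently by is.na().

```r
animals <- c("dog", "cat", NA, "pig", NA, "horse")  # hypothetical name
animals
is.na(animals)                  # TRUE where a value is missing

animals <- c(animals, "NA")     # append the character string "NA"
animals
is.na(animals)                  # the string "NA" is not a missing value
typeof(NA)                      # the default type of NA
```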

[1] "dog"   "cat"   NA      "pig"   NA      "horse"
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
[1] "dog"   "cat"   NA      "pig"   NA      "horse"
[7] "NA"   
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[1] "logical"

How should missing data be treated in computations, such as finding the mean or standard deviation of a variable? One possibility is to return NA. Another is to remove the missing value(s) and then perform the computation.
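Both possibilities can be sketched with mean(). The values in x are an assumption, chosen so that the mean of the non-missing values is 2.75 as in the output below.

```r
x <- c(1, 2, 3, NA, 5)   # hypothetical data with one missing value
mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # drop the NA first: (1 + 2 + 3 + 5) / 4
```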

[1] NA
[1] 2.75

As this example shows, the default behavior for the mean() function is to return NA. If removal of the missing values and then computing the mean is desired, the argument na.rm is set to TRUE. Different R functions have different default behaviors, and there are other possible actions. Consulting the help for a function provides the details.

4.4.1 Practice Problem

Collecting data is often a messy process resulting in multiple errors in the data. Consider the following small vector representing the weights of 10 adults in pounds.

As far as I know, it’s not possible for an adult to weigh 12 pounds, so that is most likely an error. Change this value to NA, and then find the standard deviation of the weights after removing the NA value.

4.4.2 Infinity and NaN

What happens if R code requests division by zero, or results in a number that is too large to be represented? Here are some examples.
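A sketch of calculations that would give the output below. The names x and y are assumptions.

```r
x <- 0:4
x
1 / x                    # division by zero gives Inf
x / x                    # zero divided by zero gives NaN

y <- c(10, 1000, 10000)
2^y                      # 2^10000 is too large to represent, so Inf
```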

[1] 0 1 2 3 4
[1]    Inf 1.0000 0.5000 0.3333 0.2500
[1] NaN   1   1   1   1
[1]  1.024e+03 1.072e+301        Inf

Inf and -Inf represent infinity and negative infinity (and numbers which are too large in magnitude to be represented as floating point numbers). NaN represents the result of a calculation where the result is undefined, such as dividing zero by zero. All of these are common to a variety of programming languages, including R.

4.5 Data Frames

Commonly, data are rectangular in form, with variables as columns and cases as rows. Continuing with the (contrived) data on weight, gender, and blood pressure medication, each of those variables would be a column of the data set, and each person’s measurements would be a row. In R, such data are represented as a data frame.
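A sketch of how the healthData data frame shown below could be built and its names inspected and changed, assuming the weight, gender, and bp vectors from earlier sections:

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

healthData <- data.frame(Weight = weight, Gender = gender, bp.meds = bp,
                         stringsAsFactors = FALSE)
healthData
names(healthData)
colnames(healthData)

names(healthData) <- c("Wt", "Gdr", "bp")              # rename the columns
healthData
names(healthData) <- c("Weight", "Gender", "bp.meds")  # restore the original names
rownames(healthData)                                   # default row names
```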

  Weight Gender bp.meds
1    123 female   FALSE
2    157 female    TRUE
3    202   male   FALSE
4    199 female   FALSE
5    223   male    TRUE
6    140   male   FALSE
7    105 female    TRUE
8    194   male   FALSE
[1] "Weight"  "Gender"  "bp.meds"
[1] "Weight"  "Gender"  "bp.meds"
   Wt    Gdr    bp
1 123 female FALSE
2 157 female  TRUE
3 202   male FALSE
4 199 female FALSE
5 223   male  TRUE
6 140   male FALSE
7 105 female  TRUE
8 194   male FALSE
[1] "1" "2" "3" "4" "5" "6" "7" "8"

The data.frame function can be used to create a data frame (although it’s more common to read a data frame into R from an external file, something that will be introduced later). The names of the variables in the data frame are given as arguments, as are the vectors of data that make up the variables’ values. The argument stringsAsFactors=FALSE asks R not to convert character vectors into factors, which was the default behavior before R 4.0.0, to the dismay of many users. Names of the columns (variables) can be extracted and set via either names or colnames. In the example, the variable names are changed to Wt, Gdr, bp and then changed back to the original Weight, Gender, bp.meds in this way. Rows can be named also. In this case, since specific row names were not provided, the default row names of "1", "2" etc. are used.

In the next example a built-in dataset called mtcars is made available by the data function, and then the first and last six rows are displayed using head and tail.
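A minimal sketch of the calls just described:

```r
data(mtcars)    # make the built-in data set available
head(mtcars)    # first six rows
tail(mtcars)    # last six rows
```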

                   mpg cyl disp  hp drat    wt  qsec
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
Datsun 710        22.8   4  108  93 3.85 2.320 18.61
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
Valiant           18.1   6  225 105 2.76 3.460 20.22
                  vs am gear carb
Mazda RX4          0  1    4    4
Mazda RX4 Wag      0  1    4    4
Datsun 710         1  1    4    1
Hornet 4 Drive     1  0    3    1
Hornet Sportabout  0  0    3    2
Valiant            1  0    3    1
                mpg cyl  disp  hp drat    wt qsec vs
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1
               am gear carb
Porsche 914-2   1    5    2
Lotus Europa    1    5    2
Ford Pantera L  1    5    4
Ferrari Dino    1    5    6
Maserati Bora   1    5    8
Volvo 142E      1    4    2

Note that the mtcars data frame does have non-default row names which give the make and model of the cars.

4.5.1 Accessing Specific Elements of Data Frames

Data frames are two-dimensional, so to access a specific element (or elements) we need to specify both the row and column.
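A sketch of row and column indexing on mtcars that matches the output below. The specific indices are inferred from the printed values.

```r
mtcars[1, 4]         # first row, fourth column (hp)
mtcars[1:3, 3]       # rows 1 to 3 of the third column (disp)
mtcars[1:3, 2:3]     # rows 1 to 3 of columns 2 and 3 (cyl, disp)
mtcars[, 1]          # all rows of the first column (mpg)
```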

[1] 110
[1] 160 160 108
              cyl disp
Mazda RX4       6  160
Mazda RX4 Wag   6  160
Datsun 710      4  108
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

Note that mtcars[,1] returns ALL elements in the first column. This agrees with the behavior for vectors, where leaving a subscript out of the square brackets tells R to return all values. In this case we are telling R to return all rows, and the first column.

For a data frame there is another way to access specific columns, using the $ notation.
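A sketch of the $ notation. The bare names mpg and cyl are left as comments because, in a session where no such objects exist outside the data frame, they raise the errors shown below.

```r
mtcars$mpg    # the mpg column as a vector
mtcars$cyl    # the cyl column as a vector
# mpg         # error: object 'mpg' not found
# cyl         # error: object 'cyl' not found
```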

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8
[26] 4 4 4 8 6 8 4
Error in eval(expr, envir, enclos): object 'mpg' not found
Error in eval(expr, envir, enclos): object 'cyl' not found
[1] 123 157 202 199 223 140 105 194

Notice that typing the variable name, such as mpg, without the name of the data frame (and a dollar sign) as a prefix, does not work. This is sensible. There may be several data frames that have variables named mpg, and just typing mpg doesn’t provide enough information to know which is desired. But if there is a vector named mpg that is created outside a data frame, it will be retrieved when mpg is typed. This is why typing weight does work: weight was created outside of a data frame, although ultimately it was incorporated into the healthData data frame.

4.6 Lists

The third main data structure we will work with is a list. Technically a list is a vector, but one in which elements can be of different types. For example a list may have one element that is a vector, one element that is a data frame, and another element that is a function. Consider designing a function that fits a simple linear regression model to two quantitative variables. We might want that function to compute and return several things such as

  • The fitted slope and intercept (a numeric vector with two components)
  • The residuals (a numeric vector with \(n\) components, where \(n\) is the number of data points)
  • Fitted values for the data (a numeric vector with \(n\) components, where \(n\) is the number of data points)
  • The names of the dependent and independent variables (a character vector with two components)

In fact R has a function, lm, which does this (and much more).
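A sketch of a fit consistent with the output below. The formula mpg ~ hp is inferred from the printed coefficient names, and the object name mpgHpLinMod is the one used in the following discussion.

```r
mpgHpLinMod <- lm(mpg ~ hp, data = mtcars)   # regress mpg on horsepower
mode(mpgHpLinMod)                            # the fitted model is stored as a list
names(mpgHpLinMod)                           # names of the list components
mpgHpLinMod$coefficients                     # intercept and slope
mpgHpLinMod$residuals                        # one residual per car
```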

[1] "list"
 [1] "coefficients"  "residuals"     "effects"      
 [4] "rank"          "fitted.values" "assign"       
 [7] "qr"            "df.residual"   "xlevels"      
[10] "call"          "terms"         "model"        
(Intercept)          hp 
   30.09886    -0.06823 
          Mazda RX4       Mazda RX4 Wag 
           -1.59375            -1.59375 
         Datsun 710      Hornet 4 Drive 
           -0.95363            -1.19375 
  Hornet Sportabout             Valiant 
            0.54109            -4.83489 
         Duster 360           Merc 240D 
            0.91707            -1.46871 
           Merc 230            Merc 280 
           -0.81717            -2.50678 
          Merc 280C          Merc 450SE 
           -3.90678            -1.41777 
         Merc 450SL         Merc 450SLC 
           -0.51777            -2.61777 
 Cadillac Fleetwood Lincoln Continental 
           -5.71206            -5.02978 
  Chrysler Imperial            Fiat 128 
            0.29364             6.80421 
        Honda Civic      Toyota Corolla 
            3.84901             8.23598 
      Toyota Corona    Dodge Challenger 
           -1.98072            -4.36462 
        AMC Javelin          Camaro Z28 
           -4.66462            -0.08293 
   Pontiac Firebird           Fiat X1-9 
            1.04109             1.70421 
      Porsche 914-2        Lotus Europa 
            2.10991             8.01093 
     Ford Pantera L        Ferrari Dino 
            3.71340             1.54109 
      Maserati Bora          Volvo 142E 
            7.75761            -1.26198 

The lm function returns a list (which in the code above has been assigned to the object mpgHpLinMod). One component of the list is the length 2 vector of coefficients, while another component is the length 32 vector of residuals. The code also illustrates that named components of a list can be accessed using the dollar sign notation, as with data frames.

The list function is used to create lists.
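A sketch of a call that would build the list printed below. The name temporaryList is taken from later in the chapter, and weight and healthData are rebuilt here so the example runs on its own.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
healthData <- data.frame(
  Weight = weight,
  Gender = c("female", "female", "male", "female", "male", "male", "female", "male"),
  bp.meds = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# A list holding a vector, a data frame, and another list
temporaryList <- list(first = weight,
                      second = healthData,
                      pickle = list(a = 1:10, b = healthData))
temporaryList
```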

$first
[1] 123 157 202 199 223 140 105 194

$second
  Weight Gender bp.meds
1    123 female   FALSE
2    157 female    TRUE
3    202   male   FALSE
4    199 female   FALSE
5    223   male    TRUE
6    140   male   FALSE
7    105 female    TRUE
8    194   male   FALSE

$pickle
$pickle$a
 [1]  1  2  3  4  5  6  7  8  9 10

$pickle$b
  Weight Gender bp.meds
1    123 female   FALSE
2    157 female    TRUE
3    202   male   FALSE
4    199 female   FALSE
5    223   male    TRUE
6    140   male   FALSE
7    105 female    TRUE
8    194   male   FALSE

Here, for illustration, I assembled a list to hold some of the R data structures we have been working with in this chapter. The first list element, named first, holds the weight vector we created in Section 4.1, the second list element, named second, holds the healthData data frame, and the third list element, named pickle, holds a list with elements named a and b that hold a vector of values 1 through 10 and another copy of the healthData data frame, respectively. As this example shows, a list can contain another list.

4.6.1 Accessing Specific Elements of Lists

We already have seen the dollar sign notation works for lists. In addition, the square bracket subsetting notation can be used. There is an added, somewhat subtle wrinkle—using either single or double square brackets.
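A sketch comparing the three notations, using a cut-down version of temporaryList with only its first element. The calls to mode() report what kind of object each form returns.

```r
temporaryList <- list(first = c(123, 157, 202, 199, 223, 140, 105, 194))

temporaryList$first              # dollar sign: the vector itself
mode(temporaryList$first)
temporaryList[["first"]]         # double brackets: the vector itself
mode(temporaryList[["first"]])
temporaryList["first"]           # single brackets: a list holding the vector
mode(temporaryList["first"])
```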

[1] 123 157 202 199 223 140 105 194
[1] "numeric"
[1] 123 157 202 199 223 140 105 194
[1] "numeric"
$first
[1] 123 157 202 199 223 140 105 194
[1] "list"

Note the dollar sign and double bracket notation return a numeric vector, while the single bracket notation returns a list. Notice also the difference in results below.
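A sketch of the two calls being compared, with a smaller stand-in for temporaryList (the second element here is a short hypothetical vector rather than the full data frame):

```r
temporaryList <- list(first = c(123, 157, 202, 199, 223, 140, 105, 194),
                      second = c("a", "b"))   # hypothetical stand-in element

temporaryList[c(1, 2)]     # a list holding the first and second elements
temporaryList[[c(1, 2)]]   # recursive indexing: element 2 of element 1
```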

$first
[1] 123 157 202 199 223 140 105 194

$second
  Weight Gender bp.meds
1    123 female   FALSE
2    157 female    TRUE
3    202   male   FALSE
4    199 female   FALSE
5    223   male    TRUE
6    140   male   FALSE
7    105 female    TRUE
8    194   male   FALSE
[1] 157

The single bracket form returns the first and second elements of the list, while the double bracket form returns the second element of the first element of the list. Generally, do not put a vector of indices or names in a double bracket, because you will likely get unexpected results. See, for example, the results below.

Error in temporaryList[[c(1, 2, 3)]]: recursive indexing failed at level 2

So, in summary, there are two main differences between using the single bracket [] and double bracket [[]]. First, the single bracket returns a list that holds the object(s) at the given indices or names, whereas the double bracket returns the actual object held at the given index or name. Second, a single bracket can be used to access a range of list elements, whereas a double bracket can only access a single element of the list.

4.7 Subsetting with Logical Vectors

Consider the healthData data frame. How can we access only those weights which are more than 200? How can we access the genders of those whose weights are more than 200? How can we compute the mean weight of males and the mean weight of females? Or consider the mtcars data frame. How can we obtain the miles per gallon for all six cylinder cars? Both of these data sets are small enough that it would not be too onerous to extract the values by hand. But for larger or more complex data sets, this would be very difficult or impossible to do in a reasonable amount of time, and would likely result in errors.

R has a powerful method for solving these sorts of problems using a variant of the subsetting methods that we already have learned. When given a logical vector in square brackets, R will return the values corresponding to TRUE. To begin, focus on the weight and gender vectors created in Section 4.1.

The R code weight > 200 returns a TRUE for each value of weight which is more than 200, and a FALSE for each value of weight which is less than or equal to 200. Similarly gender == "female" returns TRUE or FALSE depending on whether an element of gender is equal to "female".
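The six lines discussed one by one below can be sketched as follows, assuming weight and gender hold the values used throughout the chapter:

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female", "male", "male", "female", "male")

weight
weight > 200                 # TRUE where the weight exceeds 200
gender[weight > 200]         # genders of people weighing more than 200
weight[weight > 200]         # the weights that exceed 200
gender == "female"           # TRUE where gender is "female"
weight[gender == "female"]   # weights of the females
```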

[1] 123 157 202 199 223 140 105 194
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
[1] "male" "male"
[1] 202 223
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
[1] 123 157 199 105

Consider the lines of R code one by one.

  • weight instructs R to display the values in the vector weight.
  • weight > 200 instructs R to check whether each value in weight is greater than 200, and to return TRUE if so, and FALSE otherwise.
  • The next line, gender[weight > 200], does two things. First, inside the square brackets, it does the same thing as the second line, namely, returning TRUE or FALSE depending on whether a value of weight is or is not greater than 200. Second, each element of gender is matched with the corresponding TRUE or FALSE value, and is returned if and only if the corresponding value is TRUE. For example the first value of gender is gender[1]. Since the first TRUE or FALSE value is FALSE, the first value of gender is not returned. Only the third and fifth values of gender, both of which happen to be male, are returned. Briefly, this line returns the genders of those people whose weight is over 200 pounds.
  • The fourth line of code, weight[weight > 200], again begins by returning TRUE or FALSE depending on whether elements of weight are larger than 200. Then those elements of weight corresponding to TRUE values are returned. So this line returns the weights of those people whose weights are more than 200 pounds.
  • The fifth line returns TRUE or FALSE depending on whether elements of gender are equal to female or not.
  • The sixth line returns the weights of those whose gender is female.

There are six comparison operators in R, >, <, >=, <=, ==, !=. Note that to test for equality a “double equals sign” is used, while != tests for inequality.

4.7.1 Modifying or Creating Objects via Subsetting

The results of subsetting can be assigned to a new (or existing) R object, and subsetting on the left side of an assignment is a common way to modify an existing R object.
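A sketch consistent with the output below. The names lightweight, x, and y, and the cut-offs, are assumptions inferred from the printed values.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
weight
lightweight <- weight[weight < 200]   # assign a subset to a new object
lightweight

x <- 1:10
x
x[x < 5] <- 0        # replace values below 5 with 0
x

y <- -3:9
y
y[y < 0] <- NA       # replace negative values with NA
y
```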

[1] 123 157 202 199 223 140 105 194
[1] 123 157 199 140 105 194
 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  0  0  0  0  5  6  7  8  9 10
 [1] -3 -2 -1  0  1  2  3  4  5  6  7  8  9
 [1] NA NA NA  0  1  2  3  4  5  6  7  8  9

4.7.2 Logical Subsetting and Data Frames

First consider the small and simple healthData data frame.
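A sketch of the subsetting calls that would give the output below. The conditions are inferred from the printed rows and from the discussion that follows.

```r
healthData <- data.frame(
  Weight = c(123, 157, 202, 199, 223, 140, 105, 194),
  Gender = c("female", "female", "male", "female", "male", "male", "female", "male"),
  bp.meds = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

healthData
healthData$Weight[healthData$Gender == "male"]   # weights of the males
healthData[healthData$Gender == "female", ]      # all columns, female rows
healthData[healthData$Weight > 190, 2:3]         # columns 2 and 3, weight over 190
```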

  Weight Gender bp.meds
1    123 female   FALSE
2    157 female    TRUE
3    202   male   FALSE
4    199 female   FALSE
5    223   male    TRUE
6    140   male   FALSE
7    105 female    TRUE
8    194   male   FALSE
[1] 202 223 140 194
  Weight Gender bp.meds
1    123 female   FALSE
2    157 female    TRUE
4    199 female   FALSE
7    105 female    TRUE
  Gender bp.meds
3   male   FALSE
4 female   FALSE
5   male    TRUE
8   male   FALSE

The first example is really just subsetting a vector, since the $ notation creates vectors. The next two examples return subsets of the whole data frame. Note that the logical vector subsets the rows of the data frame, choosing those rows where the gender is female in the first of these and those where the weight is more than 190 in the second. Note also that the specification for the columns (after the comma) is left blank in the first case, telling R to return all the columns. In the second case the second and third columns are requested explicitly.

Next consider the much larger and more complex WorldBank data frame. Recall, the str function displays the “structure” of an R object. Here is a look at the structure of several R objects.

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
List of 3
 $ first : num [1:8] 123 157 202 199 223 140 105 194
 $ second:'data.frame': 8 obs. of  3 variables:
  ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
  ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
  ..$ bp.meds: logi [1:8] FALSE TRUE FALSE FALSE TRUE FALSE ...
 $ pickle:List of 2
  ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
  ..$ b:'data.frame':   8 obs. of  3 variables:
  .. ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
  .. ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
  .. ..$ bp.meds: logi [1:8] FALSE TRUE FALSE FALSE TRUE FALSE ...
'data.frame':   11880 obs. of  15 variables:
 $ iso2c                       : chr  "AD" "AD" "AD" "AD" ...
 $ country                     : chr  "Andorra" "Andorra" "Andorra" "Andorra" ...
 $ year                        : int  1978 1979 1977 2007 1976 2011 2012 2008 1980 1972 ...
 $ fertility.rate              : num  NA NA NA 1.18 NA NA NA 1.25 NA NA ...
 $ life.expectancy             : num  NA NA NA NA NA NA NA NA NA NA ...
 $ population                  : num  33746 34819 32769 81292 31781 ...
 $ GDP.per.capita.Current.USD  : num  9128 11820 7751 39923 7152 ...
 $ X15.to.25.yr.female.literacy: num  NA NA NA NA NA NA NA NA NA NA ...
 $ iso3c                       : chr  "AND" "AND" "AND" "AND" ...
 $ region                      : chr  "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" ...
 $ capital                     : chr  "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" ...
 $ longitude                   : num  1.52 1.52 1.52 1.52 1.52 ...
 $ latitude                    : num  42.5 42.5 42.5 42.5 42.5 ...
 $ income                      : chr  "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" ...
 $ lending                     : chr  "Not classified" "Not classified" "Not classified" "Not classified" ...

First we see that mtcars is a data frame which has 32 observations (rows) on each of 11 variables (columns). The names of the variables are given, along with their type (in this case, all numeric), and the first few values of each variable are given.

Second we see that temporaryList is a list with three components. Each of the components is described separately, with the first few values again given.

Third we examine the structure of WorldBank. It is a data frame with 11880 observations on each of 15 variables. Some of these are character variables, some are numeric, and one (year) is integer. Looking at the first few values we see that some variables have missing values.
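As a minimal sketch of the kind of calls that produce output like this, the code below runs str() on mtcars, which ships with R, and on a small list. The list exampleList and its contents are hypothetical and invented for illustration only; they are not the temporaryList or WorldBank objects used in the text.

```r
# mtcars is part of the datasets package that loads with R.
str(mtcars)

# A small hypothetical list (an assumption, built only for illustration):
# a numeric vector, a data frame, and a nested list.
exampleList <- list(
  first  = c(123, 157, 202),
  second = data.frame(Weight = c(123, 157, 202),
                      Gender = c("female", "female", "male"),
                      stringsAsFactors = FALSE),
  nested = list(a = 1:3)
)

# str() describes each component, indenting nested components with "..".
str(exampleList)

# The logical tests from the start of the chapter confirm what str() reports.
isDf   <- is.data.frame(mtcars)
isList <- is.list(exampleList)
```

Because a data frame is itself a list of equal-length columns, str() shows its variables in the same way it shows the components of a list.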

Consider creating a data frame which only has the observations from one year, say 1971. That’s relatively easy. Just choose rows for which year is equal to 1971.

[1] 216  15

The dim function returns the dimensions of a data frame, i.e., the number of rows and the number of columns. From dim we see that there are 216 cases from 1971.
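The row selection described here might look like the following sketch. Since the WorldBank data frame is not reproduced in this section, the code uses a tiny stand-in called wbSmall whose name and values are hypothetical.

```r
# Hypothetical miniature stand-in for WorldBank (values invented for illustration).
wbSmall <- data.frame(
  country        = c("A", "A", "B", "B", "C", "C"),
  year           = c(1971L, 1972L, 1971L, 1972L, 1971L, 1972L),
  fertility.rate = c(6.5, 6.4, NA, 3.1, 2.2, NA),
  stringsAsFactors = FALSE
)

# Logical vector that is TRUE for rows whose year equals 1971.
is1971 <- wbSmall$year == 1971

# Keep the TRUE rows; leaving the column position blank keeps every column.
wbSmall1971 <- wbSmall[is1971, ]

# Number of rows, then number of columns.
dim(wbSmall1971)
```

The same pattern applied to the full data frame would be WorldBank[WorldBank$year == 1971, ].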

Next, how can we create a data frame which only contains data from 1971, and also only contains cases for which there are no missing values in the fertility rate variable? R has a built in function is.na which returns TRUE if the observation is missing and returns FALSE otherwise. And !is.na returns the negation, i.e., it returns FALSE if the observation is missing and TRUE if the observation is not missing.

 [1]    NA 6.512 7.671 3.517 4.933 3.118 7.264 3.104
 [9]    NA 2.200 2.961 2.788 4.479 2.260 2.775 2.949
[17] 6.942 2.210 6.657 2.100 6.293 7.329 6.786    NA
[25] 5.771
 [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [9] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[17]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[25]  TRUE
[1] 193  15

From dim we see that there are 193 cases from 1971 with non-missing fertility rate data.
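One way to combine the two conditions is with the elementwise "and" operator &. The sketch below again uses a small hypothetical data frame, wbSmall, rather than the real WorldBank data.

```r
# Hypothetical miniature stand-in for WorldBank (values invented for illustration).
wbSmall <- data.frame(
  country        = c("A", "A", "B", "B", "C", "C"),
  year           = c(1971L, 1972L, 1971L, 1972L, 1971L, 1972L),
  fertility.rate = c(6.5, 6.4, NA, 3.1, 2.2, NA),
  stringsAsFactors = FALSE
)

# TRUE where the fertility rate is missing.
isMissing <- is.na(wbSmall$fertility.rate)

# Negation: TRUE where the fertility rate is present.
isPresent <- !is.na(wbSmall$fertility.rate)

# A row is kept only if it is from 1971 AND its fertility rate is present.
keep <- wbSmall$year == 1971 & isPresent
wbSmall1971Fert <- wbSmall[keep, ]

dim(wbSmall1971Fert)
```

In this toy data, country B has a missing fertility rate in 1971, so only two of the three 1971 rows survive.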

Return attention now to the original WorldBank data frame with data not only from 1971. How can we extract only those cases (rows) which have NO missing data? Consider the following simple example:

  V1 V2 V3
1  1 NA  1
2  2  1  2
3  3  4  3
4  4  5  5
5 NA NA  7
        V1    V2    V3
[1,] FALSE  TRUE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,]  TRUE  TRUE FALSE
[1] 1 0 0 0 2

First notice that is.na will test each element of a data frame for missingness. Also recall that if R is asked to sum a logical vector, it will first convert the logical vector to numeric and then compute the sum, which effectively counts the number of elements in the logical vector which are TRUE. The rowSums function computes the sum of each row. So rowSums(is.na(temporaryDataFrame)) returns a vector with as many elements as there are rows in the data frame. If an element is zero, the corresponding row has no missing values. If an element is greater than zero, the value is the number of variables which are missing in that row. This gives a simple method to return all the cases which have no missing data.
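The steps above can be sketched as follows. The values of temporaryDataFrame are reconstructed from the printed output shown earlier; the names missingPerRow and completeRows are introduced here for illustration.

```r
# Values taken from the printed data frame above.
temporaryDataFrame <- data.frame(
  V1 = c(1, 2, 3, 4, NA),
  V2 = c(NA, 1, 4, 5, NA),
  V3 = c(1, 2, 3, 5, 7)
)

# is.na() gives a logical matrix; rowSums() counts the TRUE values in each row.
missingPerRow <- rowSums(is.na(temporaryDataFrame))
missingPerRow

# Keep only the rows with zero missing values.
completeRows <- temporaryDataFrame[missingPerRow == 0, ]
completeRows

# Base R also offers complete.cases(), which returns TRUE for rows with no NA.
completeRows2 <- temporaryDataFrame[complete.cases(temporaryDataFrame), ]
```

Both approaches keep rows 2, 3, and 4 here. The rowSums version has the advantage of also reporting how many values are missing in each row.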

[1] 11880    15
[1] 564  15

Out of the 11880 rows in the original data frame, only 564 have no missing observations!

4.8 Patterned Data

Sometimes it is useful to generate all the integers from 1 through 20, to generate a sequence of 100 points equally spaced between 0 and 1, etc. The R functions seq() and rep() as well as the “colon operator” : help to generate such sequences.

The colon operator generates a sequence of values with increments of \(1\) or \(-1\).

 [1]  1  2  3  4  5  6  7  8  9 10
[1] -5 -4 -3 -2 -1  0  1  2  3
[1] 10  9  8  7  6  5  4
[1] 3.142 4.142 5.142 6.142
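The calls behind this output are not shown in this section, so the sketch below uses calls that are consistent with it. The endpoint 7 in the last call and the variable names are assumptions made for illustration.

```r
up     <- 1:10   # integers 1 through 10
across <- -5:3   # unary minus is applied before the colon, so this starts at -5
down   <- 10:4   # a larger left endpoint counts down in steps of -1
nonInt <- pi:7   # starts at pi and steps by 1 while staying at or below 7

up
across
down
nonInt
```

The left endpoint does not need to be an integer. The sequence starts there, moves in steps of one, and stops before passing the right endpoint.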

The seq() function generates either a sequence of pre-specified length or a sequence with pre-specified increments.

 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
 [1] 1.000 1.333 1.667 2.000 2.333 2.667 3.000 3.333
 [9] 3.667 4.000 4.333 4.667 5.000
 [1]  3.0000  2.5556  2.1111  1.6667  1.2222  0.7778
 [7]  0.3333 -0.1111 -0.5556 -1.0000
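As a sketch, the following calls are consistent with the three sequences printed above, although the original calls are not shown in this section. The argument by fixes the increment, while length.out fixes the number of points and lets R compute the increment.

```r
# Fixed increment: 0 to 1 in steps of 0.1 (11 values).
byTenths <- seq(from = 0, to = 1, by = 0.1)

# Fixed length: 13 equally spaced points from 1 to 5 (increment 1/3).
thirteen <- seq(from = 1, to = 5, length.out = 13)

# A decreasing sequence: 10 equally spaced points from 3 down to -1.
decreasing <- seq(from = 3, to = -1, length.out = 10)

byTenths
thirteen
decreasing
```

Specify either by or length.out together with from and to, not both, since the two endpoints and one of these already determine the sequence.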

The rep() function replicates the values in a given vector.

[1] 1 2 4 1 2 4 1 2 4
[1] 1 2 4 1 2 4 1 2 4
 [1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "c" "c" "c"
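The calls below are one way to produce output like that shown above; they are a sketch, since the original calls are not shown in this section. The final call, using each, is an extra illustration that does not appear in the output above.

```r
# Recycle the vector until the result has 9 elements.
byLength <- rep(c(1, 2, 4), length.out = 9)

# Repeat the whole vector 3 times (same result as above).
byTimes <- rep(c(1, 2, 4), times = 3)

# A vector for times repeats each element its own number of times.
unequal <- rep(c("a", "b", "c"), times = c(3, 2, 7))

# each repeats every element before moving on to the next one.
byEach <- rep(c(1, 2, 4), each = 3)

byLength
byTimes
unequal
byEach
```

Note the difference between times = 3, which gives 1 2 4 1 2 4 1 2 4, and each = 3, which gives 1 1 1 2 2 2 4 4 4.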

4.8.1 Practice Problem

Often when using R you will want to simulate data from a specific probability distribution (e.g., normal/Gaussian, binomial, Poisson). R has a vast suite of functions for working with statistical distributions. To generate values from a statistical distribution, the function has a name beginning with an “r” followed by some abbreviation of the probability distribution. For example, to simulate from the three distributions mentioned above, we can use the functions rnorm(), rbinom(), and rpois().
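To give a feel for the calling pattern, here is a brief sketch that draws five values from each of the three distributions. The seed and all parameter values are arbitrary choices made for illustration.

```r
# Setting a seed makes the random draws reproducible; 42 is an arbitrary choice.
set.seed(42)

# Five draws from a normal distribution with mean 10 and standard deviation 2.
normalDraws <- rnorm(n = 5, mean = 10, sd = 2)

# Five draws from a binomial distribution: 20 trials, success probability 0.3.
binomialDraws <- rbinom(n = 5, size = 20, prob = 0.3)

# Five draws from a Poisson distribution with mean 4.
poissonDraws <- rpois(n = 5, lambda = 4)

normalDraws
binomialDraws
poissonDraws
```

In each function the first argument, n, is the number of values to generate, and the remaining arguments are the parameters of the distribution. Note that rnorm() takes the standard deviation, sd, rather than the variance.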

Use the rnorm() function to generate 10,000 values from the standard normal distribution (the normal distribution with mean = 0 and variance = 1). Consult the help page for rnorm() if you need to. Save these values to a vector named sim.vals. Then use the hist() function to draw a histogram of the simulated data. Do the data look like they follow a normal distribution?

4.9 Exercises

Exercise 3 Learning objectives: create, subset, and manipulate vector contents and attributes; summarize vector data using R’s table() and other functions; generate basic graphics using vector data.

Exercise 4 Learning objectives: use functions to describe data frame characteristics; summarize and generate basic graphics for variables held in data frames; apply the subset function with logical operators; illustrate how NA, NaN, Inf, and other special values occur; recognize the implications of using floating point arithmetic with logical operators.

Exercise 5 Learning objectives: practice with lists, data frames, and associated functions; summarize variables held in lists and data frames; work with R’s linear regression lm() function output; review logical subsetting of vectors for partitioning and assigning of new values; generate and visualize data from mathematical functions.


  1. Technically the objects described in this section are “atomic” vectors (all elements of the same type), since lists, to be described below, also are actually vectors. This will not be an important issue, and the shorter term vector will be used for atomic vectors below.

  2. Missing data will be discussed in more detail later in the chapter.

  3. The mode function returns the type or storage mode of an object.

  4. Try this example using only single brackets; it will return a list holding elements first, second, and pickle.