Chapter 7 Functions and Programming

We have been working with a wide variety of R functions, from simple functions such as mean() and sd() to more complex functions such as ggplot() and apply(). Gaining a better understanding of existing functions and the ability to write your own functions dramatically increases what we can do with R. Learning about R’s programming capabilities is an important step in gaining facility with functions.

7.1 R Functions

Data on the yield (pounds per acre) of two types of corn seeds (regular and kiln dried) were collected. Each of the 11 plots of land was split into two subplots, and one of the subplots was planted in regular corn while the other was planted in kiln dried corn. These data were analyzed in a famous paper authored by William Gosset. Here are the data.

   regular kiln_dried
1     1903       2009
2     1935       1915
3     1910       2011
4     2496       2463
5     2108       2180
6     1961       1925
7     2060       2122
8     1444       1482
9     1612       1542
10    1316       1443
11    1511       1535

A paired t test, or a confidence interval for the mean difference, may be used to assess the difference in yield between the two varieties. Of course R has a function t.test that performs a paired t test and computes a confidence interval, but we will perform the test without using that function. We will focus for now on testing the hypotheses \(H_0\colon \mu_d = 0\) versus \(H_a\colon \mu_d \neq 0\) and on a two-sided confidence interval for \(\mu_d\). Here \(\mu_d\) represents the population mean difference.

The paired \(t\) statistic is defined by \[\begin{equation} t = \frac{\overline d}{S_d/\sqrt{n}} \end{equation}\]

where \(\overline d\) is the mean of the differences, \(S_d\) is the standard deviation of the differences, and \(n\) is the sample size. The p-value is twice the area to the right of \(|t_{\text{obs}}|\), where \(t_{\text{obs}}\) is the observed \(t\) statistic, and a confidence interval is given by \[\begin{equation} \overline d \pm t^* (S_d/\sqrt{n}). \end{equation}\]

Here \(t^*\) is an appropriate quantile of a \(t\) distribution with \(n-1\) degrees of freedom.

[1] 1.69
[1] 0.1218
[1] -10.73
[1] 78.18

With a few lines of R code we have calculated the t statistic, the p-value, and the confidence interval. Since paired t tests are pretty common, however, it would be helpful to automate this procedure. One obvious reason is to save time and effort, but another important reason is to avoid mistakes. It would be easy to make a mistake (e.g., using \(n\) instead of \(n-1\) as the degrees of freedom) when repeating the above computations.

Here is a first basic function which automates the computation.

And here is the function in action

$tstat
[1] 1.69

$pval
[1] 0.1218

$lcl
[1] -10.73

$ucl
[1] 78.18

An explanation and comments on the function are in order.

  • paired_t <- function(x1, x2) assigns a function of two variables, x1 and x2, to an R object called paired_t.
  • The compound expression, i.e., the code that makes up the body of the function, is enclosed in curly braces {}.
  • return(list(tstat = tstat, pval = pval, lcl=lcl, ucl=ucl)) indicates the object(s) returned by the function. In this case the function returns a list with four components.
  • The body of the function basically mimics the computations required to compute the t statistic, the p-value, and the confidence interval.
  • Several objects such as n and dbar are created in the body of the function. These objects are NOT available outside the function. We will discuss this further when we cover environments and scope in R.

Our function has automated the basic calculations. But it is still somewhat limited in usefulness. For example, it only computes a 95% confidence interval, while a user may want a different confidence level. And the function only performs a two-sided test, while a user may want a one-sided procedure. We modify the function slightly to allow the user to specify the confidence level next.

$tstat
[1] 1.69

$pval
[1] 0.1218

$lcl
[1] -10.73

$ucl
[1] 78.18
$tstat
[1] 1.69

$pval
[1] 0.1218

$lcl
[1] -29.5

$ucl
[1] 96.96

Two things to note. First, arguments do not have to be named. So

paired_t(corn$kiln_dried, corn$regular)

and

paired_t(x1 = corn$kiln_dried, x2 = corn$regular)

are equivalent. But we need to be careful if we do not name arguments because then we have to know the ordering of the arguments in the function declaration.

Second, in the declaration of the function, the third argument cl was given a default value of 0.95. If a user does not specify a value for cl it will silently be set to 0.95. But of course a user can override this, as we did in

paired_t(corn$kiln_dried, corn$regular, cl = 0.99)

7.1.1 Practice Problem

Like all things in R, getting the hang of writing functions just requires practice. Create a simple function called FtoK that is given a temperature in Farenheit and converts it to Kelvin using the following formula

\[ K = (F - 32) * \frac{5}{9} + 273.15 \]

You should get the following if your function is correct

[1] 299.8

7.1.2 Creating Functions

Creating very short functions at the command prompt is a reasonable strategy. For longer functions, one option is to write the function in a script and then submit the whole function. Or a function can be written in any text editor, saved as a plain text file (possibly with a .r extension), and then read into R using the source() function.

7.2 Programming: Conditional Statements

The paired_t function is somewhat useful, but could be improved in several ways. For example, consider the following:

Warning in x1 - x2: longer object length is not a
multiple of shorter object length

Warning in x1 - x2: longer object length is not a
multiple of shorter object length
$tstat
[1] 1

$pval
[1] 0.3739

$lcl
[1] -1.421

$ucl
[1] 3.021

The user specified data had different numbers of observations in x1 and x2, which of course can’t be data tested by a paired t test. Rather than stopping and letting the user know that this is a problem, the function continued and produced meaningless output.

Also, the function as written only allows testing against a two-sided alternative hypothesis, and it would be good to allow one-sided alternatives.

First we will address some checks on arguments specified by the user. For this we will use an if() function and a stop() function.

Error in paired_t(1:5, 1:4): The input vectors  must have the same length

The argument to the if() function is evaluated. If the argument returns TRUE the ensuing code is executed. Otherwise, the ensuing code is skipped and the rest of the function is evaluated. If a stop() function is executed, the function is exited and the argument of stop() is printed.

To better understand and use if() statements, we need to understand comparison operators and logical operators.

7.2.1 Comparison and Logical Operators

We have made use of some of the comparison operators in R. These include

  • Equal: ==
  • Not equal: !=
  • Greater than: >
  • Less than: <
  • Greater than or equal to: >=
  • Less than or equal to: <=

Special care needs to be taken with the == and != operators because of how numbers are represented on computers, see Section 7.3.

There are also three logical operators, with two variants of the “and” operator and the “or” operator.

  • and: Either & or &&
  • or: Either | or ||
  • not: !

The “double” operators && and || just examine the first element of the two vectors, whereas the “single” operators & and | compare element by element.

[1] TRUE
[1]  TRUE  TRUE FALSE
[1] FALSE
[1] FALSE  TRUE FALSE

We can use the logical operators to check whether a user-specified confidence level is between 0 and 1.

Error in paired_t(1:5, 2:6, cl = 15): The confidence level must be between 0 and 1

7.2.2 If else statements

The if() statement we have used so far has the form

if (condition) {
  expression
}

Often we want to evaluate one expression if the condition is true, and evaluate a different expression if the condition is false. That is accomplished by the if else statement. Here we determine whether a number is positive, negative, or zero.

[1] "the number is positive"
[1] "the number is negative"
[1] "the number is zero"

Notice the “different expression” for the first if statement was itself an if statement.

Next we modify the paired_t function to allow two-sided and one-sided alternatives.

$tstat
[1] 1.69

$pval
[1] 0.1218

$lcl
[1] -10.73

$ucl
[1] 78.18
$tstat
[1] 1.69

$pval
[1] 0.9391

$lcl
[1] -Inf

$ucl
[1] 69.89
$tstat
[1] 1.69

$pval
[1] 0.06091

$lcl
[1] -2.434

$ucl
[1] Inf

7.3 Computer Arithmetic

Like most software, R does not perform exact arithmetic. Rather, R follows the IEEE 754 floating point standards. This can have profound effects on how computational algorithms are implemented, but is also important when considering things like comparisons.

Note first that computer arithmetic does not follow some of the rules of ordinary arithmetic. For example, it is not associative.

[1] 0.0000000009313
[1] 0.0000000009313
[1] 0

Computer arithmetic is not exact.

[1] 0.1
[1] FALSE
[1] 0.00000000000000008327

So for example an if statement that uses an equality test may not give the expected answer. One way to avoid this problem is to test “near equality” using all.equal(). The function takes as arguments two objects to be compared, and a tolerance. If the objects are within the tolerance of each other, the function returns TRUE. The tolerance has a default value of about \(1.5\times 10^{-8}\), which works well in many cases.

[1] TRUE

7.4 Loops

Loops are an important component of any programming language, including R. Vectorized calculations and functions such as apply() make loops a bit less central to R than to many other languages, but an understanding of the three looping structures in R is still quite important.

We will investigate loops in the context of computing what is sometimes called the “machine epsilon.” Because of the inexact representation of numbers in R (and other languages) sometimes R cannot distinguish between the numbers 1 and |1 + x| for small values of x. The smallest value of x such that 1 and |1+x| are not declared equal is the machine epsilon.

[1] FALSE
[1] TRUE

Clearly the machine epsilon is somewhere between \(10^{-4}\) and \(10^{-50}\). How can we find its value exactly? Since floating point numbers use a binary representation, we know that the machine epsilon will be equal to \(1/2^k\) for some value of \(k\). So to find the machine epsilon, we can keep testing whether \(1\) and \(1+1/2^k\) are equal, until we find a value \(k\) where the two are equal. Then the machine epsilon will be \(1/2^{k-1}\), since it is the smallest value for which the two are NOT equal.

[1] FALSE
[1] FALSE
[1] FALSE

Testing by hand like this gets tedious quickly. A loop can automate the process. We will do this with two R loop types, repeat and while.

7.4.1 A Repeat Loop

A repeat loop just repeats a given expression over and over again until a break statement is encountered.

[1] 53
[1] 0.000000000000000222

This code initializes k at 1. The body of the loop initially checks whether \(1\) and \(1+1/2^k\) are equal. If they are equal, the break statement is executed and control is transferred outside the loop. If they are not equal, k is increased by 1, and we return to the beginning of the top of the body of the loop.

7.4.2 A While Loop

A while loop has the form

while (condition) {
  expression
}

As long as the condition is TRUE the expression is evaluated. Once the condition is FALSE control is transferred outside the loop.

[1] 53
[1] 0.000000000000000222

7.4.3 A For Loop

A for loop has the form

for (variable in vector) {
  expression
}

The for loop sets the variable equal to each element of the vector in succession, and evaluates the expression each time. Here are two different ways to use a for loop to calculate the sum of the elements in a vector.

[1] 55
[1] 55

In the first case we loop over the positions of the vector elements, while in the second case we loop over the elements themselves.

7.4.4 Practice Problem

Often when students initially learn about if() statements and for() loops, they use them more often than is necessary, leading to over complications and often slower, less sophisticated code. Many simple if() statements can be accomplished using logical subsetting and vectorization. Using the ideas you used above, and what you learned in Chapters 4 and 6, we can use logical subsetting to replace the simplest of if() statements. Rewrite the following if() statement and for() loop using logical subsetting.

7.5 Efficiency Considerations

In many contexts R and modern computers are fast enough that the user does not need to worry about writing efficient code. There are a few simple ways to write efficient code that are easy enough, and provide enough speed-up, that they are worth following as often as possible. The R function system.time() reports how long a set of R code takes to execute, and we will use this function to compare different ways to accomplish objectives in R.

7.5.1 Growing Objects

Consider two ways to create a sequence of integers from 1 to n, implemented in functions f1 and f2.

  1. Start with a zero-length vector, and let it grow:
  1. Start with a vector of length \(n\) and fill in the values:

Here are the two functions in action, with \(n = 100000\).

   user  system elapsed 
 11.393   0.036  11.432 
   user  system elapsed 
  0.014   0.000   0.015 

It is much more efficient to start with a full-length vector and then fill in values.37

Of course another way to create a vector of integers from 1 to n is to use 1:n. Let’s see how fast this is.

   user  system elapsed 
      0       0       0 
   user  system elapsed 
      0       0       0 

For \(n=100000\) this is so fast the system time is very close to zero. Even when \(n\) is 1000000 the time is very small. So another important lesson is to use built-in R functions whenever possible, since they have had substantial development focused on efficiency, correctness, etc.

7.5.2 Vectorization

Next consider calculating the sum of the squared entries in each column of a matrix. For example with the matrix \(M\),

\[ M = \left(\begin{array}{ccc} 1 & 2 & 3 \\ 4 & 5 & 6 \end{array}\right), \]

the sums would be \(1^2 + 4^2 = 17\), \(2^2 + 5^2 = 29\), and \(3^2 + 6^2 = 45\). One possibility is to have an outer loop that traverses the columns and an inner loop that traverses the rows within a column, squaring the entries and adding them together.

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[1] 17 29 45

Another possibility eliminates the inner loop, using the sum() function to compute the sum of the squared entries in the column directly.

[1] 17 29 45

A third possibility uses the colSums() function.

[1] 17 29 45

Here is a speed comparison, using a \(1000\times 10000\) matrix.

   user  system elapsed 
  2.454   0.000   2.454 
   user  system elapsed 
  0.080   0.016   0.096 
   user  system elapsed 
  0.128   0.024   0.152 

7.6 More on Functions

Understanding functions deeply requires a careful study of R’s scoping rules, as well as a good understanding of environments in R. That’s beyond the scope of this book, but we will briefly discuss some issues that are most salient. For a more in-depth treatment, see “Advanced R” by Hadley Wickham, especially the chapters on functions and environments.

7.7 Calling Functions

When using a function, the functions arguments can be specified in three ways:

  1. By the full name of the argument.
  2. By the position of the argument.
  3. By a partial name of the argument.
[1] 34 15 11 99

Positional matching of arguments is convenient, but should be used carefully, and probably limited to the first few, and most commonly used, arguments in a function. Partial does have pitfalls. A partially specified argument must unambiguously match exactly one argument—a requirement that’s not met below.

Error in tmp_function(1, f = 2): argument 2 matches multiple formal arguments

7.7.1 The ... argument

In defining a function, a special argument denoted by ... can be used. Sometimes this is called the “ellipsis” argument, sometimes the “three dot” argument, sometimes the “dot dot dot” argument, etc. The R language definition https://cran.r-project.org/doc/manuals/r-release/R-lang.html describes the argument in this way:

The special type of argument `…’ can contain any number of supplied arguments. It is used for a variety of purposes. It allows you to write a function that takes an arbitrary number of arguments. It can be used to absorb some arguments into an intermediate function which can then be extracted by functions called subsequently.

Consider for example the sum() function.

[1] 15
[1] 112
[1] 118

Think about writing such a function. There is no way to predict in advance the number of arguments a user might specify. So the function is defined with ... as the first argument:

function (..., na.rm = FALSE)  .Primitive("sum")

This is true of many commonly-used functions in R such as c() among others.

Next, consider a function that calls another function in its body. For example, suppose that a collaborator always supplies comma delimited files that have five lines of description, followed by a line containing variable names, followed by the data. You are tired of having to specify skip = 5, header = TRUE, and sep = "," to read.table() and want to create a function my.read() which uses these as defaults.

The ... in the definition of my.read() allows the user to specify other arguments, for example, stringsAsFactors = FALSE. These will be passed on to the read.table() function. In fact, that is how read.csv() is defined.

function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
    fill = TRUE, comment.char = "", ...) 
read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x7fe63a1d5b60>
<environment: namespace:utils>

7.7.2 Lazy Evaluation

Arguments to R functions are not evaluated until they are needed, sometimes called lazy evaluation.

[1] 9
[1] 27
[1] 6
[1] 9
[1] 27
Error in print(a * b): argument "b" is missing, with no default

The first call specified both of the arguments a and b, and produced the expected output. In the second call the argument b was not specified. Since it was not needed until the third print statement, R happily executed the first two print statements, and only reported an error in the third statement, when b was needed to compute a*b.

[1] 16
[1] 20

In the first call, since b was not specified, it was computed as a^3. In the second call, b was specified, and the specified value was used.

7.8 Exercises

Exercise 10 Learning objectives: translate statistical notation into coded functions; learn about tools for checking validity of function arguments; practice writing functions that return multiple objects.


  1. Roughly speaking, the first option is slower because each time the vector is increased in size, R must resize the vector and re-allocate memory.