Chapter 7 Functions and Programming
We have been working with a wide variety of R functions, from simple functions such as mean() and sd() to more complex functions such as ggplot() and apply(). Gaining a better understanding of existing functions and the ability to write your own functions dramatically increases what we can do with R. Learning about R’s programming capabilities is an important step in gaining facility with functions.
7.1 R Functions
Data on the yield (pounds per acre) of two types of corn seeds (regular and kiln dried) were collected. Each of the 11 plots of land was split into two subplots, and one of the subplots was planted in regular corn while the other was planted in kiln dried corn. These data were analyzed in a famous paper authored by William Gosset. Here are the data.
> u.corn <- "https://www.finley-lab.com/files/data/corn.csv"
> corn <- read.csv(u.corn, header = TRUE)
> corn
regular kiln_dried
1 1903 2009
2 1935 1915
3 1910 2011
4 2496 2463
5 2108 2180
6 1961 1925
7 2060 2122
8 1444 1482
9 1612 1542
10 1316 1443
11 1511 1535
A paired t test, or a confidence interval for the mean difference, may be used to assess the difference in yield between the two varieties. Of course R has a function t.test
that performs a paired t test and computes a confidence interval, but we will perform the test without using that function. We will focus for now on testing the hypotheses \(H_0\colon \mu_d = 0\) versus \(H_a\colon \mu_d \neq 0\) and on a two-sided confidence interval for \(\mu_d\). Here \(\mu_d\) represents the population mean difference.
The paired \(t\) statistic is defined by \[\begin{equation} t = \frac{\overline d}{S_d/\sqrt{n}} \end{equation}\]
where \(\overline d\) is the mean of the differences, \(S_d\) is the standard deviation of the differences, and \(n\) is the sample size. The p-value is twice the area to the right of \(|t_{\text{obs}}|\), where \(t_{\text{obs}}\) is the observed \(t\) statistic, and a confidence interval is given by \[\begin{equation} \overline d \pm t^* (S_d/\sqrt{n}). \end{equation}\]
Here \(t^*\) is an appropriate quantile of a \(t\) distribution with \(n-1\) degrees of freedom.
> dbar <- mean(corn$kiln_dried - corn$regular)
> n <- length(corn$regular)
> S_d <- sd(corn$kiln_dried - corn$regular)
> t_obs <- dbar/(S_d/sqrt(n))
> t_obs
[1] 1.69
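The p-value and the 95% confidence limits can then be computed directly from these quantities, following the formulas above; for example:
> 2 * (1 - pt(abs(t_obs), n - 1))       # p-value
> dbar - qt(0.975, n - 1) * S_d/sqrt(n) # lower confidence limit
> dbar + qt(0.975, n - 1) * S_d/sqrt(n) # upper confidence limit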
[1] 0.1218
[1] -10.73
[1] 78.18
With a few lines of R code we have calculated the t statistic, the p-value, and the confidence interval. Since paired t tests are pretty common, however, it would be helpful to automate this procedure. One obvious reason is to save time and effort, but another important reason is to avoid mistakes. It would be easy to make a mistake (e.g., using \(n\) instead of \(n-1\) as the degrees of freedom) when repeating the above computations.
Here is a first basic function which automates the computation.
> paired_t <- function(x1, x2) {
+ n <- length(x1)
+ dbar <- mean(x1 - x2)
+ s_d <- sd(x1 - x2)
+ tstat <- dbar/(s_d/sqrt(n))
+ pval <- 2 * (1 - pt(abs(tstat), n - 1))
+ margin <- qt(0.975, n - 1) * s_d/sqrt(n)
+ lcl <- dbar - margin
+ ucl <- dbar + margin
+ return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
+ }
And here is the function in action, applied to the corn data:
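> paired_t(corn$kiln_dried, corn$regular)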
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -10.73
$ucl
[1] 78.18
An explanation and comments on the function are in order.
- paired_t <- function(x1, x2) assigns a function of two variables, x1 and x2, to an R object called paired_t.
- The compound expression, i.e., the code that makes up the body of the function, is enclosed in curly braces {}.
- return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl)) indicates the object(s) returned by the function. In this case the function returns a list with four components.
- The body of the function basically mimics the computations required to compute the t statistic, the p-value, and the confidence interval.
- Several objects such as n and dbar are created in the body of the function. These objects are NOT available outside the function. We will discuss this further when we cover environments and scope in R.
Our function has automated the basic calculations, but it is still somewhat limited in usefulness. For example, it only computes a 95% confidence interval, while a user may want a different confidence level. And the function only performs a two-sided test, while a user may want a one-sided procedure. Next we modify the function slightly to allow the user to specify the confidence level.
> paired_t <- function(x1, x2, cl = 0.95) {
+ n <- length(x1)
+ dbar <- mean(x1 - x2)
+ s_d <- sd(x1 - x2)
+ tstat <- dbar/(s_d/sqrt(n))
+ pval <- 2 * (1 - pt(abs(tstat), n - 1))
+ pctile <- 1 - (1 - cl)/2
+ margin <- qt(pctile, n - 1) * s_d/sqrt(n)
+ lcl <- dbar - margin
+ ucl <- dbar + margin
+ return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
+ }
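Calling the function with the default 95% confidence level reproduces the earlier results:
> paired_t(corn$kiln_dried, corn$regular)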
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -10.73
$ucl
[1] 78.18
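Requesting a 99% confidence level instead gives a wider interval:
> paired_t(corn$kiln_dried, corn$regular, cl = 0.99)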
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -29.5
$ucl
[1] 96.96
Two things to note. First, arguments do not have to be named, so paired_t(corn$kiln_dried, corn$regular) and paired_t(x1 = corn$kiln_dried, x2 = corn$regular) are equivalent. But we need to be careful if we do not name arguments, because then we have to know the ordering of the arguments in the function declaration. Second, in the declaration of the function, the third argument cl was given a default value of 0.95. If a user does not specify a value for cl it will silently be set to 0.95. But of course a user can override this, as we did in paired_t(corn$kiln_dried, corn$regular, cl = 0.99).
7.1.1 Practice Problem
Like all things in R, getting the hang of writing functions just requires practice. Create a simple function called FtoK that takes a temperature in Fahrenheit and converts it to Kelvin using the following formula:
\[ K = (F - 32) * \frac{5}{9} + 273.15 \]
You should get the following if your function is correct:
[1] 299.8
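One possible implementation is sketched below. With R's default printing, FtoK(80) displays as 299.8167, which rounds to the value shown above.
> FtoK <- function(f) {
+   (f - 32) * 5/9 + 273.15
+ }
> FtoK(80)  # 299.8167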
7.1.2 Creating Functions
Creating very short functions at the command prompt is a reasonable strategy. For longer functions, one option is to write the function in a script and then submit the whole function. Or a function can be written in any text editor, saved as a plain text file (possibly with a .r
extension), and then read into R using the source()
function.
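For example, if the paired_t() function is saved in a file called paired_t.r (a hypothetical file name), it can be loaded with:
> source("paired_t.r")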
7.2 Programming: Conditional Statements
The paired_t function is somewhat useful, but it could be improved in several ways. For example, consider what happens when the two input vectors have different lengths:
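> paired_t(1:5, 1:4)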
Warning in x1 - x2: longer object length is not a
multiple of shorter object length
Warning in x1 - x2: longer object length is not a
multiple of shorter object length
$tstat
[1] 1
$pval
[1] 0.3739
$lcl
[1] -1.421
$ucl
[1] 3.021
The user-specified data had different numbers of observations in x1 and x2, which of course cannot be analyzed with a paired t test. Rather than stopping and letting the user know that this is a problem, the function continued and produced meaningless output.
Also, the function as written only allows testing against a two-sided alternative hypothesis, and it would be good to allow one-sided alternatives.
First we will address some checks on arguments specified by the user. For this we will use an if()
function and a stop()
function.
> paired_t <- function(x1, x2, cl = 0.95) {
+ if (length(x1) != length(x2)) {
+ stop("The input vectors must have the same length")
+ }
+ n <- length(x1)
+ dbar <- mean(x1 - x2)
+ s_d <- sd(x1 - x2)
+ tstat <- dbar/(s_d/sqrt(n))
+ pval <- 2 * (1 - pt(abs(tstat), n - 1))
+ pctile <- 1 - (1 - cl)/2
+ margin <- qt(pctile, n - 1) * s_d/sqrt(n)
+ lcl <- dbar - margin
+ ucl <- dbar + margin
+ return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
+ }
> paired_t(1:5, 1:4)
Error in paired_t(1:5, 1:4): The input vectors must have the same length
The argument to the if()
function is evaluated. If the argument returns TRUE
the ensuing code is executed. Otherwise, the ensuing code is skipped and the rest of the function is evaluated. If a stop()
function is executed, the function is exited and the argument of stop()
is printed.
To better understand and use if()
statements, we need to understand comparison operators and logical operators.
7.2.1 Comparison and Logical Operators
We have made use of some of the comparison operators in R. These include
- Equal:
==
- Not equal:
!=
- Greater than:
>
- Less than:
<
- Greater than or equal to:
>=
- Less than or equal to:
<=
Special care needs to be taken with the ==
and !=
operators because of how numbers are represented on computers; see Section 7.3.
There are also three logical operators, with two variants of the “and” operator and the “or” operator.
- and: Either
&
or&&
- or: Either
|
or||
- not:
!
The “double” operators &&
and ||
just examine the first element of the two vectors, whereas the “single” operators &
and |
compare element by element.
[1] TRUE
[1] TRUE TRUE FALSE
[1] FALSE
[1] FALSE TRUE FALSE
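For example, with length-three logical vectors (illustrative values), older versions of R behave as described above; note that current versions of R warn or signal an error when && or || are given vectors of length greater than one:
> c(TRUE, TRUE, FALSE) && c(TRUE, TRUE, TRUE)     # first elements only: TRUE
> c(TRUE, TRUE, FALSE) & c(TRUE, TRUE, TRUE)      # element by element
> c(FALSE, TRUE, FALSE) || c(FALSE, FALSE, FALSE) # first elements only: FALSE
> c(FALSE, TRUE, FALSE) | c(FALSE, FALSE, FALSE)  # element by element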
We can use the logical operators to check whether a user-specified confidence level is between 0 and 1.
> paired_t <- function(x1, x2, cl = 0.95) {
+ if (length(x1) != length(x2)) {
+ stop("The input vectors must have the same length")
+ }
+ if (cl <= 0 || cl >= 1) {
+ stop("The confidence level must be between 0 and 1")
+ }
+ n <- length(x1)
+ dbar <- mean(x1 - x2)
+ s_d <- sd(x1 - x2)
+ tstat <- dbar/(s_d/sqrt(n))
+ pval <- 2 * (1 - pt(abs(tstat), n - 1))
+ pctile <- 1 - (1 - cl)/2
+ margin <- qt(pctile, n - 1) * s_d/sqrt(n)
+ lcl <- dbar - margin
+ ucl <- dbar + margin
+ return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
+ }
> paired_t(1:5, 2:6, cl = 15)
Error in paired_t(1:5, 2:6, cl = 15): The confidence level must be between 0 and 1
7.2.2 If else statements
The if()
statement we have used so far has the form
if (condition) {
expression
}
Often we want to evaluate one expression if the condition is true, and evaluate a different expression if the condition is false. That is accomplished by the if else
statement. Here we determine whether a number is positive, negative, or zero.
> Sign <- function(x) {
+ if (x < 0) {
+ print("the number is negative")
+ } else if (x > 0) {
+ print("the number is positive")
+ } else {
+ print("the number is zero")
+ }
+ }
> Sign(3)
[1] "the number is positive"
[1] "the number is negative"
[1] "the number is zero"
Notice the “different expression” for the first if
statement was itself an if
statement.
Next we modify the paired_t
function to allow two-sided and one-sided alternatives.
> paired_t <- function(x1, x2, cl = 0.95, alternative = "not.equal") {
+ if (length(x1) != length(x2)) {
+ stop("The input vectors must be of the same length")
+ }
+ if (cl <= 0 || cl >= 1) {
+ stop("The confidence level must be between 0 and 1")
+ }
+ n <- length(x1)
+ dbar <- mean(x1 - x2)
+ s_d <- sd(x1 - x2)
+ tstat <- dbar/(s_d/sqrt(n))
+ if (alternative == "not.equal") {
+ pval <- 2 * (1 - pt(abs(tstat), n - 1))
+ pctile <- 1 - (1 - cl)/2
+ margin <- qt(pctile, n - 1) * s_d/sqrt(n)
+ lcl <- dbar - margin
+ ucl <- dbar + margin
+ } else if (alternative == "greater") {
+ pval <- 1 - pt(tstat, n - 1)
+ margin <- qt(cl, n - 1) * s_d/sqrt(n)
+ lcl <- dbar - margin
+ ucl <- Inf
+ } else if (alternative == "less") {
+ pval <- pt(tstat, n - 1)
+ margin <- qt(cl, n - 1) * s_d/sqrt(n)
+ lcl <- -Inf
+ ucl <- dbar + margin
+ }
+
+ return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
+ }
> paired_t(corn$kiln_dried, corn$regular)
$tstat
[1] 1.69
$pval
[1] 0.1218
$lcl
[1] -10.73
$ucl
[1] 78.18
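A one-sided test against the "less" alternative gives a large p-value and an upper confidence bound:
> paired_t(corn$kiln_dried, corn$regular, alternative = "less")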
$tstat
[1] 1.69
$pval
[1] 0.9391
$lcl
[1] -Inf
$ucl
[1] 69.89
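And against the "greater" alternative:
> paired_t(corn$kiln_dried, corn$regular, alternative = "greater")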
$tstat
[1] 1.69
$pval
[1] 0.06091
$lcl
[1] -2.434
$ucl
[1] Inf
7.3 Computer Arithmetic
Like most software, R does not perform exact arithmetic. Rather, R follows the IEEE 754 floating point standards. This can have profound effects on how computational algorithms are implemented, but is also important when considering things like comparisons.
Note first that computer arithmetic does not follow some of the rules of ordinary arithmetic. For example, it is not associative.
[1] 0.0000000009313
[1] 0.0000000009313
[1] 0
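For instance, when a very small number is added to a much larger one, the small number can be lost to rounding, so the grouping of operations matters (the values below are illustrative):
> a <- 2^-30
> b <- 2^30
> a + (b - b)  # the small number survives: about 9.313e-10
> (a + b) - b  # a is lost when added to b: exactly 0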
Computer arithmetic is not exact.
[1] 0.1
[1] FALSE
[1] 0.00000000000000008327
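The same issue arises with simple decimal arithmetic. For example, 1.1 - 1 prints as 0.1 but is not exactly equal to 0.1:
> 1.1 - 1           # prints 0.1
> (1.1 - 1) == 0.1  # FALSE
> 1.1 - 1 - 0.1     # about 8.3e-17, not 0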
So for example an if statement that uses an equality test may not give the expected answer. One way to avoid this problem is to test “near equality” using all.equal(). The function takes as arguments two objects to be compared, and a tolerance. If the objects are within the tolerance of each other, the function returns TRUE. The tolerance has a default value of about \(1.5\times 10^{-8}\), which works well in many cases.
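For example:
> all.equal(1.1 - 1, 0.1)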
[1] TRUE
7.4 Loops
Loops are an important component of any programming language, including R. Vectorized calculations and functions such as apply()
make loops a bit less central to R than to many other languages, but an understanding of the three looping structures in R is still quite important.
We will investigate loops in the context of computing what is sometimes called the “machine epsilon.” Because of the inexact representation of numbers in R (and other languages), sometimes R cannot distinguish between the numbers 1 and |1 + x| for small values of x. The smallest value of x such that 1 and |1 + x| are not declared equal is the machine epsilon.
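For example:
> 1 + 10^(-4) == 1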
[1] FALSE
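In contrast:
> 1 + 10^(-50) == 1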
[1] TRUE
Clearly the machine epsilon is somewhere between \(10^{-4}\) and \(10^{-50}\). How can we find its value exactly? Since floating point numbers use a binary representation, we know that the machine epsilon will be equal to \(1/2^k\) for some value of \(k\). So to find the machine epsilon, we can keep testing whether \(1\) and \(1+1/2^k\) are equal, until we find a value \(k\) where the two are equal. Then the machine epsilon will be \(1/2^{k-1}\), since it is the smallest value for which the two are NOT equal.
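For example, trying a few values of k by hand (the particular values here are illustrative):
> 1 + 1/2^10 == 1
> 1 + 1/2^20 == 1
> 1 + 1/2^30 == 1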
[1] FALSE
[1] FALSE
[1] FALSE
Testing by hand like this gets tedious quickly. A loop can automate the process. We will do this first with two of R’s loop types, repeat and while.
7.4.1 A Repeat Loop
A repeat
loop just repeats a given expression over and over again until a break
statement is encountered.
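For example, the machine epsilon can be computed with a repeat loop like this (a sketch):
> k <- 1
> repeat {
+   if (1 + 1/2^k == 1) {
+     break
+   }
+   k <- k + 1
+ }
> k
> 1/2^(k - 1)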
[1] 53
[1] 0.000000000000000222
This code initializes k
at 1. The body of the loop initially checks whether \(1\) and \(1+1/2^k\) are equal. If they are equal, the break
statement is executed and control is transferred outside the loop. If they are not equal, k
is increased by 1, and we return to the top of the body of the loop.
7.4.2 A While Loop
A while
loop has the form
while (condition) {
expression
}
As long as the condition
is TRUE
the expression
is evaluated. Once the condition
is FALSE
control is transferred outside the loop.
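The same computation with a while loop (a sketch):
> k <- 1
> while (1 + 1/2^k != 1) {
+   k <- k + 1
+ }
> k
> 1/2^(k - 1)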
[1] 53
[1] 0.000000000000000222
7.4.3 A For Loop
A for
loop has the form
for (variable in vector) {
expression
}
The for
loop sets the variable
equal to each element of the vector
in succession, and evaluates the expression
each time. Here are two different ways to use a for
loop to calculate the sum of the elements in a vector.
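For example, summing the integers 1 through 10 (an illustrative vector), first by looping over positions:
> vec <- 1:10
> total <- 0
> for (i in 1:length(vec)) {
+   total <- total + vec[i]
+ }
> total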
[1] 55
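And then by looping over the elements themselves:
> total <- 0
> for (v in vec) {
+   total <- total + v
+ }
> total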
[1] 55
In the first case we loop over the positions of the vector elements, while in the second case we loop over the elements themselves.
7.4.4 Practice Problem
Often when students initially learn about if()
statements and for()
loops, they use them more often than is necessary, leading to overly complicated and often slower, less elegant code. Many simple if()
statements can be replaced by logical subsetting and vectorization. Using the ideas above, and what you learned in Chapters 4 and 6, we can use logical subsetting to replace the simplest of if()
statements. Rewrite the following if()
statement and for()
loop using logical subsetting.
7.5 Efficiency Considerations
In many contexts R and modern computers are fast enough that the user does not need to worry about writing efficient code. There are a few simple ways to write efficient code that are easy enough, and provide enough speed-up, that they are worth following as often as possible. The R function system.time()
reports how long a set of R code takes to execute, and we will use this function to compare different ways to accomplish objectives in R.
7.5.1 Growing Objects
Consider two ways to create a sequence of integers from 1 to n, implemented in functions f1 and f2.
- Start with a zero-length vector, and let it grow (f1 in the sketch below):
- Start with a vector of length \(n\) and fill in the values (f2 in the sketch below):
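Sketches of the two approaches might look like this:
> f1 <- function(n) {
+   x <- numeric(0)
+   for (i in 1:n) {
+     x <- c(x, i)  # grow the vector by one element each time
+   }
+   return(x)
+ }
> f2 <- function(n) {
+   x <- numeric(n)  # allocate the full vector up front
+   for (i in 1:n) {
+     x[i] <- i
+   }
+   return(x)
+ }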
Here are the two functions in action, with \(n = 100000\).
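For example:
> system.time(f1(100000))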
user system elapsed
11.393 0.036 11.432
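And the pre-allocated version:
> system.time(f2(100000))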
user system elapsed
0.014 0.000 0.015
It is much more efficient to start with a full-length vector and then fill in values.[37]
Of course another way to create a vector of integers from 1 to n is to use 1:n. Let’s see how fast this is.
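For example:
> system.time(1:100000)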
user system elapsed
0 0 0
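And with n equal to one million:
> system.time(1:1000000)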
user system elapsed
0 0 0
For \(n=100000\) this is so fast the system time is very close to zero. Even when \(n\) is 1000000 the time is very small. So another important lesson is to use built-in R functions whenever possible, since they have had substantial development focused on efficiency, correctness, etc.
7.5.2 Vectorization
Next consider calculating the sum of the squared entries in each column of a matrix. For example with the matrix \(M\),
\[ M = \left(\begin{array}{ccc} 1 & 2 & 3 \\ 4 & 5 & 6 \end{array}\right), \]
the sums would be \(1^2 + 4^2 = 17\), \(2^2 + 5^2 = 29\), and \(3^2 + 6^2 = 45\). One possibility is to have an outer loop that traverses the columns and an inner loop that traverses the rows within a column, squaring the entries and adding them together.
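First, a small test matrix (constructed here to match the printout below):
> test_matrix <- matrix(1:6, nrow = 2, byrow = TRUE)
> test_matrix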
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
> ss1 <- function(M) {
+ n <- dim(M)[1]
+ m <- dim(M)[2]
+ out <- rep(0, m)
+ for (j in 1:m) {
+ for (i in 1:n) {
+ out[j] <- out[j] + M[i, j]^2
+ }
+ }
+ return(out)
+ }
> ss1(test_matrix)
[1] 17 29 45
Another possibility eliminates the inner loop, using the sum()
function to compute the sum of the squared entries in the column directly.
> ss2 <- function(M) {
+ m <- dim(M)[2]
+ out <- numeric(m)
+ for (j in 1:m) {
+ out[j] <- sum(M[, j]^2)
+ }
+ return(out)
+ }
> ss2(test_matrix)
[1] 17 29 45
A third possibility uses the colSums()
function.
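For example:
> colSums(test_matrix^2)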
[1] 17 29 45
Here is a speed comparison, using a \(1000\times 10000\) matrix.
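The timings below are consistent with calls along these lines, where big_matrix is an illustrative name for a \(1000\times 10000\) matrix (its entries do not affect the comparison):
> big_matrix <- matrix(rnorm(1000 * 10000), nrow = 1000)
> system.time(ss1(big_matrix))
> system.time(ss2(big_matrix))
> system.time(colSums(big_matrix^2))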
user system elapsed
2.454 0.000 2.454
user system elapsed
0.080 0.016 0.096
user system elapsed
0.128 0.024 0.152
7.6 More on Functions
Understanding functions deeply requires a careful study of R’s scoping rules, as well as a good understanding of environments in R. That’s beyond the scope of this book, but we will briefly discuss some of the most salient issues. For a more in-depth treatment, see “Advanced R” by Hadley Wickham, especially the chapters on functions and environments.
7.7 Calling Functions
When using a function, the function's arguments can be specified in three ways:
- By the full name of the argument.
- By the position of the argument.
- By a partial name of the argument.
> tmp_function <- function(first.arg, second.arg, third.arg,
+ fourth.arg) {
+ return(c(first.arg, second.arg, third.arg, fourth.arg))
+ }
> tmp_function(34, 15, third.arg = 11, fou = 99)
[1] 34 15 11 99
Positional matching of arguments is convenient, but it should be used carefully and probably limited to the first few, most commonly used arguments in a function. Partial matching also has pitfalls: a partially specified argument must unambiguously match exactly one formal argument, a requirement that is not met below.
> tmp_function <- function(first.arg, fourth.arg) {
+ return(c(first.arg, fourth.arg))
+ }
> tmp_function(1, f = 2)
Error in tmp_function(1, f = 2): argument 2 matches multiple formal arguments
7.7.1 The ... argument
In defining a function, a special argument denoted by ...
can be used. Sometimes this is called the “ellipsis” argument, sometimes the “three dot” argument, sometimes the “dot dot dot” argument, etc. The R language definition https://cran.r-project.org/doc/manuals/r-release/R-lang.html describes the argument in this way:
The special type of argument ‘...’ can contain any number of supplied arguments. It is used for a variety of purposes. It allows you to write a function that takes an arbitrary number of arguments. It can be used to absorb some arguments into an intermediate function which can then be extracted by functions called subsequently.
Consider for example the sum()
function.
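For example (illustrative calls consistent with the output below):
> sum(1:5)
> sum(1:5, 97)
> sum(1:5, 97, 6)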
[1] 15
[1] 112
[1] 118
Think about writing such a function. There is no way to predict in advance the number of arguments a user might specify. So the function is defined with ...
as the first argument:
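Printing sum shows its argument list:
> sum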
function (..., na.rm = FALSE) .Primitive("sum")
This is true of many commonly used functions in R, such as c(), among others.
Next, consider a function that calls another function in its body. For example, suppose that a collaborator always supplies comma-delimited files that have five lines of description, followed by a line containing variable names, followed by the data. You are tired of having to specify skip = 5, header = TRUE, and sep = "," to read.table() and want to create a function my.read() which uses these as defaults.
> my.read <- function(file, header = TRUE, sep = ",", skip = 5,
+ ...) {
+ read.table(file = file, header = header, sep = sep,
+ skip = skip, ...)
+ }
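A call such as the following (the file name is hypothetical) reads such a file and passes stringsAsFactors = FALSE along to read.table():
> yields <- my.read("yield_data.csv", stringsAsFactors = FALSE)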
The ... in the definition of my.read() allows the user to specify other arguments, for example, stringsAsFactors = FALSE. These will be passed on to the read.table() function. In fact, that is how read.csv() is defined.
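Printing read.csv shows this:
> read.csv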
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x7fe63a1d5b60>
<environment: namespace:utils>
7.7.2 Lazy Evaluation
Arguments to R functions are not evaluated until they are needed; this is sometimes called lazy evaluation.
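Consider, for example, a function whose body prints three quantities; the name f and the first two printed expressions are illustrative, chosen to be consistent with the output below:
> f <- function(a, b) {
+   print(a^2)
+   print(a^3)
+   print(a * b)
+ }
> f(3, 2)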
[1] 9
[1] 27
[1] 6
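Now call the function without supplying b:
> f(3)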
[1] 9
[1] 27
Error in print(a * b): argument "b" is missing, with no default
The first call specified both of the arguments a and b, and produced the expected output. In the second call the argument b was not specified. Since it was not needed until the third print statement, R happily executed the first two print statements, and only reported an error in the third statement, when b was needed to compute a*b.
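Default values can also depend on other arguments; for example (a sketch, with illustrative arguments, consistent with the discussion below):
> g <- function(a, b = a^3) {
+   a * b
+ }
> g(2)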
[1] 16
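And with b specified (10 is an illustrative value):
> g(2, 10)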
[1] 20
In the first call, since b was not specified, it was computed as a^3. In the second call, b was specified, and the specified value was used.
7.8 Exercises
Exercise 10 Learning objectives: translate statistical notation into coded functions; learn about tools for checking validity of function arguments; practice writing functions that return multiple objects.
[37] Roughly speaking, the first option is slower because each time the vector is increased in size, R must resize the vector and re-allocate memory.