Chapter 9 Classification

Classification problems are quite common. For example, a spam filter is asked to classify incoming messages as spam or non-spam, based on factors such as the sender’s address, the subject of the message, the contents of the message, and so on. A doctor diagnosing a patient with one of four possible conditions based on symptoms, blood tests, and medical history is performing another form of classification. And a bank may want to determine (prior to giving a loan) whether an applicant will default on the loan, again based on several variables such as income, financial history, and so on.

Classification methods are both extremely useful and an active area of research in statistics. In this chapter we will learn about two common, and somewhat different, classification methods, logistic regression and \(k\) nearest neighbors. A very good introduction to classification and many other “Statistical Learning” methods is James et al. (2014). The abbreviated treatment in this chapter draws from James et al. (2014).

9.1 Logistic Regression

Logistic regression is widely used to relate a categorical response variable to one or more (continuous or categorical) predictors. Initially we will consider the simplest case, where the response \(Y\) has only two possible values (we’ll assume the values are \(0\) and \(1\)) and where there is only one continuous predictor \(X\). We would like to predict the value of \(Y\) based on the value of the predictor \(X\). Let \(p(X) = P(Y = 1 | X)\).40 We will model \(p(X)\) with the logistic function, which takes values between \(0\) and \(1\) and hence is a natural choice for modeling a probability:

\[ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}. \]

A graph of this function when \(\beta_0=0\) and \(\beta_1 = 1\) shows the characteristic shape.
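For example, a quick way to draw this curve in R (a minimal sketch; any plotting approach works):

curve(exp(x) / (1 + exp(x)), from = -6, to = 6,
      xlab = "X", ylab = "p(X)",
      main = "Logistic function with beta0 = 0, beta1 = 1")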

To make this more concrete, we will consider a data set on a population of women who were over 21 years of age, lived near Phoenix, Arizona, and were of Pima Indian heritage. The data included diabetes status (yes or no) and seven possible predictor variables such as age, number of pregnancies, body mass index, etc. The original data are from the National Institute of Diabetes and Digestive and Kidney Diseases. These data were cleaned and are available in the MASS package in R. In the package the data frame Pima.tr contains 200 randomly selected cases from the full data set, which we will use to find a classifier. The data frame Pima.te contains the remaining 332 cases, which we will use to test the classifier on new data.41

    npreg     glu      bp    skin     bmi     ped
1  0.4248 -1.1990 -0.2840 -0.1036 -0.3442 -0.3150
2  1.0189  2.2430 -0.1098  0.3228 -1.1761 -0.9692
3  0.4248 -1.4832  0.9356  1.0052  0.5693 -0.9920
4 -1.0605  1.2957  0.4129  1.1757  2.5431 -0.6567
5 -1.0605 -0.5359 -0.9809 -0.3595 -0.9641 -1.0669
6  0.4248 -0.8517  0.4129 -0.1889  0.5367 -0.2694
      age type
1 -0.7389   No
2  2.0856  Yes
3  0.2633   No
4 -0.5567   No
5 -0.8300   No
6  1.8122  Yes

It will be more convenient to code the presence or absence of diabetes by 1 and 0, so we first create another column in the data frame with this coding:
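A minimal sketch of this step, assuming the MASS package is loaded and that the predictors have already been standardized as in the display above (that preprocessing code is not shown here):

library(MASS)                                    # provides Pima.tr and Pima.te
Pima.tr$diabetes <- ifelse(Pima.tr$type == "Yes", 1, 0)
head(Pima.tr)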

    npreg     glu      bp    skin     bmi     ped
1  0.4248 -1.1990 -0.2840 -0.1036 -0.3442 -0.3150
2  1.0189  2.2430 -0.1098  0.3228 -1.1761 -0.9692
3  0.4248 -1.4832  0.9356  1.0052  0.5693 -0.9920
4 -1.0605  1.2957  0.4129  1.1757  2.5431 -0.6567
5 -1.0605 -0.5359 -0.9809 -0.3595 -0.9641 -1.0669
6  0.4248 -0.8517  0.4129 -0.1889  0.5367 -0.2694
      age type diabetes
1 -0.7389   No        0
2  2.0856  Yes        1
3  0.2633   No        0
4 -0.5567   No        0
5 -0.8300   No        0
6  1.8122  Yes        1

We will begin with glu as a predictor, and of course type as the response. So we want to find the values \(\beta_0\) and \(\beta_1\) which provide the best fit for the model

\[ P(\text{yes} | \text{glu}) = \frac{e^{\beta_0 + \beta_1 (\text{glu})}}{1 + e^{\beta_0 + \beta_1 (\text{glu})}} \]

Usually maximum likelihood methods are used to fit the model.

In R we will use the function glm to fit logistic regression models. The glm function fits a wide variety of models. To specify the logistic regression model we specify family = binomial as an argument to glm. We also must specify the predictor and response variables via the model formula, which in our case will be diabetes ~ glu to indicate that diabetes (i.e., type recoded) is the response and glu is the predictor; also we specify the data frame, which in our case is Pima.tr.
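The fit shown below can be reproduced with something like the following (the object name diabetes.glm.1 is an assumption; the formula, family, and data arguments are taken from the printed call):

diabetes.glm.1 <- glm(diabetes ~ glu, family = binomial, data = Pima.tr)
diabetes.glm.1                 # brief printout of the fit
summary(diabetes.glm.1)        # coefficient table, deviances, AIC
coef(diabetes.glm.1)           # just the estimated coefficients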


Call:  glm(formula = diabetes ~ glu, family = binomial, data = Pima.tr)

Coefficients:
(Intercept)          glu  
      -0.82         1.20  

Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
Null Deviance:      256 
Residual Deviance: 207  AIC: 211

Call:
glm(formula = diabetes ~ glu, family = binomial, data = Pima.tr)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.971  -0.779  -0.529   0.849   2.263  

Coefficients:
            Estimate Std. Error z value     Pr(>|z|)    
(Intercept)   -0.820      0.175   -4.67 0.0000029798 ***
glu            1.197      0.199    6.02 0.0000000018 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 256.41  on 199  degrees of freedom
Residual deviance: 207.37  on 198  degrees of freedom
AIC: 211.4

Number of Fisher Scoring iterations: 4
(Intercept)         glu 
    -0.8196       1.197 

The coefficients \(\beta_0\) and \(\beta_1\) are approximately -0.82 and 1.197, respectively. So for example we can estimate the probability that a woman in this population whose glucose level is 150 would be diabetic as

[1] 0.5488
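One hedged way to reproduce this calculation is shown below. Since glu in Pima.tr has been standardized, the value 150 must first be put on the same scale; the exact standardization used is not shown, so the mean and standard deviation below are assumptions.

glu.150 <- (150 - mean(MASS::Pima.tr$glu)) / sd(MASS::Pima.tr$glu)   # assumed standardization
predict(diabetes.glm.1, newdata = data.frame(glu = glu.150), type = "response")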

We can plot the fitted probabilities along with the data “by hand”.
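A minimal sketch of the “by hand” plot, assuming the fit diabetes.glm.1 from above:

glu.grid <- seq(min(Pima.tr$glu), max(Pima.tr$glu), length.out = 200)
p.hat <- predict(diabetes.glm.1, newdata = data.frame(glu = glu.grid),
                 type = "response")
plot(Pima.tr$glu, Pima.tr$diabetes,
     xlab = "glu (standardized)", ylab = "probability of diabetes")
lines(glu.grid, p.hat)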

The ggplot2 package also provides a way to do this more directly, using stat_smooth.
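A sketch of the ggplot2 version; stat_smooth() is asked to fit the same binomial glm:

library(ggplot2)
ggplot(Pima.tr, aes(x = glu, y = diabetes)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = binomial), se = FALSE)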

From these graphics we can see that although glucose level and diabetes are related, there are many women with high glucose levels who are not diabetic, and many with low glucose levels who are diabetic, so likely adding other predictors to the model will help.

Next let’s see how the model does in predicting diabetes status in the data we did not use for fitting the model. We will predict diabetes for anyone whose glucose level leads to a model-based probability greater than \(1/2\). First we use the predict function to compute the probabilities, and then use these to make predictions.
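A sketch of these two steps (the object name diabetes.probs.1 is an assumption):

head(Pima.te)                       # the test data
diabetes.probs.1 <- predict(diabetes.glm.1, newdata = Pima.te, type = "response")
head(diabetes.probs.1)              # fitted probabilities for the test data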

    npreg     glu       bp     skin      bmi     ped
1  0.7659  0.9423  0.02706  0.59882  0.04946  0.2714
2 -0.7568 -1.1232 -0.44171 -0.01669 -0.91169 -0.4883
3 -0.7568 -0.9921 -0.44171 -0.63219 -0.70573 -0.9948
4 -0.1477 -1.3527 -1.69178  0.29107 -0.30754 -0.7718
5 -0.4522  2.5488 -0.12920  1.62467 -0.37619 -1.0196
6  0.4614  1.5324  0.02706 -1.04253 -1.02154  0.1613
       age type
1  1.75661  Yes
2 -0.02973   No
3 -0.96992   No
4 -0.49983  Yes
5  2.03867  Yes
6  1.85063  Yes
      1       2       3       4       5       6 
0.57636 0.10308 0.11851 0.08031 0.90292 0.73380 

The predict function (with type = "response" specified) provides \(p(x) = P(Y = 1 | X = x)\) for all the \(x\) values in a data frame. In this case we specified the data frame Pima.te since we want to know how the model does in predicting diabetes in a new population, i.e., in a population that wasn’t used to “train” the model.
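A sketch of the classification step; the object name diabetes.predict.1 matches the confusion matrix below, and Pima.te$type supplies the true classes:

diabetes.predict.1 <- ifelse(diabetes.probs.1 > 0.5, "Yes", "No")
head(diabetes.predict.1)
table(diabetes.predict.1, Pima.te$type)        # confusion matrix
mean(diabetes.predict.1 == Pima.te$type)       # proportion classified correctly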

[1] "Yes" "No"  "No"  "No"  "Yes" "Yes"
                  
diabetes.predict.1  No Yes
               No  200  53
               Yes  23  56
[1] 0.7711

The table (sometimes called a confusion matrix) has the predictions of the model in the rows, so for example we see that the model predicts that \(200 + 53 = 253\) of the women will not be diabetic, and that \(23 + 56 = 79\) of the women will be diabetic. More interesting of course are the cells themselves. For example, of the \(200 + 23 = 223\) women who are not diabetic in Pima.te, the model correctly classifies 200, and misclassifies 23. A classifier that predicted perfectly for the test data would have zeros off the diagonal.

9.1.1 Adding Predictors

If we have \(p\) predictors \(X = (X_1, \dots, X_p)\), the logistic model becomes

\[ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}. \]

Although there is a lot more notation to keep track of, the basic ideas are the same as they were for the one predictor model. We will next see how adding bmi, the body mass index, affects predictions of diabetes.
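The two-predictor fit can be reproduced with something like the following (the object name diabetes.glm.2 is an assumption; the call matches the output below):

diabetes.glm.2 <- glm(diabetes ~ glu + bmi, family = binomial, data = Pima.tr)
diabetes.glm.2
summary(diabetes.glm.2)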


Call:  glm(formula = diabetes ~ glu + bmi, family = binomial, data = Pima.tr)

Coefficients:
(Intercept)          glu          bmi  
     -0.880        1.131        0.552  

Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
Null Deviance:      256 
Residual Deviance: 198  AIC: 204

Call:
glm(formula = diabetes ~ glu + bmi, family = binomial, data = Pima.tr)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.058  -0.757  -0.431   0.801   2.249  

Coefficients:
            Estimate Std. Error z value    Pr(>|z|)    
(Intercept)   -0.880      0.184   -4.78 0.000001784 ***
glu            1.131      0.200    5.66 0.000000015 ***
bmi            0.552      0.192    2.88       0.004 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 256.41  on 199  degrees of freedom
Residual deviance: 198.47  on 197  degrees of freedom
AIC: 204.5

Number of Fisher Scoring iterations: 4

Now we look at predictions from this model.
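A sketch paralleling the one-predictor case (object names assumed, except diabetes.predict.2, which appears in the confusion matrix below):

diabetes.probs.2 <- predict(diabetes.glm.2, newdata = Pima.te, type = "response")
head(diabetes.probs.2)
diabetes.predict.2 <- ifelse(diabetes.probs.2 > 0.5, "Yes", "No")
head(diabetes.predict.2)
table(diabetes.predict.2, Pima.te$type)
mean(diabetes.predict.2 == Pima.te$type)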

      1       2       3       4       5       6 
0.55307 0.06578 0.08383 0.07046 0.85756 0.57189 
[1] "Yes" "No"  "No"  "No"  "Yes" "Yes"
                  
diabetes.predict.2  No Yes
               No  203  50
               Yes  20  59
[1] 0.7892

Adding bmi as a predictor did not improve the predictions by very much!

Let \(x_1\) and \(x_2\) represent glucose and bmi levels. We classify a subject as “diabetic” if the fitted \(p(X)\) is greater than \(0.5\), i.e., if the fitted probability of diabetes is greater than the fitted probability of no diabetes. The boundary for our decision is where these two fitted probabilities are equal, i.e., where

\[ \frac{P(Y = 1 | (x_1, x_2))}{P(Y = 0 | (x_1, x_2))} = \frac{P(Y = 1 | (x_1, x_2))}{1 - P(Y = 1 | (x_1, x_2))} = 1. \]

Writing this out in terms of the logistic regression model and taking logarithms shows that the boundary is where the log odds \(\beta_0 + \beta_1 x_1 + \beta_2 x_2\) equal zero; solving for \(x_2\) gives the following (linear!) decision boundary:

\[ x_2 = -\frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2} x_1. \]

The diabetes training data along with the decision boundary are plotted below.
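A hedged sketch of such a plot, computing the intercept and slope of the boundary from the fitted coefficients (the object name diabetes.glm.2 is assumed):

cf <- coef(diabetes.glm.2)
boundary.intercept <- -cf["(Intercept)"] / cf["bmi"]
boundary.slope     <- -cf["glu"] / cf["bmi"]
library(ggplot2)
ggplot(Pima.tr, aes(x = glu, y = bmi, color = type)) +
  geom_point() +
  geom_abline(intercept = boundary.intercept, slope = boundary.slope)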

9.1.2 More than Two Classes

Logistic regression methods also are applicable to classification contexts where there are more than two classes. Consider for example Fisher’s iris data.

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4
  Species
1  setosa
2  setosa
3  setosa
4  setosa
5  setosa
6  setosa
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Here the potential predictors are sepal width, sepal length, petal width, and petal length. The goal is to find a classifier that will yield the correct species. From the scatter plots it should be pretty clear that a model with petal length and petal width as predictors would classify the data well. Although in a sense this is too easy, these data do a good job of illustrating logistic regression with more than two classes.

Before doing that we randomly choose 75 of the 150 rows of the data frame to be the training sample, with the other 75 being the test sample.
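A sketch of one way to form the split; the random seed actually used is not shown, so set.seed(1) below is only to make the sketch reproducible, and iris.test is an assumed name (iris.train appears in the model call later):

set.seed(1)                              # assumed seed, for reproducibility only
train.rows <- sample(1:150, 75)
iris.train <- iris[train.rows, ]
iris.test  <- iris[-train.rows, ]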

There are several packages which implement logistic regression for data with more than two classes. We will use the VGAM package. The function vglm within VGAM implements logistic regression (and much more).
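The fit below can be reproduced with something like the following (the object name iris.vglm is an assumption; the formula, family, and data arguments match the printed call):

library(VGAM)
iris.vglm <- vglm(Species ~ Petal.Width + Petal.Length, family = multinomial,
                  data = iris.train)
summary(iris.vglm)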


Call:
vglm(formula = Species ~ Petal.Width + Petal.Length, family = multinomial, 
    data = iris.train)

Pearson residuals:
                            Min           1Q       Median          3Q       Max
log(mu[,1]/mu[,3]) -0.00000181 -0.000000246 0.0000000436 0.000000063 0.0000164
log(mu[,2]/mu[,3]) -1.26528143 -0.000122333 0.0000006634 0.002992883 2.4706594

Coefficients: 
               Estimate Std. Error z value Pr(>|z|)
(Intercept):1    118.95    9260.55      NA       NA
(Intercept):2     63.51      34.00      NA       NA
Petal.Width:1    -27.34   14731.81       0        1
Petal.Width:2    -14.55       7.47      NA       NA
Petal.Length:1   -26.61    6453.20       0        1
Petal.Length:2    -8.24       5.59      NA       NA

Names of linear predictors: log(mu[,1]/mu[,3]), 
log(mu[,2]/mu[,3])

Residual deviance: 8.682 on 144 degrees of freedom

Log-likelihood: -4.341 on 144 degrees of freedom

Number of Fisher scoring iterations: 21 

Warning: Hauck-Donner effect detected in the following estimate(s):
'(Intercept):1', '(Intercept):2', 'Petal.Width:2', 'Petal.Length:2'


Reference group is level  3  of the response

Notice that the family is specified as multinomial rather than binomial since we have more than two classes. When run with these data, the vglm function returns several (about 20) warnings. These occur mainly because the classes are so easily separated, and most are suppressed above.

Next we compute the probabilities for the test data.
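A sketch of this step (object names assumed); with type = "response", predict() returns a matrix of fitted class probabilities, one column per species:

iris.probs <- predict(iris.vglm, newdata = iris.test, type = "response")
head(iris.probs)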

   setosa           versicolor
4       1 0.000000000010035925
5       1 0.000000000001598575
9       1 0.000000000001598575
11      1 0.000000000010035925
12      1 0.000000000063005974
14      1 0.000000000000001799
                                     virginica
4  0.00000000000000000000000000000001129030771
5  0.00000000000000000000000000000000078877050
9  0.00000000000000000000000000000000078877050
11 0.00000000000000000000000000000001129030771
12 0.00000000000000000000000000000016160727116
14 0.00000000000000000000000000000000000001748

At least for the first six cases, one probability is close to one and the other two are close to zero, reflecting the fact that this is an easy classification problem. Next we extract the actual predictions. For these, we want to choose the highest probability in each row. The which.max function makes this easy. Before applying it to the fitted probabilities, we illustrate its use. Note that which.max() returns the position of the highest value, not the highest value itself.

[1] 5
[1] 2
 4  5  9 11 12 14 
 1  1  1  1  1  1 
       4        5        9       11       12       14 
"setosa" "setosa" "setosa" "setosa" "setosa" "setosa" 

Next we create the confusion matrix.
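Putting the pieces together, a hedged sketch of how the predicted classes above and the confusion matrix below might be computed; the object name class.predictions matches the table, while iris.probs and iris.test are assumptions:

max.col.index <- apply(iris.probs, 1, which.max)           # position of the largest probability in each row
class.predictions <- colnames(iris.probs)[max.col.index]   # convert positions to species names
head(class.predictions)
table(class.predictions, iris.test$Species)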

                 
class.predictions setosa versicolor virginica
       setosa         26          0         0
       versicolor      0         19         1
       virginica       0          2        27

9.2 Nearest Neighbor Methods

Nearest neighbor methods provide a rather different way to construct classifiers, and have strengths (minimal assumptions, flexible decision boundaries) and weaknesses (computational burden, lack of interpretability) compared to logistic regression models.

In principle the idea is simple. Recall that the training set will have both \(x\) and \(y\) values known, while the test set will have only \(x\) values known. Begin by choosing a positive integer \(k\) which will specify the number of neighbors to use in classification. To classify a point \(x\) in the test set, find the \(k\) closest \(x\) values in the training set, and choose the class which has the highest representation among those \(k\) points. The algorithm is called kNN for “k Nearest Neighbors”.

For example, suppose that \(k=10\) and the 10 nearest neighbors to a test point \(x\) have classes \(1, 1, 3, 2, 3, 3, 3, 2, 3, 2\). Since there are five 3s, three 2s, and two 1s, the point is assigned to class 3. Suppose that for another test point \(x\) the 10 nearest neighbors have classes \(1,1,1,2,3,1,3,3,3,2\). In this case there are four 1s and four 3s, so there is a tie for the lead. The nearest neighbor algorithm then will choose between 1 and 3 at random.

Although in principle kNN is simple, some issues arise. First, how should \(k\) be chosen? There is not an easy answer, but it can help to think of the extreme values for \(k\).

The largest possible \(k\) is the number of observations in the training set. For example suppose that the training set has \(10\) observations, with classes \(1, 1, 1, 2, 2, 2, 3, 3, 3, 3\). Then for any point in the test set, the \(k=10\) nearest neighbors will include ALL of the points in the training set, and hence every point in the test set will be classified in class 3. This classifier has low (zero) variance, but probably has high bias.

The smallest possible \(k\) is 1. In this case, each point in the test set is put in the same class as its nearest neighbor in the training set. This may lead to a very non-smooth and high variance classifier, but the bias will tend to be small.

A second issue that is relatively easy to deal with concerns the scales on which the \(x\) values are measured. If for example one \(x\) variable has a range from 2 to 4, while another has a range from 2000 to 4000, the distance between the test and training data will be dominated by the second variable. The solution that is typically used is to standardize all the variables (rescale them so that their mean is 0 and their standard deviation is 1).

These and other issues are discussed in the literature on kNN, but we won’t pursue them further.

There are at least three R packages which implement kNN, including class, kknn, and RWeka. We will use class below.

An example from Hastie, Tibshirani, and Friedman (2009) will be used to give a better sense of the role of \(k\) in the kNN algorithm. The example uses simulated data and shows the decision boundaries for kNN with \(k=15\) and \(k=1\).42 Although the R code used to draw the displays is given below, focus mainly on the graphics produced, and what they tell us about kNN.

First the data are read into R and graphed. The predictors x1 and x2, while not standardized, have very similar standard deviations, so we will not standardize these data before applying kNN.

'data.frame':   200 obs. of  3 variables:
 $ x1   : num  2.5261 0.367 0.7682 0.6934 -0.0198 ...
 $ x2   : num  0.3211 0.0315 0.7175 0.7772 0.8673 ...
 $ class: int  0 0 0 0 0 0 0 0 0 0 ...

Next a large set of test data is created using the expand.grid function, which creates a data frame with all possible combinations of the arguments. First a simple example to illustrate the function, then the actual creation of the test set. The test set covers the range of the x1 and x2 values in the training set.
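A sketch of these steps; the small illustration matches the output below, while the data frame names (mix.data for the training data, grid.data for the grid) and the exact grid spacing are assumptions inferred from the later str() output:

expand.grid(x = c(1, 2), y = c(5, 3.4, 2))     # all combinations of x and y

min(mix.data$x1); max(mix.data$x1)             # range of the training predictors
min(mix.data$x2); max(mix.data$x2)
grid.data <- expand.grid(x1 = seq(-2.6, 4.2, by = 0.1),
                         x2 = seq(-2.0, 2.9, by = 0.1))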

  x   y
1 1 5.0
2 2 5.0
3 1 3.4
4 2 3.4
5 1 2.0
6 2 2.0
[1] -2.521
[1] 4.171
[1] -2
[1] 2.856

Next the kNN with \(k=15\) is fit. Notice that the first argument gives the \(x\) values in the training set, the next argument gives the \(x\) values in the test set, the third argument gives the \(y\) values (labels) from the training set. The fourth argument gives \(k\), and the fifth argument asks for the function to return the probabilities of membership (that is, the proportion of the nearest neighbors which were in the majority class) as well as the class assignments.
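A sketch of the call, with the arguments in the order just described (the object names mix.data, grid.data, and knn15 are assumptions):

library(class)
knn15 <- knn(train = mix.data[, c("x1", "x2")],
             test  = grid.data[, c("x1", "x2")],
             cl    = mix.data$class,
             k     = 15, prob = TRUE)
head(attr(knn15, "prob"))    # proportion of the 15 neighbors voting for the winning class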

[1] 0.6667 0.6667 0.6667 0.7333 0.7333 0.7333

Next the graphs are created. This is somewhat complex, since we want to plot the test data colored by the class they were assigned to by the kNN classifier as background, the training data (using a different symbol), and the decision boundary.
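A hedged sketch of the plotting step, adapted from the approach in the Stack Overflow post cited in the footnote; the exact construction the author used (including the prob_cls column shown below) is not reproduced here, and the class-1 probability is used instead to draw the 0.5 contour:

library(ggplot2)
prob.win <- attr(knn15, "prob")                          # probability of the winning class
prob.1   <- ifelse(knn15 == "1", prob.win, 1 - prob.win) # convert to probability of class 1
plot.df  <- data.frame(grid.data, class = knn15, prob.1 = prob.1)
ggplot() +
  geom_point(data = plot.df, aes(x1, x2, color = class),
             size = 0.5, alpha = 0.3) +                  # grid colored by predicted class
  geom_contour(data = plot.df, aes(x1, x2, z = prob.1),
               breaks = 0.5, color = "black") +          # decision boundary
  geom_point(data = mix.data, aes(x1, x2, color = factor(class)),
             shape = 1)                                  # training data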

'data.frame':   3450 obs. of  5 variables:
 $ x1      : num  -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
 $ x2      : num  -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
 $ prob    : num  0.667 0.667 0.667 0.733 0.733 ...
 $ class   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ prob_cls: num  1 1 1 1 1 1 1 1 1 1 ...
[1] "x1"    "x2"    "class"

Next we graph the decision boundary of kNN with \(k=1\).

[1] 1 1 1 1 1 1
'data.frame':   3450 obs. of  5 variables:
 $ x1      : num  -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
 $ x2      : num  -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
 $ prob    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ class   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ prob_cls: num  1 1 1 1 1 1 1 1 1 1 ...

9.2.1 kNN and the Diabetes Data

Next kNN is applied to the diabetes data. We will use the same predictors, glu and bmi that were used in the logistic regression example. Since the scales of the predictor variables are substantially different, they are standardized first. The value \(k=15\) is chosen for kNN.
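A hedged sketch of this analysis; knn_Pima matches the confusion matrix below, while the scaling step assumes the raw MASS versions of glu and bmi (if the predictors were already standardized, as in the earlier displays, scale() is redundant). Ideally the test predictors would be centered and scaled using the training means and standard deviations; the simpler version is shown here.

library(class)
train.x <- scale(Pima.tr[, c("glu", "bmi")])
test.x  <- scale(Pima.te[, c("glu", "bmi")])   # simple version: test set scaled by its own means/sds
knn_Pima <- knn(train = train.x, test = test.x, cl = Pima.tr$type, k = 15)
table(knn_Pima, Pima.te$type)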

        
knn_Pima  No Yes
     No  206  55
     Yes  17  54

At least in terms of the confusion matrix, kNN with \(k=15\) performed about as well as logistic regression for these data.

9.2.2 Practice Problem

Produce a figure that displays how the number of false positives produced from the kNN classifier for the diabetes data set changes for all integer values of \(k\) from 1 to 40. Use this graph to justify whether or not \(k = 15\) was a valid choice for the number of neighbors.

9.2.3 kNN and the iris Data

Now kNN is used to classify the iris data. As before we use petal length and width as predictors. The scales of the two predictors are not particularly different, so we won’t standardize the predictors. Unsurprisingly kNN does well in classifying the test set for a wide variety of \(k\) values.
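A hedged sketch of this analysis; the k values actually used for the three confusion matrices below are not shown, so k = 1, 5, and 15 are assumptions, and the two standard deviations are assumed to have been computed on the training sample:

library(class)
sd(iris.train$Petal.Width)
sd(iris.train$Petal.Length)
head(iris.test)
for (k in c(1, 5, 15)) {        # assumed k values
  knn_iris <- knn(train = iris.train[, c("Petal.Length", "Petal.Width")],
                  test  = iris.test[, c("Petal.Length", "Petal.Width")],
                  cl    = iris.train$Species, k = k)
  print(table(knn_iris, iris.test$Species))
}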

[1] 0.7286
[1] 1.672
    Sepal.Length Sepal.Width Petal.Length Petal.Width
144          6.8         3.2          5.9         2.3
140          6.9         3.1          5.4         2.1
36           5.0         3.2          1.2         0.2
38           4.9         3.6          1.4         0.1
58           4.9         2.4          3.3         1.0
50           5.0         3.3          1.4         0.2
       Species
144  virginica
140  virginica
36      setosa
38      setosa
58  versicolor
50      setosa
            
knn_iris     setosa versicolor virginica
  setosa         26          0         0
  versicolor      0         20         1
  virginica       0          1        27
            
knn_iris     setosa versicolor virginica
  setosa         26          0         0
  versicolor      0         19         1
  virginica       0          2        27
            
knn_iris     setosa versicolor virginica
  setosa         26          0         0
  versicolor      0         19         1
  virginica       0          2        27

9.3 Exercises

Exercise: Classification. Learning objectives: explore the logistic regression classification method; apply the kNN classification method; create confusion matrices to compare classification methods; plot classified data.

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer. http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. New York: Springer. https://books.google.com/books?id=at1bmAEACAAJ.


  40. For those not familiar with probability notation, the right side of this equation reads as “the probability that \(Y = 1\) given that we know the value of \(X\).”

  41. This practice of randomly separating data into training data and testing data is common for many classification methods.

  42. The graphs below use code adapted from http://stackoverflow.com/questions/31234621/variation-on-how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-f