Preface

This document is intended to be used by students in a graduate level introductory statistics sequence of courses. The primary intended audience has some basic familiarity with the R ecosystem and has used several data analytic functions. This audience also has had some exposure to SPSS methods for the analogous set of functions/procedures outlined here and the exposition makes reference, at times, to these SPSS approaches. Nonetheless, the document is written in such a way that a more general audience can find the document useful. It will be most helpful for researchers and students who are in early phases of a learning curve for the R language. It takes a functional approach rather than a programming approach and can be viewed as an applied document.

1 Introduction

Researchers who confront the prospect of learning to use R as a data analytic tool quickly realize the need to understand the three major domains of usage: 1. Data Import/Defnition and Modification (also called Data Wrangling in the Data Science world) 2. Data Visualization - both exploratory and presentation 3. Data Analysis - description, inference, and modeling

Of these, the first domain (especially data modification) may be the most uncomfortable, especially for those trained in commercial packages such as SPSS. This document focuses on some simple operations that are used with great frequency by virtually all researchers. The intent is NOT to give a comprehensive treatment of all Data Wrangling strategies, but rather a few basic ones that are likely to be confronted early in every research project’s data analysis.

Data frames in R are structured in an arrangement where rows are cases and columns are variables. Often, we need to sort the rows according to values of one or more variables. Users of popular commercial software find this operation simple (e.g., with mouse/menuing operations in SPSS or writing code/syntax in SPSS or SAS) to implement with the sorting cases function. Similarly, we often need to select a subset of cases (analogous to the select cases procedure in SPSS) or a subset of variables. In R, several approaches to these methods are possible. Two general strategies are outlined here. One is the use of base system functions and the other is the tidyverse approach.

In order to make this document a bit more contextually full, I also develop strategies for creation of new variables (both numeric and logicals).

1.2 The data set

Let’s use the mtcars data set available in R. Looking at the structure and the first few rows gives us a sense of the data set. Notice that all variables are numeric. The “am” variable is transmission type, automatic or manual, coded 1 and zero respectively. This should actually be seen as a factor, and later we will change it to a factor and apply value labels. Exact specification of what the variables represent can be found by doing a help on the data frame: ?mtcars.

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Notice that the table above appears to have the car model as a row name, rather than as a variable. And if we use the gt function to display the data frame (or part of it using the headTail function) as a nicer table, then we don’t see those row names. This may or may not be what we want. It is possible to include the row names (car makes) as a variable and we will see how to do that in a section below where we learn to create new variables.

In addition, the reader will notice that this document variously uses gt or kable to obtain nicer formatting of tables and data frame listing. The choice is usually based on whether I wished to show the row names (vehicle makes) as part of the table (with kable) or not (with gt)

mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.88 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.21 19.44 1 0 3 1
... ... ... ... ... ... ... ... ... ... ...
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

2 Data Frame Operations using base system funtions

All of the operations we consider in this document can be done with base system functions. In a later section we will also consider how to do the same things with tidyverse approaches

2.1 Selecting variables or cases - subsetting using the [ ] notation

Two primary approaches to subsetting are available with base system functions. A notation system using brackets, [ ] is capable of handling the rows or column selection often required when working with data frames. However, it is much more broadly capable of working with atomic vectors, matrices, lists and arrays. The second approach below, using the subset function, may be a simpler approach when working with data frames - as is the case with the most common analyses we encounter.

A brief introduction to the logical structure of this notation is helpful at this point. A pair of brackets (e.g., [ , ]) is most often associated with a data frame name. So, when we see something like mtcars[x,y], the x refers to something related to rows of the data frame and the y refers to something about columns. An series of illustrations can help. First, see that mtcars[2:4,] returns the 2nd through fourth rows of the data frame. The fact that nothing is specified after the comma implies that all columns will be returned.

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Similarly, mtcars[,1:4] would return all rows but only the first through fourth columns. Since this would be N rows long, I suppressed the output.

These operations, using the bracketing notation are called subsetting. We can simultaneously subset the data frame for both rows and columns. Note that here, I selected non-adjacent rows and non-adjacent columns.

##                   mpg cyl  disp drat am
## Mazda RX4        21.0   6 160.0 3.90  1
## Hornet 4 Drive   21.4   6 258.0 3.08  0
## Toyota Corolla   33.9   4  71.1 4.22  1
## Toyota Corona    21.5   4 120.1 3.70  0
## Dodge Challenger 15.5   8 318.0 2.76  0
## AMC Javelin      15.2   8 304.0 3.15  0
## Camaro Z28       13.3   8 350.0 3.73  0

2.1.1 Select only a few of the Variables

In order to simplify our considerations, let’s subset this full data frame, keeping only a few of the variables so that our orderings are more visible. Note that “am” is a code for automatic vs manual transmission. Notice that standard R practice would be to create a new object for this subsetted data frame. The word subset is used in R in lieu of “select”, at least in base systems operations.

mpg cyl disp hp wt qsec am
21 6 160 110 2.62 16.46 1
21 6 160 110 2.88 17.02 1
22.8 4 108 93 2.32 18.61 1
21.4 6 258 110 3.21 19.44 0
... ... ... ... ... ... ...
15.8 8 351 264 3.17 14.5 1
19.7 6 145 175 2.77 15.5 1
15 8 301 335 3.57 14.6 1
21.4 4 121 109 2.78 18.6 1

2.1.2 Select a subset of cases

And to further simplify, for some initial illustrations, lets also subset only those cars that have 6 cylinder engines. This is effectively a “select cases” operation.

But first lets count the number of cases for each value of cylinder.

Var1 Freq
4 11
6 7
8 14

Now we know to expect seven cases from our row subsetting operation. The several sorting operations carried out below all begin with this data frame (data2). Here, we implement the selection by using the which function, nested inside the brackets. This approach provides an efficient way to conditionalize the output. The user needs to have some familiarity with the concept of using two equal signs here. The double equal sign == is a relational operator (used analogously to > or <, or <=, for example). The double equal sign is needed to differentiate this conditional/relational operation from the situation where a single equal sign (=) means assignment (equivalent to <-) and is also used to specify the value of an argument passed to a function (e.g., dbinom(x=4, size=25, prob=.5)). It helps to understand the double equal sign as meaning “exactly equal to.”

mpg cyl disp hp wt qsec am
21.0 6 160.0 110 2.620 16.46 1
21.0 6 160.0 110 2.875 17.02 1
21.4 6 258.0 110 3.215 19.44 0
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
17.8 6 167.6 123 3.440 18.90 0
19.7 6 145.0 175 2.770 15.50 1

2.1.3 Select both subsets of cases and variables

We saw, above, how to select a subset of both cases and variables by using their numeric position. (code duplicated here without printing the result)

What if we want to conditionally subset the cases on a value of a variable (the 6 cylinder cars, for example) and also select a subset of variables? With the following code, an additional way of subsetting the variables is exemplified with a method that permits using the variable names rather than numeric positions.

mpg cyl disp hp wt qsec am
21.0 6 160.0 110 2.620 16.46 1
21.0 6 160.0 110 2.875 17.02 1
21.4 6 258.0 110 3.215 19.44 0
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
17.8 6 167.6 123 3.440 18.90 0
19.7 6 145.0 175 2.770 15.50 1

This approach, using the ‘%in%’ operator effectively requests a subset of variables by choosing them from the list of “names” of variables associated with the data frame.

2.2 Selecting variables or cases using subset

Although the bracketing notation system, once learned, is rapid and effective, the subset function may be easier for the novice to master - in the limited situaion of working with data frames.

This section duplicates the selections outlined above where new data frames “data1”, and “data2” were produced. To parallel that notation, I will call them “data1b” and “data2b” here.

2.2.1 Select only a few of the Variables using subset

In order to simplify our considerations, let’s subset this full data frame, keeping only a few of the variables so that our orderings are more visible. The same set of variables are selected as above. Note the use of the colon to indicate a range of variables from their original order in the initial data frame.

mpg cyl disp hp wt qsec am
21 6 160 110 2.62 16.46 1
21 6 160 110 2.88 17.02 1
22.8 4 108 93 2.32 18.61 1
21.4 6 258 110 3.21 19.44 0
... ... ... ... ... ... ...
15.8 8 351 264 3.17 14.5 1
19.7 6 145 175 2.77 15.5 1
15 8 301 335 3.57 14.6 1
21.4 4 121 109 2.78 18.6 1

2.2.2 Select a subset of cases using subset

Once again, lets also subset only those cars that have 6 cylinder engines. This is effectively a “select cases” operation.

But first lets count the number of cases for each value of cylinder. We know to expect seven cases from this selection based on the parallel steps we took above.

mpg cyl disp hp wt qsec am
21.0 6 160.0 110 2.620 16.46 1
21.0 6 160.0 110 2.875 17.02 1
21.4 6 258.0 110 3.215 19.44 0
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
17.8 6 167.6 123 3.440 18.90 0
19.7 6 145.0 175 2.770 15.50 1

It does seem to be a more straightforward approach using subset, but it can be even more efficient. We can combine the selection of cases and variables in one application of the subset function.

mpg cyl disp hp wt qsec am
21.0 6 160.0 110 2.620 16.46 1
21.0 6 160.0 110 2.875 17.02 1
21.4 6 258.0 110 3.215 19.44 0
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
17.8 6 167.6 123 3.440 18.90 0
19.7 6 145.0 175 2.770 15.50 1

2.3 Random Samples of cases

At times, the need may arise for randomly sampling a subset of cases. The sample function provides this capability.

mpg cyl disp hp drat wt qsec vs am gear carb
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

It is possible to use the subset function to perform this case sampling action, but it is more complicated.

## [1] "Ferrari Dino"       "Fiat 128"           "Merc 450SL"        
## [4] "Hornet 4 Drive"     "AMC Javelin"        "Camaro Z28"        
## [7] "Mazda RX4"          "Cadillac Fleetwood"
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

Note that the dplyr approach outlined below is simpler than both of these methods illustrated here.

2.4 Sorting the data frame using base system methods

With this simplified data frame (data2) we will more readily be able to see the results of our sorting operations. For reference, here is that simplified data frame created above:

mpg cyl disp hp wt qsec am
21.0 6 160.0 110 2.620 16.46 1
21.0 6 160.0 110 2.875 17.02 1
21.4 6 258.0 110 3.215 19.44 0
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
17.8 6 167.6 123 3.440 18.90 0
19.7 6 145.0 175 2.770 15.50 1

The most basic sorting operation is to sort a single numeric variable. Here, we sort the whole data frame as a function of car weight. Note that the default condition for the order function is to sort in ascending order. This generic usage of “bracketing” code is central for a variety of r operations using data frames.

mpg cyl disp hp wt qsec am
21.0 6 160.0 110 2.620 16.46 1
19.7 6 145.0 175 2.770 15.50 1
21.0 6 160.0 110 2.875 17.02 1
21.4 6 258.0 110 3.215 19.44 0
19.2 6 167.6 123 3.440 18.30 0
17.8 6 167.6 123 3.440 18.90 0
18.1 6 225.0 105 3.460 20.22 0

2.4.1 Sorting in descending order by adding an argument.

mpg cyl disp hp wt qsec am
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
17.8 6 167.6 123 3.440 18.90 0
21.4 6 258.0 110 3.215 19.44 0
21.0 6 160.0 110 2.875 17.02 1
19.7 6 145.0 175 2.770 15.50 1
21.0 6 160.0 110 2.620 16.46 1

2.4.2 Sorting with more than one variable

We can sort simultaneously on more than one variable. This might be easiest to see using the “am” variable as one of the sorted ones. In this illustration with “am” and “mpg”, it makes more sense to sort “am” first.

mpg cyl disp hp wt qsec am
17.8 6 167.6 123 3.440 18.90 0
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
21.4 6 258.0 110 3.215 19.44 0
19.7 6 145.0 175 2.770 15.50 1
21.0 6 160.0 110 2.620 16.46 1
21.0 6 160.0 110 2.875 17.02 1

If we want to sort “am” as descending and mpg as ascending, we can still use just one implementation of the order function. However, to denote a descending/decreasing sort order, we put a minus sign in front of the variable name.

mpg cyl disp hp wt qsec am
19.7 6 145.0 175 2.770 15.50 1
21.0 6 160.0 110 2.620 16.46 1
21.0 6 160.0 110 2.875 17.02 1
17.8 6 167.6 123 3.440 18.90 0
18.1 6 225.0 105 3.460 20.22 0
19.2 6 167.6 123 3.440 18.30 0
21.4 6 258.0 110 3.215 19.44 0

2.4.3 Create New Variables. E.g., Transformations.

The need often arises for scale transformations. This class of operations is what some of you will be familiar with as the COMPUTE procedure in SPSS.

If we examined the distribution of the “horsepower” variable in the mtcars data frame (<hist(mtcars$hp)>) we would find it to be positively skewed. We might want to do a scale transform to normalize it and might find that a square root transformation works well. How can we create this square root transformation and add the variable to the data frame? Let’s do it with the minimized data frame we called “data2” above (the 6 cyl cars and only the subset of seven variables).

Even though the code for this operation is not challenging, readers may find that web searches for this method sometimes turn up a recommendation for a preceding step. The idea is to “initialize” the new variable by creating it with missing data for all cases. I find this unnecessaary, so I’ve commented it out as the first of the two lines of code that follow. The key idea here is that the arithmetic change to “hp” uses standard R code for the mathematical operator for square root. Essentially, we have “newvar” equals “old var” raised to the one-half power.

mpg cyl disp hp wt qsec am sqrthp
21.0 6 160.0 110 2.620 16.46 1 10.48809
21.0 6 160.0 110 2.875 17.02 1 10.48809
21.4 6 258.0 110 3.215 19.44 0 10.48809
18.1 6 225.0 105 3.460 20.22 0 10.24695
19.2 6 167.6 123 3.440 18.30 0 11.09054
17.8 6 167.6 123 3.440 18.90 0 11.09054
19.7 6 145.0 175 2.770 15.50 1 13.22876

2.4.4 Create New Variables. Conditionalizing new variable values (Recoding)

Often we wish to create new variables contingent on values of other variables. In the MTCARS data set, the “carb” variable is number of carburetors. It is coded as a numeric varialbe but could just as easily be a factor. Here, I create a new variable that is a factor with levels “high”, “medium”, and “small” to indicate groups of cars with carbureators in those subjectively created ranges. For this illustration the whole MTCARS data frame is used rather than the subsetted one from the illustrations above. First, a duplicate of the data frame is created just to separate this illustration from others in the document.

mpg cyl disp hp drat wt qsec vs am gear carb carbfactor
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 middle
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 middle
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 low
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 low
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 low
18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 low
At first glance, this looks ok, However, the new variable, “carbfactor” is actually only a character vector at this point.

## 'data.frame':    32 obs. of  12 variables:
##  $ mpg       : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl       : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp      : num  160 160 108 258 360 ...
##  $ hp        : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat      : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt        : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec      : num  16.5 17 18.6 19.4 17 ...
##  $ vs        : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am        : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear      : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb      : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ carbfactor: chr  "middle" "middle" "low" "low" ...

In order to convert it to a factor we can use the as.factor function.

This all works just fine, but is a bit laborious. An alternative that is more efficient is to use the recode function from the car package written by John Fox. It is used by a great many R analysts.

2.4.5 Using recode to recode/create new variables.

The example above with carburetors is one where we wanted to recode a numeric into a factor. This a can be accomplished with recode. Lets do the “carb” to “carbfactor” recode with the recode function in the car package.

Note that both dplyr and car have a recode function. Since both packages are loaded in this doc, I have explicitly called the car package recode here with the car:: syntax.

The use of car::recode is probably easier than the approach taken in the previous section, but the tidyverse approach (below) may be equally simple to learn.

mpg cyl disp hp drat wt qsec vs am gear carb carbfactor
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 middle
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 middle
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 low
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 low
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 low
18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 low

2.4.6 Recoding is a large topic

In addition, we need to broaden perspectives a bit. We have several potential related needs:

  • Recoding into the same variable
  • Recoding into a different variable
  • Recoding a numeric into a factor
  • Recoding a factor into a factor
  • Recoding a numeric into different numeric values or ranges.

There are many R functions in many packages that address this suite of needs. It is helpful to be aware of them, but this document has limited the illustrations to the ones above, and the dplyr approach below. The following functions are worthwhile exploring (in addition to the base system bracketing notation approach):

  • base
    • cut function
    • ifelse function
  • car
    • recode function
  • dplyr
    • recode function
    • if_else function
    • mutate function
  • sjmisc
    • rec function
    • rec_to function
    • rec_to_if)

The following URLs provides a starting point to describe the possibilities with some of these functions:

http://dwoll.de/rexrepos/posts/recode.html

http://www.cookbook-r.com/Manipulating_data/Recoding_data/

2.5 A post script on base system subsetting and data wrangling

It is important to realize that both the bracketing notation system and the subset function are designed for a far broader range of applications than just the data frame tasks we have outlined here (such as operations on vectors and matrices). Eventually, the user may find some pitfalls in suing the bracketing notation for recoding. For example, examine the results from this sequence of code and compare what the bracketing and subset approaches produce.

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0  NA  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## NA               NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

The product of the subset function (2nd illustration) would probably be what one expected and would have wanted. The product of the bracketd notation does something completely unexpected (the “NA” row). Rather than explain why (long story), it is useful to see this illustration as a result of the fact that the bracketing notation has nuanced meaning depending on its application. The issue with missing data is a reflection of that.

Perhaps the take-home message is that subset is less prone to problems in the types of applications that researchers most commonly encounter with data frames.

However, there is an easy way to fix the problem demonstrated above:

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

This issue is partly why I subset the data frame to six cylinder cars using the which function in conjunction with the bracketing notation when I first showed the select cases approach above (code repeated here).

3 Data Frame Operations with tidyverse tools

The base system methods above, using subsetting practices, the bracketing notation and the subset function may be non-intuitive to the R novice, but they work well once learned. Alternative approaches are available with tools from the dplyr package. These functions and notational structure may be simpler for the novice.

Here is a listing of the primary dplyr tools and their descriptions:

Functions Description
select() select columns
filter() filter rows
arrange() re-order or arrange rows
mutate() create new columns/variables
summarise() summarize values
group_by() allows for group operations in the split-apply-combine family
recode recode values of many kinds of variables and vectors
if_else use as a replacement for the base system ifelse function

3.1 Selecting a subset of variables using the select function from dplyr

We will begin with the same mtcars data frame and subset it so that we use the identical set of variables as above.

This is a bit simpler than the bracketing notation. Make sure to notice that the first argument passed to select is the name of the data frame to be subject to selection - then we just name the variables to be selected.

mpg cyl hp disp wt qsec am
21 6 110 160 2.62 16.46 1
21 6 110 160 2.88 17.02 1
22.8 4 93 108 2.32 18.61 1
21.4 6 110 258 3.21 19.44 0
... ... ... ... ... ... ...
15.8 8 264 351 3.17 14.5 1
19.7 6 175 145 2.77 15.5 1
15 8 335 301 3.57 14.6 1
21.4 4 109 121 2.78 18.6 1

Sometimes it is useful to select a subset of variables by indicating which variables are to be excluded, rather than included as we just did above. This can be done by placing a minus sign in front of the names of the variables to be excluded.

Same result, different way of getting there:

mpg cyl disp hp wt qsec am
21 6 160 110 2.62 16.46 1
21 6 160 110 2.88 17.02 1
22.8 4 108 93 2.32 18.61 1
21.4 6 258 110 3.21 19.44 0
... ... ... ... ... ... ...
15.8 8 351 264 3.17 14.5 1
19.7 6 145 175 2.77 15.5 1
15 8 301 335 3.57 14.6 1
21.4 4 121 109 2.78 18.6 1

3.2 Selecting Cases using the filter function from dplyr

Selecting cases (rows) is accomplished using the filter function from dplyr. It has the same syntax as select where the first argument to be passed is the name of the data frame, and then the specification of variable values on which to base the selection. This data8 data frame is identical to the data2 data frame that we used above in the base systems section We will do the same sorting on this dataframe that we did above (after detour in the section below this data8 creation)

mpg cyl hp disp wt qsec am
21.0 6 110 160.0 2.620 16.46 1
21.0 6 110 160.0 2.875 17.02 1
21.4 6 110 258.0 3.215 19.44 0
18.1 6 105 225.0 3.460 20.22 0
19.2 6 123 167.6 3.440 18.30 0
17.8 6 123 167.6 3.440 18.90 0
19.7 6 175 145.0 2.770 15.50 1

Also notice that we might select on the basis of more than one variable’s values. Here, as an illustration, we select cases that have manual transmissions, and have mpg less than 20.

mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

3.3 Using the arrange function for sorting data frames.

If we need to sort whole data frames according to one or more varialbes’ values, the arrange() function in dplyr is our tidyverse tool of choice. These illustrations mirror what we did to produce data3 through data6 above.

First, lets sort the data8 data frame on the “wt” variable.

mpg cyl hp disp wt qsec am
21.0 6 110 160.0 2.620 16.46 1
19.7 6 175 145.0 2.770 15.50 1
21.0 6 110 160.0 2.875 17.02 1
21.4 6 110 258.0 3.215 19.44 0
19.2 6 123 167.6 3.440 18.30 0
17.8 6 123 167.6 3.440 18.90 0
18.1 6 105 225.0 3.460 20.22 0

Next, sort on “wt” once again, but in descending order this time. Note the descending order specification with the minus sigh. We could also have specified desc(wt).

mpg cyl hp disp wt qsec am
18.1 6 105 225.0 3.460 20.22 0
19.2 6 123 167.6 3.440 18.30 0
17.8 6 123 167.6 3.440 18.90 0
21.4 6 110 258.0 3.215 19.44 0
21.0 6 110 160.0 2.875 17.02 1
19.7 6 175 145.0 2.770 15.50 1
21.0 6 110 160.0 2.620 16.46 1

Next, we repeat the data5 sort that sorted both on “am” and “mpg”, both ascending (ascending is the default).

mpg cyl hp disp wt qsec am
17.8 6 123 167.6 3.440 18.90 0
18.1 6 105 225.0 3.460 20.22 0
19.2 6 123 167.6 3.440 18.30 0
21.4 6 110 258.0 3.215 19.44 0
19.7 6 175 145.0 2.770 15.50 1
21.0 6 110 160.0 2.620 16.46 1
21.0 6 110 160.0 2.875 17.02 1

Finally, we will repeat the data6 sort by sorting on “am”, descending, and “mpg” ascending:

mpg cyl hp disp wt qsec am
19.7 6 175 145.0 2.770 15.50 1
21.0 6 110 160.0 2.620 16.46 1
21.0 6 110 160.0 2.875 17.02 1
17.8 6 123 167.6 3.440 18.90 0
18.1 6 105 225.0 3.460 20.22 0
19.2 6 123 167.6 3.440 18.30 0
21.4 6 110 258.0 3.215 19.44 0

3.4 Creating new variables with the mutate function from dplyr

New variables can be created and added to data frames simply with the mutate function. Two examples are provided here. First, lets do a numeric transformation of a variable. If we examined the distribution of the “horsepower” variable in the mtcars data frame (<hist(mtcars$hp)>) we would find it to be positively skewed. We might want to do a scale transform to normalize it and might find that a square root transformation works well. How can we create this square root transformation and add the variable to the data frame? Let’s do it with the minimized data frame we called “data8” above (the 6 cyl cars and only the subset of seven variables).

mpg cyl hp disp wt qsec am sqrt.hp
21.0 6 110 160.0 2.620 16.46 1 10.48809
21.0 6 110 160.0 2.875 17.02 1 10.48809
21.4 6 110 258.0 3.215 19.44 0 10.48809
18.1 6 105 225.0 3.460 20.22 0 10.24695
19.2 6 123 167.6 3.440 18.30 0 11.09054
17.8 6 123 167.6 3.440 18.90 0 11.09054
19.7 6 175 145.0 2.770 15.50 1 13.22876

A second example with mutate does something a little bit different. Please recall that the original mtcars data frame has rownames that are the make/model of the car. There may be instances where we would rather have those text descriptors as a variable (a string variable or factor). We can achieve that with mutate.

mpg cyl disp hp drat wt qsec vs am gear carb make
21 6 160 110 3.9 2.62 16.46 0 1 4 4 Mazda RX4
21 6 160 110 3.9 2.88 17.02 0 1 4 4 Mazda RX4 Wag
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 Datsun 710
21.4 6 258 110 3.08 3.21 19.44 1 0 3 1 Hornet 4 Drive
... ... ... ... ... ... ... ... ... ... ... NA
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 Ford Pantera L
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 Ferrari Dino
15 8 301 335 3.54 3.57 14.6 0 1 5 8 Maserati Bora
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 Volvo 142E

3.5 Recoding variables with the recode function in dplyr

This section duplicates the recode application that we did above with the car function also called recode. The dplyr recode function is used with similar logic. To be explicit, I accomplished recodes for individual values of carb by using separate arguments rather than ranges of values. Once again, an additional step is required to specify the recoded variable as a factor, and we can doublecheck with ‘str’.

## 'data.frame':    32 obs. of  12 variables:
##  $ mpg       : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl       : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp      : num  160 160 108 258 360 ...
##  $ hp        : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat      : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt        : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec      : num  16.5 17 18.6 19.4 17 ...
##  $ vs        : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am        : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear      : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb      : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ carbfactor: Factor w/ 3 levels "high","low","middle": 3 3 2 2 2 2 3 2 2 3 ...
mpg cyl disp hp drat wt qsec vs am gear carb carbfactor
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 middle
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 middle
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 low
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 low
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 low
18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 low

As was indicated in the earlier recode section in this document, there are many different types of applications of recoding that may be required. This section was meant only as a brief introduction to the dplyr::recode approach. The reader might consult the following URL for a broader set of illustrations using dplyr::recode:

https://dplyr.tidyverse.org/reference/recode.html

In addition to the dplyr::recode approach, the tidyverse provides several other possibilities for various recoding capabilities.

  • dplyr
    • recode function
    • if_else function
    • mutate function
  • forcats
    • fct_recode function plus many others for working with factors

See the following URL for information on the forcats package which is a tidyverse package that provides many helper functions for working with factors.

https://cran.r-project.org/web/packages/forcats/forcats.pdf

3.6 Pipes

In the world of tidyverse programming, there is a growing approach to writing code that is rather different than what you have seen from many of our examples over the course of the apsy510/511 courses. This involves the use of what are called pipes. To illustrate, lets assume we wanted to accomplish five things with our original “mtcars” data frame: 1. select only the 6 cylinder models 2. include only the “mpg”, “wt”, “hp”, and “am” variables 3. Calculate the square root of “hp” 4. Add value labels for the “am” variable (transmission type) 5. Summarize the “mpg” and “hp” variables as a function of the “am” variable (transmission type)

In base system code, we would create several intermediate data frames to acheive the final one on which to use tapply to obtain the group means.

##        0        1 
## 12.47573 10.81007

This approach (above) is comfortable, in some ways. It permits creation of the intermediate objects that can then be used for other purposes. But it is a bit inefficient and in some ways is the reverse of the way of thinking about extracting the information in a more natural sequence. The use of “pipes” is a way to be efficient and more direct in the process. We recognize that the goal was to obtain two group means on a created variable in a subset of the original data set. The “pipes” operator is %>%. One can read the following code in a manner that says, “with the mtcars3 data frame, first use ‘filter’ to choose six cylinder cars, then use ‘mutate’, then use ‘select’, then use ‘group_by’, and then summarize”. This singular flow accomplishes five steps in one process.

## # A tibble: 2 x 2
##   am        mean_sqrt.hp
##   <fct>            <dbl>
## 1 automatic         10.7
## 2 manual            11.4