Workshop 2 Packages & Data Manipulation in R
2.1 Before we get started
2.1.1 Recap: What we learned in the previous tutorial
In the last tutorial, we learned a few things to get you started in using R:
- How to download R and R studio
- How to begin using an R Markdown
- Basic functions in R
- How to import data into R and view that data
- How to look at summary statistics for individual variables
Having trouble remembering what exactly an R Markdown is? Want some more resources for learning R? Check out the End Notes at the bottom of this workshop.
2.2 Part I: The Universe of Packages
A package is a collection of related functions developed by people in the R community. Sometimes, a “package” can be a collection of other packages that provide useful functions. These are the workhorses of R, and probably every time you use R you will load in packages in order to help you do what you are trying to do with your data.
Herein lies one of the greatest advantages of using R: because R is open source, really smart people in the R community can write packages with useful functions, which means practically anyone can create content for R. As they do, new functionalities and additions become available in R, often faster than in other statistical programs, such as SPSS or SAS.
If you are interested in learning more about packages and are looking for recommendations of a few useful ones, check out this website. It covers what we have talked about here and more!
2.2.1 How do I download and use packages?
2.2.1.1 Loading packages
Because there are many, many packages made for R, only a few core packages (made by the original R developers) are included with your basic R program. All other packages need to be downloaded. You only need to install a package once, but you need to load it every time you start R, using the library() function. Why have this two-step process? install.packages() downloads the files from the big online R repository, called CRAN, and places them in a location on your computer where R can find them, while library() makes it so that, in your current R session, you can use all of the functions defined in those packages. You can think of it this way: install.packages() is like buying the toolkit you need for your project from the store, and library() is like taking out the tools you need at a given time.
To use install.packages(), you must put the package name in quotation marks (e.g., install.packages("ggplot2")). library(package_name) does not need quotation marks and loads the package so that you can use the functions within it. So, to be clear: we only need to install packages once, but each time we start a new R session, we need to reload each package we will be using. There are also some useful functions for checking out and learning more about the packages in your library, detailed below.
# install.packages("tidyverse")
# install.packages("ggplot2")
# install.packages("Hmisc")
# install.packages("psych")
library(psych) # for psychometric tools and descriptive statistics (e.g., describe())
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
packageDescription("psych") # Shows a description with the information of who created the package, where it is stored in your computer, etc.
## Package: psych
## Version: 2.3.3
## Date: 2023-03-14
## Title: Procedures for Psychological, Psychometric, and Personality
## Research
## Authors@R: person("William", "Revelle", role =c("aut","cre"),
## email="revelle@northwestern.edu", comment=c(ORCID =
## "0000-0003-4880-9610") )
## Description: A general purpose toolbox developed orginally for
## personality, psychometric theory and experimental psychology.
## Functions are primarily for multivariate analysis and scale
## construction using factor analysis, principal component
## analysis, cluster analysis and reliability analysis, although
## others provide basic descriptive statistics. Item Response
## Theory is done using factor analysis of tetrachoric and
## polychoric correlations. Functions for analyzing data at
## multiple levels include within and between group statistics,
## including correlations and factor analysis. Validation and
## cross validation of scales developed using basic machine
## learning algorithms are provided, as are functions for
## simulating and testing particular item and test structures.
## Several functions serve as a useful front end for structural
## equation modeling. Graphical displays of path diagrams,
## including mediation models, factor analysis and structural
## equation models are created using basic graphics. Some of the
## functions are written to support a book on psychometric theory
## as well as publications in personality research. For more
## information, see the <https://personality-project.org/r/> web
## page.
## License: GPL (>= 2)
## Imports: mnormt,parallel,stats,graphics,grDevices,methods,lattice,nlme
## Suggests: psychTools, GPArotation, lavaan, lme4, Rcsdp, graph, knitr,
## Rgraphviz
## LazyData: yes
## ByteCompile: true
## VignetteBuilder: knitr
## URL: https://personality-project.org/r/psych/
## https://personality-project.org/r/psych-manual.pdf
## NeedsCompilation: no
## Packaged: 2023-03-17 21:40:20 UTC; WR
## Author: William Revelle [aut, cre]
## (<https://orcid.org/0000-0003-4880-9610>)
## Maintainer: William Revelle <revelle@northwestern.edu>
## Repository: CRAN
## Date/Publication: 2023-03-18 00:50:02 UTC
## Built: R 4.3.0; ; 2023-05-30 01:13:56 UTC; windows
##
## -- File: C:/Program Files/R/R-4.3.0/library/psych/Meta/package.rds
help(package = "psych") # among other things, lists each of the functions in the package.
library() # Shows a list of all packages you have in your libraries, with a short description of each
browseVignettes(package = "psych") # opens the package's vignettes: long-form guides showing how to use some of its common functions
vignette("dplyr") # calls helpful vignettes for the dplyr package
2.2.1.2 Be careful: package order matters
A word of caution: Sometimes, the order in which you load packages into R matters. Consider the following chunks of code:
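For example, here is a sketch of the first chunk, consistent with the messages and output below (psych and the tidyverse were already loaded above):

library(Hmisc) # loaded after psych
describe(d$mpg)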
## Warning: package 'Hmisc' was built under R version 4.3.1
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## d$mpg
## n missing distinct Info Mean Gmd .05 .10
## 32 0 25 0.999 20.09 6.796 12.00 14.34
## .25 .50 .75 .90 .95
## 15.43 19.20 22.80 30.09 31.30
##
## lowest : 10.4 13.3 14.3 14.7 15 , highest: 26 27.3 30.4 32.4 33.9
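And a sketch of the second chunk (this load order assumes a fresh session, or that both packages were detached first):

library(Hmisc)
library(psych) # loaded after Hmisc
describe(d$mpg)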
## Warning: package 'Hmisc' was built under R version 4.3.1
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 32 20.09 6.03 19.2 19.7 5.41 10.4 33.9 23.5 0.61 -0.37 1.07
What did you notice? Even though we used the exact same command to describe the variable mpg, we get different output! Herein lies one inherent problem with packages: different packages sometimes use the same name for their functions, especially when it's something generic, like describe(). So, when we load packages whose functions share the same name, it's impossible for R to know which one we want. Consequently, R will always use the function from the most recently loaded package: in the first chunk R used the function from the Hmisc package, and in the second chunk it used the function from the psych package.
If we want to use both functions, we can force which package R pulls the function from using two colons, like this: package::function.
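For example (a sketch matching the two outputs below):

Hmisc::describe(d$mpg)
psych::describe(d$mpg)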
## d$mpg
## n missing distinct Info Mean Gmd .05 .10
## 32 0 25 0.999 20.09 6.796 12.00 14.34
## .25 .50 .75 .90 .95
## 15.43 19.20 22.80 30.09 31.30
##
## lowest : 10.4 13.3 14.3 14.7 15 , highest: 26 27.3 30.4 32.4 33.9
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 32 20.09 6.03 19.2 19.7 5.41 10.4 33.9 23.5 0.61 -0.37 1.07
2.2.2 Good practices with packages
When dealing with packages, it is important that we do so carefully and intentionally so that they operate correctly within our R script. Here are some good practices for dealing with packages:
- Because you only install packages once, comment out the install.packages() code once you have installed the packages on your computer. It's a good idea to leave it in your script, though; it makes the code more easily understandable and reproducible by others (and yourself) later.
- Keep your commented-out install.packages() code, as well as your library() code, in a chunk at the beginning of your Markdown. This reduces problems with trying to run code without first loading packages, and it keeps everything in one central location you can easily reference. It also helps you spot issues created by loading packages like Hmisc and psych in the wrong order.
- Next to each library() call, write a comment for why you have that library loaded (see the top of this document for an example).
- When you come across problems with installing or loading packages, check the console for messages from R on what's wrong; these are often useful.
2.3 Part II: Tidy data manipulation using dplyr
2.3.1 dplyr intro
One of the most useful packages you can use is actually a bundle of many packages called tidyverse (that is, a "universe of 'tidy' packages"). You can see a list of the packages included in tidyverse and learn more about them here.
dplyr, one package within tidyverse, is especially useful for data manipulation. But what exactly is data manipulation? Data manipulation is the process of preparing your data for analysis. Rarely (if ever) do we deal with data that is "clean" from the start; that is, data that is perfectly organized and ready for analysis. So, you have to clean up your data to get it in good enough condition to analyze easily and clearly.
2.3.2 Why learn all of this first if I just want to analyze my data?
Here's the bottom line: you will have to do this. Honestly, you probably won't even begin to answer the questions you want to ask of your data without first preparing it for those questions. It's like making a recipe: you almost always have to prep the ingredients in some way. You could throw a whole onion and a few carrots into your soup, but it probably won't be what you expected if you don't chop them up and simmer them for a while first.
As you get your data and start working with it, you’ll realize what data cleaning problems you have to work through for your specific questions. To give you some kind of framework for what these could look like, here’s a taste of what preparing your data for analysis might look like:
Subsetting or pruning some of your data
You might want to:
- delete useless columns that came with your data. For example, data from Qualtrics (a popular survey software) often has many columns you don’t care about or actually should get rid of to keep your data anonymous (i.e., IP addresses)
- delete rows that contain no data or are deemed as low quality data through the use of attention checks, etc. in your study
- only analyze one part of your data, such as looking only at 12th graders, not 9th-11th graders.
- only keep people who meet certain criteria.
Creating new variables for analysis
You might want to:
- Create scores for composite variables. For example, maybe you are interested in looking at the correlation between self-esteem and well-being from an online survey; you would first need to average (or sum) scores from the various self-esteem items and do the same thing for the well-being items.
- Reverse score variables, such as negatively valenced self-esteem questions (“I feel I do not have much to be proud of”), that you want to combine into an index of self-esteem, where higher scores mean more self-esteem.
- Log variables, such as income, that are often very skewed; or square variables that you want to look at quadratically.
- Scale scores, such as taking the number of correct answers a person has on a test and dividing it by the number of items on the test to get a percentage correct.
Summarize some of your variables
You might want to:
- Get the means and standard deviations for your main outcome variables.
- Get those means and standard deviations as they differ across treatment groups.
- Take more granular data and make it less granular for your analyses. For example, say you have football statistics for every game of a season, but you want to analyze season-level statistics and so need to sum the game-by-game data. Or, you have heart rate data and would like to analyze minute-by-minute heart rate rather than second-by-second heart rate.
Here's the good news: all of these things can be done with skills you will learn in this workshop. Hopefully this gives you some context for what we will learn to do today and how you could apply it to your own work!
2.3.3 Overview of basic dplyr variables
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges (such as those listed above). Here are some of the main features of the package:
- select() selects variables (columns), and can rename them as it does so.
- filter() selects rows that fit one or more logical expressions.
- rename() renames columns while keeping all of them (note the similarity to select(), which keeps only the columns you name).
- mutate() adds new variables that are functions of existing variables.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
These all combine naturally with group_by(), which allows you to perform any operation by group. You can learn more about them in vignette("dplyr").
In addition to these basic dplyr "verbs" for manipulating data, dplyr also has functions for more advanced data manipulation, like merging different data sets together (e.g., left_join()); its companion tidyverse package tidyr covers pivoting data sets (e.g., pivot_wider()). We include some details on how to get started with those sorts of operations in an Appendix at the end of this workshop.
2.3.4 Select certain variables, i.e. columns (select)
select allows you to subset your dataset by column (i.e., by variables). If we wanted our dataset to only include information on the miles per gallon and number of cylinders of each car, we would use select in the following way:
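A sketch of that chunk (head() just limits the printout to the first six cars):

# select(dataset, variable1, variable2, ...)
head(select(d, mpg, cyl))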
## mpg cyl
## 1 21.0 6
## 2 21.0 6
## 3 22.8 4
## 4 21.4 6
## 5 18.7 8
## 6 18.1 6
Try it for yourself: create a dataset that consists only of the number of gears and the weight of each car.
2.3.5 Filter to certain observations, i.e. rows (filter)
filter allows you to subset your dataset by row (i.e., by observations), using logical arguments.
Why use logical arguments rather than just selecting a specific row, like select does with columns? The basic assumption is that your columns are important variables or features of the people in your data, while each row represents the data for a specific person. R assumes that we are often interested in individual variables, or a specific range of variables, but rarely in one specific person. Rather, if we want to look at specific people (i.e., rows of data), it's usually a "type" of row: rows that meet a specific criterion, such as people of a certain demographic, students with a certain grade, or people who responded to a survey question in a certain way. Thus, we use logical arguments (==, >, <, etc.) to capture those distinctions when filtering rows.
If we wanted our dataset to only include cars that had automatic transmission, we would use filter in the following way:
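A sketch of that chunk (in this data set, am is 0 for automatic and 1 for manual; see the factor labels later in this workshop):

# filter(dataset, logical_expression)
head(filter(d, am == 0))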
## carID mpg cyl disp hp drat wt qsec vs am gear carb qual_eng
## 1 89829393 21.4 6 258.0 110 3.08 3215 19.44 1 0 3 1 4
## 2 96393949 18.7 8 360.0 175 3.15 3440 17.02 0 0 3 2 5
## 3 38278732 18.1 6 225.0 105 2.76 3460 20.22 1 0 3 1 4
## 4 98312322 14.3 8 360.0 245 3.21 3570 15.84 0 0 3 4 6
## 5 35991694 24.4 4 146.7 62 3.69 3190 20.00 1 0 4 2 5
## 6 93784711 22.8 4 140.8 95 3.92 3150 22.90 1 0 4 2 3
## qual_trans qual_bod car_prob
## 1 2 2 2
## 2 4 5 4
## 3 5 6 3
## 4 7 4 6
## 5 1 7 4
## 6 3 7 2
Try it for yourself: create a dataset that only includes cars that get more than 25 miles per gallon.
2.3.6 Rename your variables (rename)
rename lets you give your variables new, clearer names. For example, we could rename wt to weight and cyl to cylinders:
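A sketch of that chunk:

# rename(dataset, new_name = old_name)
rename(d, weight = wt, cylinders = cyl)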
By looking at how we use select, filter, and rename, can you see a pattern in how dplyr functions are set up?
You'll notice that rather than using the $ operator between a data set and a variable (like this: data$variable), dplyr always takes the data set as the first argument of the function (the d in the chunk above), and always takes what we are doing to the rows or variables in the arguments after that (the weight = wt, cylinders = cyl in the chunk above).
One other thing you'll notice: rename uses the form new_name = old_name, where the name you want to give the variable comes before the equals sign, and its current name comes after it. As a shortcut to renaming, this actually also works in select: you can rename variables as you are selecting them, instead of having to do both steps separately. For example, the following code selects just wt and cyl, but also renames them at the same time:
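A sketch of that chunk:

select(d, weight = wt, cylinders = cyl)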
Actually, this method of renaming variables using new_name = something_else is common to many dplyr verbs; for example, dplyr's functions that create new variables, not just rename them (mutate and summarise), use it too. So it's useful to remember!
2.3.7 Create a new variable based on another variable (or set of variables) (mutate)
mutate allows you to create new variables. Often, this involves manipulating existing variables. For example, some useful things you can do are changing the type of a variable, combining multiple variables, or calculating statistics of variables.
If we wanted to turn the transmission variable into a factor (see the last workshop for an explanation of what a factor is), create a variable for the mean miles per gallon, and compute a variable representing the square of car weight, we would use mutate in the following way:
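A sketch of that chunk (wt^2 squares each car's weight; mean(mpg, na.rm = TRUE) fills every row with the overall mean):

mutate(d,
       am = factor(am),
       mpg_mean = mean(mpg, na.rm = TRUE),
       wt_sq = wt^2)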
## carID mpg cyl disp hp drat wt qsec vs am gear carb qual_eng
## 1 37747171 21.0 6 160.0 110 3.90 2620 16.46 0 1 4 4 6
## 2 31796991 21.0 6 160.0 110 3.90 2875 17.02 0 1 4 4 1
## 3 72673293 22.8 4 108.0 93 3.85 2320 18.61 1 1 4 1 2
## 4 89829393 21.4 6 258.0 110 3.08 3215 19.44 1 0 3 1 4
## 5 96393949 18.7 8 360.0 175 3.15 3440 17.02 0 0 3 2 5
## 6 38278732 18.1 6 225.0 105 2.76 3460 20.22 1 0 3 1 4
## 7 98312322 14.3 8 360.0 245 3.21 3570 15.84 0 0 3 4 6
## 8 35991694 24.4 4 146.7 62 3.69 3190 20.00 1 0 4 2 5
## 9 93784711 22.8 4 140.8 95 3.92 3150 22.90 1 0 4 2 3
## 10 45281321 19.2 6 167.6 123 3.92 3440 18.30 1 0 4 4 3
## 11 91316755 17.8 6 167.6 123 3.92 3440 18.90 1 0 4 4 1
## 12 28844296 16.4 8 275.8 180 3.07 4070 17.40 0 0 3 3 1
## 13 63232163 17.3 8 275.8 180 3.07 3730 17.60 0 0 3 3 7
## 14 38421526 15.2 8 275.8 180 3.07 3780 18.00 0 0 3 3 3
## 15 24466383 10.4 8 472.0 205 2.93 5250 17.98 0 0 3 4 5
## 16 97133637 10.4 8 460.0 215 3.00 5424 17.82 0 0 3 4 5
## 17 36825326 14.7 8 440.0 230 3.23 5345 17.42 0 0 3 4 5
## 18 33873446 32.4 4 78.7 66 4.08 2200 19.47 1 1 4 1 6
## 19 79797615 30.4 4 75.7 52 4.93 1615 18.52 1 1 4 2 7
## 20 34178155 33.9 4 71.1 65 4.22 1835 19.90 1 1 4 1 3
## 21 48249435 21.5 4 120.1 97 3.70 2465 20.01 1 0 3 1 1
## 22 37845749 15.5 8 318.0 150 2.76 3520 16.87 0 0 3 2 5
## 23 26692995 15.2 8 304.0 150 3.15 3435 17.30 0 0 3 2 6
## 24 62229968 13.3 8 350.0 245 3.73 3840 15.41 0 0 3 4 4
## 25 34936485 19.2 8 400.0 175 3.08 3845 17.05 0 0 3 2 2
## 26 44187828 27.3 4 79.0 66 4.08 1935 18.90 1 1 4 1 2
## 27 96766418 26.0 4 120.3 91 4.43 2140 16.70 0 1 5 2 1
## 28 66566612 30.4 4 95.1 113 3.77 1513 16.90 1 1 5 2 6
## 29 36635492 15.8 8 351.0 264 4.22 3170 14.50 0 1 5 4 2
## 30 52695278 19.7 6 145.0 175 3.62 2770 15.50 0 1 5 6 7
## 31 67465929 15.0 8 301.0 335 3.54 3570 14.60 0 1 5 8 6
## 32 72536977 21.4 4 121.0 109 4.11 2780 18.60 1 1 4 2 4
## qual_trans qual_bod car_prob mpg_mean wt_sq
## 1 2 1 2 20.09062 6864400
## 2 4 1 6 20.09062 8265625
## 3 2 1 6 20.09062 5382400
## 4 2 2 2 20.09062 10336225
## 5 4 5 4 20.09062 11833600
## 6 5 6 3 20.09062 11971600
## 7 7 4 6 20.09062 12744900
## 8 1 7 4 20.09062 10176100
## 9 3 7 2 20.09062 9922500
## 10 7 2 2 20.09062 11833600
## 11 6 6 6 20.09062 11833600
## 12 5 4 6 20.09062 16564900
## 13 5 1 7 20.09062 13912900
## 14 3 2 3 20.09062 14288400
## 15 2 7 3 20.09062 27562500
## 16 5 2 6 20.09062 29419776
## 17 4 2 5 20.09062 28569025
## 18 3 3 4 20.09062 4840000
## 19 7 6 3 20.09062 2608225
## 20 6 4 2 20.09062 3367225
## 21 1 2 6 20.09062 6076225
## 22 3 7 3 20.09062 12390400
## 23 1 5 5 20.09062 11799225
## 24 7 2 7 20.09062 14745600
## 25 2 7 4 20.09062 14784025
## 26 5 6 5 20.09062 3744225
## 27 5 4 7 20.09062 4579600
## 28 6 7 5 20.09062 2289169
## 29 5 4 1 20.09062 10048900
## 30 6 4 2 20.09062 7672900
## 31 6 5 3 20.09062 12744900
## 32 5 6 6 20.09062 7728400
Pro tip: notice how I organized my code above, by spacing down after each comma. Because functions like rename, select, and mutate can be used on any number of variables, organizing your code this way, one operation per line, helps keep it clean and readable.
Try it for yourself: create a new variable that tells us the amount of horsepower (hp) each car has per cylinder (cyl). Name this variable hp_cyl.
2.3.8 Get summary statistics of a variable (summarise)
summarise allows us to calculate summary statistics for specific variables in our dataset. You can think of this as doing two things at once: it manipulates a variable, like mutate does, but in a way that summarizes, rather than changes, the data. For example, if you wanted the mean of variable A, it does that operation, using the same format as mutate, but gives you just one number: the mean.
If we wanted to know the mean miles per gallon of all the cars in our dataset, we would use summarise in the following way:
#summarise(dataset, summary_variable = function(old_variable))
summarise(d, mpg_mean = mean(mpg, na.rm = TRUE))
## mpg_mean
## 1 20.09062
Pro Tip: na.rm = TRUE tells the mean function to drop missing values, which R denotes with NA, from the calculation ('na.rm' means "remove NAs").
Try it for yourself: calculate the median miles per gallon.
And with that, we can get the variables and rows from our data that we want, rename variables, create new ones, and summarize the data we have! These few functions can take you a surprisingly long way in cleaning many types of data. Now let's look at a few more things that will make these functions easier to use.
2.3.9 Doing a series of functions all at once - using a 'pipe' (%>%)
Sometimes we want to use multiple functions at the same time to do multiple transformations of the data. For example, we might want to select certain variables and then filter to certain observations and save this new subset of data to a new dataset object. Instead of typing many lines of code to do this, we can use a convenient shortcut, called a ‘pipe’, which strings together functions into one coding “sentence”, if you will, to do multiple operations in a single sequence.
Pipes can be a really useful tool to use throughout your coding in R, and there are multiple types of pipes (we’ll learn more about other types in a minute).
When you are using or reading lines of code with pipes, you can read each pipe as saying "and then". For example, select() %>% mutate() could be read as "select certain variables and then mutate those variables".
Here's what it would look like. Below, we're going to use the dataset d, and then select mpg, disp, and gear, and then filter to only include cars with 4 gears, and assign this to a new dataset object called d2.
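A sketch of that chunk:

d2 <- d %>%
  select(mpg, disp, gear) %>%
  filter(gear == 4)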
Try it for yourself: create a new dataset d3 using d, and then select only cyl, hp, and wt, and then filter to cars with 8 cylinders, and then calculate the mean horsepower for these cars.
2.3.9.1 Another pipe: Compound assignment (%<>%)
Run each of the lines below separately and see what happens. After running each line, check out d in your global environment (the top right part of the screen). What's different?
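A sketch of those two lines (the %<>% operator comes from the magrittr package; lwt is the log of car weight):

d %>% mutate(lwt = log(wt))  # prints the result; d is unchanged
d %<>% mutate(lwt = log(wt)) # writes the result back over d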
That's right: the %<>% pipe back-assigns the output of the operations to overwrite d. That is, whereas a normal pipe only works one way (using d, it computes a new variable and shows us the output), this compound assignment pipe computes the new variable lwt using mutate and then assigns the result back to d. Think of the extra < in the operator as saying, "and then write over the object on the left side of the operator". So this "writes over" your data set with the new additional variable, lwt. A different way to do this exact same thing is:
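That is (a sketch, matching the lwt example above):

d <- d %>% mutate(lwt = log(wt))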
Either works! Sometimes I like using the compound assignment pipe (d %<>% ...) because it is more succinct, but you might find overwriting your dataset with d <- d %>% ... clearer and easier for remembering exactly what you did.
2.3.9.2 And one more: The money pipe (%$%)
The following code uses a handy function called with(), which creates a plot "with" the dataset d. This can be useful so that you don't have to write out both (or even more) variables using the $ operator (plot(d$mpg ~ d$wt)).
How might we rewrite the following code using a pipe?
Though we might intuit that we could say d %>% plot(mpg ~ wt), that actually wouldn't work. We would have to write it as follows:
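A sketch (the %$% operator, like %<>%, comes from magrittr):

d %$% plot(mpg ~ wt)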
Just like the $ that we use when calling a variable from a dataset (e.g., d$mpg), this pipe helps call the data and variables for certain functions. The %$% pipe enables us to call variables from the data set for functions that don't normally have a data argument; that is, any function that would normally want its variables called using a $.
In this case, since you can't just include the data as the first argument of the plot() function like you can for the dplyr functions, you would normally have to use the with() function to call the data. Think of other functions like this that we used in the last workshop: mean(d$wt), for example, could be written as with(d, mean(wt)) or d %$% mean(wt).
So, if we use the normal pipe operator (%>%) to call the data (d %>% plot()), the plot() function still isn't able to recognize the variables, because it doesn't have a data argument. Here, we can use %$%, which allows us to call and use the data set, even for functions without a data argument.
2.3.10 Doing dplyr operations by group (group_by)
In our research, we are often interested in comparing different groups, or manipulating data by different groups. Try thinking of a few different groupings you might be interested in looking at or comparing in data.
If you want to calculate a statistic by group in your data, you can use group_by, a pipe, and summarise. So, for example, if we wanted to calculate the mean miles per gallon for cars with automatic vs. manual transmissions, we would use group_by in the following way:
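A sketch of that chunk, consistent with the output below:

d %>%
  group_by(am) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))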
## # A tibble: 2 × 2
## am mean_mpg
## <dbl> <dbl>
## 1 0 17.1
## 2 1 24.4
Try it for yourself: calculate the average weight of cars based on the number of cylinders that they have.
2.3.11 Bringing it all together
2.3.11.1 How can we use these tools to build effective code?
With these tools, we can now build many combinations of functions to manipulate our data how we want. Remember: the ultimate, practical reason for these functions is to allow you to clean and manipulate your data in the way you need to for your research. As we go through the next examples, try to think of ways you can apply this to the research questions and data you might be interested in.
Consider one example below. We are interested in examining the means of mpg and wt, but only for groups of cars whose mean mpg is above 20 (i.e., moderately well performing cars). Additionally, we want to see how cars with different numbers of cylinders and transmission types differ on mpg and wt; so, we also organize these results by number of cylinders (cyl) and transmission type (am). Here is one way we could write this code, utilizing the functions group_by(), select(), summarise(), and filter().
grouped_cars <- group_by(d, cyl, am)
cars_data <- select(grouped_cars, cyl, am, wt, mpg)
summarized_mpg <- summarise(cars_data,
                            wt.mean = mean(wt, na.rm = TRUE),
                            mpg.mean = mean(mpg, na.rm = TRUE))
final_result <- filter(summarized_mpg, mpg.mean > 20)
final_result
## `summarise()` has grouped output by 'cyl'. You
## can override using the `.groups` argument.
## # A tibble: 3 × 4
## # Groups: cyl [2]
## cyl am wt.mean mpg.mean
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 2935 22.9
## 2 4 1 2042. 28.1
## 3 6 1 2755 20.6
OK. Yes, that is one way to do it... but I hope you are normal like me and agree that it requires some mental gymnastics to follow. And, I mean, come on, it's not very efficient, right? You have to create three different objects (grouped_cars, cars_data, summarized_mpg) that serve no purpose except to get you to final_result.
Now consider the code below; though it looks a bit different, it does the exact same thing as the chunk above!
This new code, however, utilizes pipes (%>%) to make the code more succinct, more organized, and more intuitive. Instead of creating many new objects, it simply says: "Using data set d, group by cyl and am, and then, using only cyl, am, wt, and mpg, summarize the means of each group, and then only keep rows where the mean of mpg is above 20." Each line of code builds upon the last, allowing us to get the same result as above with less code.
d %>%
group_by(cyl, am) %>%
select(cyl, am, wt, mpg) %>%
summarise(wt.mean = mean(wt, na.rm = TRUE), mpg.mean = mean(mpg, na.rm = TRUE)) %>%
filter(mpg.mean > 20)
## `summarise()` has grouped output by 'cyl'. You
## can override using the `.groups` argument.
## # A tibble: 3 × 4
## # Groups: cyl [2]
## cyl am wt.mean mpg.mean
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 2935 22.9
## 2 4 1 2042. 28.1
## 3 6 1 2755 20.6
2.3.11.2 Solving our most common data manipulation challenges using the grammar of dplyr
Now that we have these tools from dplyr and have learned its basic 'grammar' of data manipulation, we can solve some of our common data manipulation challenges in psychological research. Here are three: recoding variables, creating composite variables, and centering. Can you think of others that would be useful for your research?
1. Recoding variables
Our data set includes 4 items of car quality as rated by expert mechanics. Each car gets a score from 1 (Very poor) to 7 (Excellent) for engine quality (qual_eng), transmission quality (qual_trans), and body quality (qual_bod). Additionally, mechanics were asked, "How frequently do you estimate this car is apt to have problems?" on a scale from 1 (not at all frequently) to 7 (extremely frequently) (car_prob). Together, these 4 items create the Car Quality Index (CQI). In our research problem here, we are most interested in looking at the index as one score rather than 4 separate scores.
But we can't just combine the items together as they are: whereas the other variables indicate higher car quality with higher values, car_prob indicates lower car quality with higher values, and so it must be recoded before the items can be combined into an index.
You can recode Likert-type items by subtracting each value from one more than the maximum of the scale. Thus, a score that was 7, subtracted from 8, now becomes 1; 6 becomes 2; 5 becomes 3; and so on.
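A sketch of the recoding step (the scale maximum is 7, so we subtract from 8):

d %<>% mutate(car_prob_r = 8 - car_prob)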
Note that my convention is to create a new variable that has the same name as the original variable with _r added on the end. This lets me keep both variables in my dataset but makes it clear that the new one is recoded. Having some kind of rule or convention you use for recoding is important so that you don't lose track of which variables have been reversed and which have not.
2. Creating composite variables
Now we can combine our 4 items into an index (i.e., a composite variable that is the mean of the ratings for each car). To make this easiest, a friend of mine made a handy function for computing composite variables; here, we will call it gen_comp() (i.e., 'generate composite'). There are two steps to creating a composite variable:
1. Create a vector (i.e., a grouped list) of each of the items that will go into the composite. You can use this vector as an object to compute descriptives, build tables, etc. with all of the variables in the vector at once. It's also handy for creating composite variables.
2. Input the new variable name and the name of the vector it is pulling from into the function gen_comp().
## The code for our new function (no need, at this point, to understand this code completely)
gen_comp <- function(data, comp, vector){
  comp <- enquo(comp) # capture the new variable's name
  data %>%
    rowwise() %>% # operate row by row, so mean() gives a row mean
    mutate(!!quo_name(comp) := mean(c(!!!vector), na.rm = TRUE)) %>% # average the items
    ungroup() # turn rowwise mode back off
}
## Step 1
vector_cqi <- quos(qual_eng, qual_trans, qual_bod, car_prob)
## Step 2
d %<>% gen_comp(comp = cqi, vector = vector_cqi)
3. Centering
Another common manipulation we may want to do is mean centering our data. Mean centering means that, for a given variable, the mean becomes 0 and all other scores are presented in terms of their relative distance from the mean (negative for below the mean and positive for above the mean).
Here, I've created a really simple function (var.center) that takes advantage of the function scale() for mean centering our data. Normally, scale() will standardize the data (mean zero, with each score representing how many standard deviations it is away from the mean). However, we can tell it to just subtract the mean from scores by specifying scale = FALSE. To make it easier, we have made a function that does this but only requires you to input the variables you want to operate on. For other ways to center, and for the original code of this custom function, check out this website.
Alternatively, you can center old school: compute means for each variable and subtract that mean from the variable.
## Code for the new function
var.center <- function(x) {
scale(x, scale = FALSE)
}
## example of new function
d %<>% mutate(
wt_c2 = var.center(wt),
cqi_c2 = var.center(cqi)
)
## doing it the old school way
d %<>% mutate(
mean_wt = mean(wt),
wt_c = wt - mean_wt,
mean_cqi = mean(cqi),
cqi_c = cqi - mean_cqi
)
d %>% select(wt_c, wt_c2, cqi_c, cqi_c2) #Notice that both methods give you the same results
## # A tibble: 32 × 4
## wt_c wt_c2[,1] cqi_c cqi_c2[,1]
## <dbl> <dbl> <dbl> <dbl>
## 1 -597. -597. -1.40 -1.40
## 2 -342. -342. -1.15 -1.15
## 3 -897. -897. -1.40 -1.40
## 4 -2.25 -2.25 -1.65 -1.65
## 5 223. 223. 0.352 0.352
## 6 243. 243. 0.352 0.352
## 7 353. 353. 1.60 1.60
## 8 -27.2 -27.2 0.102 0.102
## 9 -67.2 -67.2 -0.398 -0.398
## 10 223. 223. -0.648 -0.648
## # ℹ 22 more rows
And many other useful transformations and manipulations:
# rescaling, computing the log
d %<>% mutate(
wt_s = wt/1000, # scaling weight down by a factor of 1,000
lmpg = log(mpg) # creating the log of mpg
)
Here are some other useful things you can now do:
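For instance, group means by two grouping variables at once (a sketch consistent with the output below):

d %>%
  group_by(cyl, am) %>%
  summarise(mn = mean(mpg, na.rm = TRUE))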
## `summarise()` has grouped output by 'cyl'. You
## can override using the `.groups` argument.
## # A tibble: 6 × 3
## # Groups: cyl [3]
## cyl am mn
## <dbl> <dbl> <dbl>
## 1 4 0 22.9
## 2 4 1 28.1
## 3 6 0 19.1
## 4 6 1 20.6
## 5 8 0 15.0
## 6 8 1 15.4
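Or a quick summary of several variables at once (again a sketch matching the output below):

d %>% select(qual_eng, qual_trans, qual_bod) %>% summary()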
## qual_eng qual_trans qual_bod
## Min. :1 Min. :1.000 Min. :1.000
## 1st Qu.:2 1st Qu.:2.750 1st Qu.:2.000
## Median :4 Median :5.000 Median :4.000
## Mean :4 Mean :4.219 Mean :4.125
## 3rd Qu.:6 3rd Qu.:6.000 3rd Qu.:6.000
## Max. :7 Max. :7.000 Max. :7.000
#d %>% select(mpg) %$% describe(.) %>% round(., digits = 2)
#d %>% select(!!!vector_cqi) %>% psych::describe() %>% round(., digits = 2)
d %>%
rowwise() %>% #rowwise() used to do an operation by rows rather than columns--row means, etc.
mutate(mymean=mean(c(cyl,mpg))) %>%
select(cyl, mpg, mymean)
## # A tibble: 32 × 3
## # Rowwise:
## cyl mpg mymean
## <dbl> <dbl> <dbl>
## 1 6 21 13.5
## 2 6 21 13.5
## 3 4 22.8 13.4
## 4 6 21.4 13.7
## 5 8 18.7 13.4
## 6 6 18.1 12.0
## 7 8 14.3 11.2
## 8 4 24.4 14.2
## 9 4 22.8 13.4
## 10 6 19.2 12.6
## # ℹ 22 more rows
## Another example: a correlation matrix of the CQI items
d %>% select(!!!vector_cqi) %>% as.matrix() %>% Hmisc::rcorr()
## qual_eng qual_trans qual_bod car_prob
## qual_eng 1.00 0.11 0.08 -0.21
## qual_trans 0.11 1.00 0.01 0.10
## qual_bod 0.08 0.01 1.00 -0.21
## car_prob -0.21 0.10 -0.21 1.00
##
## n= 32
##
##
## P
## qual_eng qual_trans qual_bod car_prob
## qual_eng 0.5472 0.6487 0.2528
## qual_trans 0.5472 0.9615 0.6004
## qual_bod 0.6487 0.9615 0.2483
## car_prob 0.2528 0.6004 0.2483
Though dplyr manages to solve most of the data manipulation challenges we will come across with just a handful of functions, those functions may still be confusing or difficult to remember at first. If only there was a way to remember all of these functions... Wait, there's a cheatsheet for dplyr, too?? Wow, RStudio, you have really outdone yourself this time.
2.3.11.3 Changing variable type
Remember the four variable types we talked about last time? We often need to change variables from one type to another; each type acts differently in analyses, and we have to make sure we have the right type for what we are trying to do.
For example, we might want to treat some variables as qualitative, nominal factors rather than continuous, numeric integers. In R, we must specify which variables to treat as factors if the levels (i.e., unique values) of the variable are composed of numbers instead of strings. Note that if a variable's levels (e.g., for "ID") start with a letter (e.g., "subject1", "subject2"), R will automatically interpret the variable as a factor; if the levels start with a number (e.g., "1", "2"), R will automatically interpret it as an integer. If you want the variable interpreted differently, you have to tell R.
For instance, the variable mpg is continuous, but am is not. However, since the levels of am are indicated with numbers, we must tell R to treat am as a factor:
## the function factor() converts to a factor and the option labels specifies names to assign to the levels
d %<>%
mutate(am = factor(am, labels = c("Auto", "Manual")))
#Other alternative
#d$am = factor(d$am, labels = c("Auto", "Manual"))
Now we can look at the structure of the d data frame again, to make sure am is now a factor:
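A sketch of that check (str() produces the structure shown below):

str(d)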
## tibble [32 × 27] (S3: tbl_df/tbl/data.frame)
## $ carID : num [1:32] 37747171 31796991 72673293 89829393 96393949 ...
## $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num [1:32] 160 160 108 258 360 ...
## $ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num [1:32] 2620 2875 2320 3215 3440 ...
## $ qsec : num [1:32] 16.5 17 18.6 19.4 17 ...
## $ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## $ am : Factor w/ 2 levels "Auto","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear : num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
## $ qual_eng : num [1:32] 6 1 2 4 5 4 6 5 3 3 ...
## $ qual_trans: num [1:32] 2 4 2 2 4 5 7 1 3 7 ...
## $ qual_bod : num [1:32] 1 1 1 2 5 6 4 7 7 2 ...
## $ car_prob : num [1:32] 2 6 6 2 4 3 6 4 2 2 ...
## $ lwt : num [1:32] 7.87 7.96 7.75 8.08 8.14 ...
## $ car_prob_r: num [1:32] 6 2 2 6 4 5 2 4 6 6 ...
## $ cqi : num [1:32] 2.75 3 2.75 2.5 4.5 4.5 5.75 4.25 3.75 3.5 ...
## $ wt_c2 : num [1:32, 1] -597.25 -342.25 -897.25 -2.25 222.75 ...
## ..- attr(*, "scaled:center")= num 3217
## $ cqi_c2 : num [1:32, 1] -1.398 -1.148 -1.398 -1.648 0.352 ...
## ..- attr(*, "scaled:center")= num 4.15
## $ mean_wt : num [1:32] 3217 3217 3217 3217 3217 ...
## $ wt_c : num [1:32] -597.25 -342.25 -897.25 -2.25 222.75 ...
## $ mean_cqi : num [1:32] 4.15 4.15 4.15 4.15 4.15 ...
## $ cqi_c : num [1:32] -1.398 -1.148 -1.398 -1.648 0.352 ...
## $ wt_s : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## $ lmpg : num [1:32] 3.04 3.04 3.13 3.06 2.93 ...
2.3.11.4 Creating factors from continuous variables
Sometimes, we may want to break up a continuous variable into intervals (e.g., for age: 18-24, 25-30, 30+). In our dataset, for simplicity, we might want to look at gas mileage as an ordered factor, not a quantitative variable. By making mpg into a factor, we are able to group cars into categories based on their respective mpg. So let us create a new factor, mpg_cat, which can be 'low', 'medium', or 'high'. Given the mpg variable, we can create a new categorical variable (i.e., factor) by specifying breaks at specific intervals (see below).
Here, we use the dplyr function case_when(), which says, "when this is true" (the left side of the ~), "make this happen" (the right side of the ~). So, for the first line: "when mpg is less than 17, give the new variable mpg_cat the value 'Low'." Then we use the function ordered() to specify that low, medium, and high go in a specific order (as opposed to levels like red, blue, and yellow, which have no inherent order).
d %<>%
mutate(
mpg_cat = case_when(mpg < 17 ~ "Low",
mpg >= 17 & mpg < 24 ~ "Medium",
mpg >= 24 ~ "High"),
mpg_cat = ordered(mpg_cat,levels = c("Low","Medium","High")))
These break points result in 3 mpg categories: below 17, 17 to 23.9, and 24 and up. We can also visualize these groups:
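One quick way to do so (a sketch; plotting a factor in base R gives a bar chart of its counts):

plot(d$mpg_cat)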
2.3.12 Saving your data and going home
After you have manipulated your variables, you will save a new dataset. This can then be used in data analysis. It is best to create one markdown for data manipulation, clean your data, and then save a new, totally clean and ready-to-go data file to use for your analyses in a new markdown.
Something important: Unlike SPSS and Excel, the default in R is not to save your computed variables. When we work with R, we import data into a new space (a data frame) and then work within that space. In SPSS and Excel, you’re always editing the source file and therefore all of your changes can be saved. If you want to save your computed variables in a .csv file, you’ll need to write a new file. But fear not–there’s a simple command that does just that. Let’s say we want to save our newly computed variables and cleaned data set into a permanent R data file. We’d do this:
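A sketch using readr's write_rds() (readr loads with the tidyverse):

write_rds(d, "data.clean.rds")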
Notice that we saved our new data file as an .rds file; this file extension designates an R data file. When we want to read the data into a new markdown, we can use the code read_rds('data.clean.rds').
2.4 Review: End Notes
2.4.1 Feedback
As a learner, your superpower is knowing what is and isn’t working for your learning. If you have 2 minutes, we would love if you shared your superpower with us!
Scan the QR code below with your phone to provide brief feedback on this workshop:
2.4.2 What’s an R Markdown again?
This is the main kind of document that I use in RStudio, and it's the primary advantage of RStudio over the base R console. R Markdown allows you to create a file with a mix of R code and regular text, which is useful if you want to have explanations of your code alongside the code itself. This document, for example, is an R Markdown document. It is also useful because you can export your R Markdown file to an html page or a pdf, which comes in handy when you want to share your code or a report of your analyses with someone who doesn't have R. If you're interested in learning more about the functionality of R Markdown, you can visit this webpage.
R Markdowns use chunks to run code. A chunk is designated by starting with ```{r} and ending with ``` ; between these markers is where you will write your code. A new chunk can be created by pressing COMMAND + ALT + I on Mac, or CONTROL + ALT + I on PC.
You can run lines of code by highlighting them and pressing COMMAND + ENTER on Mac, or CONTROL + ENTER on PC. If you want to run a whole chunk of code, you can press COMMAND + ALT + C on Mac, or CONTROL + ALT + C on PC. Alternatively, you can run a chunk of code by clicking the green right-facing arrow at the top-right corner of each chunk. The downward-facing arrow directly left of the green arrow will run all code up to that point.
2.4.3 Some useful resources to continue your learning
A useful resource, in my opinion, is the Stack Overflow website. Because this is a general-purpose resource for programming help, it is useful to use the R tag ([R]) in your queries. A related resource is the statistics Stack Exchange (Cross Validated), which is like Stack Overflow but focused more on the underlying statistical issues.
One of the best resources for learning how to use R well, in a "tidy" way, is R for Data Science (R4DS). This contains a good intro to using dplyr, as well as a solid general intro to R.