One of the more common data visualizations is the display of bivariate data in an XY scatterplot. In R there are numerous ways of obtaining basic and enhanced versions of the scatterplot. For quick EDA, simple methods exist. For presentation graphics and publication-quality products, more effort is required. This document surveys that range of possibilities using base system graphics, add-on package functions, the ggplot2 approach, which is fast becoming a de facto R standard, and a Plotly implementation for R.
At the end of the document, two approaches to 3D scatterplots are shown for trivariate data. Doing 3D scatterplots well is more of a technical challenge, and the readily accessible options are limited. The first approach shown here can take advantage of RGL capabilities, so the plot can be grabbed with the mouse and rotated to find the most informative perspective. The second approach, using Plotly, has the interactive capability built in.
2 The R Environment
Several packages are required for the work in this document. They are loaded here, but comments/reminders are placed in some code chunks so that it is clear which package some of the functions come from.
Although several data sets are used in this document, one data set will be the primary exemplar and is described here. Howell’s table 9.2 (7th or 8th edition) (Howell, 2014) has an example of the relationship between stress and mental health as reported in a study by Wagner et al. (1988, as cited in the Howell text). This Health Psychology study examined a variable that was the subject’s perceived degree of social and environmental stress (called “stress” in the data set). There was also a measure of psychological symptoms based on the Hopkins Symptom Checklist (called “symptoms” in the data file). One interesting aspect of the variables in the data set is that they both have some positive skewness. With that in mind, some scatterplots are enhanced with marginal plots of the univariate distributions of both variables. Other scatterplots will help visualize the fact that the joint distribution of the two variables is not bivariate normal.
First, we will read the data file. The data are found in a .csv file called “howell_9_2.csv”.
The data frame is “attached” to make variable naming simpler/shorter in later functions.
# read the csv file and create a data frame called data1
data1 <- read.csv(file="data/howell_9_2.csv")
# notice that the stress variable was read as a "integer".
# We will change it to a numeric variable and will do the same for symptoms
str(data1)
'data.frame': 107 obs. of 3 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ stress : int 30 27 9 20 3 15 5 10 23 34 ...
$ symptoms: int 99 94 80 70 100 109 62 81 74 121 ...
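The conversion mentioned in the code comment, and an alternative to attaching the data frame, might look like the following minimal sketch. It is not the document's own chunk: the data frame `d10` is a hypothetical stand-in that holds only the first ten observations printed by `str()` above (the full file has 107 rows), and `r10` is an illustrative name.

```r
# Hypothetical stand-in: the first ten rows shown in the str() output above
d10 <- data.frame(
  id       = 1:10,
  stress   = c(30L, 27L, 9L, 20L, 3L, 15L, 5L, 10L, 23L, 34L),
  symptoms = c(99L, 94L, 80L, 70L, 100L, 109L, 62L, 81L, 74L, 121L)
)

# Convert the integer columns to numeric (double precision)
d10$stress   <- as.numeric(d10$stress)
d10$symptoms <- as.numeric(d10$symptoms)
str(d10)

# with() evaluates an expression inside the data frame,
# so variables can be named directly without calling attach()
r10 <- with(d10, cor(stress, symptoms))
```

Using `with()` avoids the name-masking problems that `attach()` can cause when several data frames share variable names, at the cost of slightly longer calls.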
Building on the work of Tufte (2001) and Cleveland (1984; 1994), we can review some of the basic characteristics of an xy scatterplot that reflect best practices for scientific graphing.
1. The data rectangle should be only slightly smaller than the scale/axis rectangle. This often implies that axis scales are truncated.
2. A visual indicator of axis truncation is desirable.
3. Scatterplots work best in a 1:1 aspect ratio, i.e., they should be square.
4. Care should be taken to find a good size for points in the scatterplot so that they are easily visible, yet not too prominent.
5. Tick marks should be limited in number to that necessary for visualization of the scale - too many tick marks and tick labels clutter the plot.
6. Font choices and sizes require care, as do line widths. Very often, default font sizes are too small. These choices depend on the medium in which the plot is displayed.
7. Use color judiciously - as an important element of the graph and not just decoration.
8. Show the data! (Tufte’s axiom). This implies things like rug plots for univariate dimensions.
The following images set up some basic components and permit understanding points 1 and 3.
First, we can define the scale rectangle:
Next, we define the data rectangle:
This next plot violates best practice #1 and is a poor figure. It ignores Tufte’s axiom: “show the most data in the smallest space with the least amount of ink”.
The basic graphing functions shown in this document fulfill these practices with varying degrees of success. The default algorithms in base R and ggplot2 do a fairly good job of points 1, 3, 4, and 7. ggplot2 does well with points 1, 3 and 4, but often needs work on the other points. Plotly routines are intended for broader data visualization purposes and often violate some of these principles, since the intended audience is not necessarily the scientific community. Scientific graphing practices are more conservative, especially with regard to colors and fonts. The goal is effective and efficient communication of information without distracting or misleading the reader, and without gratuitous decoration.
5 XY Scatterplots with base system graphics
One of the most rudimentary graphing functions in base R is the plot function. It has broad capabilities, but when two variables are passed to it, its default is to draw a scatterplot. The first variable passed to the function is the X axis variable. The rapidity of using this function makes it a prime candidate as the go-to EDA method for a quickly obtained scatterplot.
## Basic default scatterplot
## Also try to use RCmdr to draw the plot
plot(stress, symptoms)
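When the data frame has not been attached, the same default scatterplot can be requested through the formula interface of `plot`, which takes a `data` argument. The sketch below is illustrative only: `d10` is a hypothetical data frame built from the first ten observations printed earlier, and `pdf(NULL)` is used just so the example runs without opening a window or writing a file.

```r
# Hypothetical stand-in: first ten observations from the str() output
d10 <- data.frame(
  stress   = c(30, 27, 9, 20, 3, 15, 5, 10, 23, 34),
  symptoms = c(99, 94, 80, 70, 100, 109, 62, 81, 74, 121)
)

pdf(NULL)                            # null device: nothing is written to disk
plot(symptoms ~ stress, data = d10)  # formula is y ~ x, so stress is on the X axis
usr <- par("usr")                    # axis limits actually used: x1, x2, y1, y2
invisible(dev.off())
```

Note that the formula form reverses the argument order: `plot(x, y)` puts the first variable on the X axis, while `plot(y ~ x)` names the Y variable first.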
Many enhancements are possible within the plot function and associated functions that add elements to the base plot. Compare comments here to the code in the next code chunk. In that next graph, the axes are limited to a range that is specified to accommodate rug plots and a Y axis break indicator. Labels for the two axes can be tailored to needs. The character of the plotted points is controlled by the “pch” argument and the outline and background colors of the points can also be chosen (see the provided document on R colors). The important “cex” arguments control the sizing of graph elements. The point sizes are specified as 1.7 times the default size.
The aspect ratio of a scatterplot can be controlled by the “asp” argument. It was not used in this illustration because in R Markdown files it is better to control figure sizing within the {r} code chunk definition using “fig.width” and “fig.height”.
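One nuance may be worth a small sketch: in base graphics, `asp` fixes the ratio of Y data units to X data units, whereas a square plotting region regardless of the data ranges is requested with `par(pty = "s")`. The example below illustrates both under stated assumptions: `d10` is a hypothetical data frame holding the first ten observations printed earlier, and the 7 x 5 inch null device is chosen only to make the difference visible.

```r
# Hypothetical stand-in: first ten observations from the str() output
d10 <- data.frame(
  stress   = c(30, 27, 9, 20, 3, 15, 5, 10, 23, 34),
  symptoms = c(99, 94, 80, 70, 100, 109, 62, 81, 74, 121)
)

pdf(NULL, width = 7, height = 5)   # deliberately non-square null device

# Square plotting region, whatever the device shape
op <- par(pty = "s")
plot(d10$stress, d10$symptoms, xlab = "Stress", ylab = "Symptoms")
pin_square <- par("pin")           # plot region width and height in inches
par(op)                            # restore the previous setting

# asp = 1: one unit on X occupies the same physical length as one unit on Y
plot(d10$stress, d10$symptoms, asp = 1, xlab = "Stress", ylab = "Symptoms")
usr <- par("usr")
pin <- par("pin")
units_per_inch <- c(x = diff(usr[1:2]) / pin[1],
                    y = diff(usr[3:4]) / pin[2])
invisible(dev.off())
```

In an R Markdown chunk, setting equal `fig.width` and `fig.height` gives a square figure, though the plotting region itself is only approximately square unless the margins are also balanced.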
Once the basic plot is defined, several other functions permit additions onto that active plot. The abline function permits drawing a line. Here, the line is extracted from the linear regression object produced by the lm function and then drawn in a specific color. Rug plots can be added to specific axes. Stress is the X axis variable, so its rug could be added to either the bottom (side 1) or the top axis (side 3, as was done here); the rug for symptoms, the Y axis variable, is drawn on the right axis (side 4). Although not truly necessary with this data set, the jitter function was used with the stress variable to show how small amounts of jittering can be done with variables that have many overlapping/redundant values. The mtext function places a title on the graph in a preferred location.
A very useful attribute is controlled by the axis.break function. In cases where axes are truncated, it is always recommended to give the viewer a visual indicator of the fact that an axis does not extend to zero. The axis.break function, from the plotrix package, provides this capability and is used for the Y axis here (side 2). One can ask for a “zig-zag” break or a “double slash”. The former is requested here and placed on side 2 at a value of 52. The “double slash” will be illustrated in a later plot.
# Scatterplot with rug plots
plot(stress, symptoms,
     ylim=c(50,150), xlim=c(0,61),
     xlab="Stress", ylab="Symptoms",
     pch=21, col="white", bg="skyblue2", cex=1.7)
abline(lm(symptoms~stress), col="red") # regression line (y~x)
rug(jitter(stress, amount=.03), side=3, col="gray")
rug(symptoms, side=4, col="darkgray")
mtext("Scatterplot With Rugplots of IV and DV", side=3, outer=TRUE, line=-1)
# use a plotrix function to add axis breaks when axes are truncated
#library(plotrix)
axis.break(2, 52, style="zigzag")
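To make two of the steps above more concrete, the following sketch shows what `abline` takes from the `lm` object (an intercept and a slope) and how much `jitter` moves each value when `amount` is given. It is an illustration under assumptions, not the document's own code: `d10` is a hypothetical data frame with the first ten observations printed earlier, and `fit`, `b`, and `stress_jit` are illustrative names.

```r
# Hypothetical stand-in: first ten observations from the str() output
d10 <- data.frame(
  stress   = c(30, 27, 9, 20, 3, 15, 5, 10, 23, 34),
  symptoms = c(99, 94, 80, 70, 100, 109, 62, 81, 74, 121)
)

fit <- lm(symptoms ~ stress, data = d10)
b <- coef(fit)                     # named vector: "(Intercept)" and "stress"

pdf(NULL)                          # null device: nothing is written to disk
plot(symptoms ~ stress, data = d10)
abline(fit, col = "red")           # abline() reads the coefficients from the lm object
abline(a = b[["(Intercept)"]],     # the same line, with intercept and slope
       b = b[["stress"]], lty = 2) # passed explicitly

# jitter(x, amount = a) adds uniform noise drawn from [-a, a] to each value
set.seed(1)                        # only so the illustration is reproducible
stress_jit <- jitter(d10$stress, amount = 0.03)
rug(stress_jit, side = 3, col = "gray")
invisible(dev.off())
```

With `amount = 0.03` the displacement is tiny relative to the stress scale, which is why the document notes that jittering is not truly necessary for these data.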
6 Base R Scatterplots with marginal univariate distributions.
One can combine some aspects of univariate EDA with the bivariate scatterplot by drawing various visualizations in the margins of the scatterplot, using the same scales as the X and Y variables. The first of the two examples here is built manually; the second uses a ready-made function.
6.1 Adding Boxplots to the margins
In order to create this more complex graph, the graphics window is carved up into four areas using the par function. The main scatterplot occupies 80% of each of the x and y axes and the boxplots the remaining 20%. In the first par call, the “fig” argument indicates these 0-80 ranges for x and y and thus leaves the 80-100 percent range for the boxplots. The scatterplot is then drawn with plot and it is placed in this 80% x 80% square. Later par calls specify the locations for the boxplots within the remaining 20% rectangles. Each plot may require “fiddling” with the par fig specifications to obtain the best look, which is why the fig ranges used for the boxplots in the code overlap the scatterplot region. It is important to note that when the par values are coordinated for the first and later par calls, the univariate plots are placed in the correct/identical scaled locations for the variables. This is easy to see by looking at the outliers in the box plots and finding them in the scatterplot.
Note that ggplot and ggMarginal are probably better choices for this type of plot (see below) although axis breaks cannot be applied with those functions.
# Scatterplot with boxplots
par(fig=c(0,0.8,0,0.8))
plot(stress, symptoms,
     ylim=c(50,150), xlim=c(0,61),
     xlab="Stress", ylab="Symptoms",
     pch=21, col="white", bg="skyblue2", cex=1.7)
abline(lm(symptoms~stress), col="red") # regression line (y~x)
axis.break(2,52,style="zigzag")
par(fig=c(0,0.8,0.45,1), new=TRUE)
boxplot(stress, horizontal=TRUE, axes=FALSE)
par(fig=c(0.5,1,0,0.8), new=TRUE)
boxplot(symptoms, axes=FALSE)
mtext("Scatterplot with Univariate Boxplots", side=3, outer=TRUE, line=-3)
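As a point of comparison with the par(fig=...) approach, base R's `layout` function can divide the device into panels of unequal size. The sketch below is an alternative technique, not the method used in this document. It assumes a hypothetical data frame `d10` with the first ten observations printed earlier, and the axis ranges `xr` and `yr` and the margin values are illustrative choices. Passing the same limits and matching margins to the scatterplot and the boxplots should keep the marginal plots on the same scales as the scatterplot.

```r
# Hypothetical stand-in: first ten observations from the str() output
d10 <- data.frame(
  stress   = c(30, 27, 9, 20, 3, 15, 5, 10, 23, 34),
  symptoms = c(99, 94, 80, 70, 100, 109, 62, 81, 74, 121)
)
xr <- c(0, 40)     # illustrative axis ranges chosen to cover these ten rows
yr <- c(55, 125)

pdf(NULL)          # null device: nothing is written to disk
# Panel 1 = scatterplot (bottom left), 2 = top boxplot, 3 = right boxplot,
# 0 = empty corner; the scatterplot gets 4/5 of each dimension
n_panels <- layout(matrix(c(2, 0,
                            1, 3), nrow = 2, byrow = TRUE),
                   widths = c(4, 1), heights = c(1, 4))

par(mar = c(4, 4, 0.5, 0.5))
plot(d10$stress, d10$symptoms, xlim = xr, ylim = yr,
     xlab = "Stress", ylab = "Symptoms",
     pch = 21, col = "white", bg = "skyblue2", cex = 1.7)

par(mar = c(0, 4, 0.5, 0.5))       # same left/right margins as the scatterplot
boxplot(d10$stress, horizontal = TRUE, axes = FALSE, ylim = xr)

par(mar = c(4, 0, 0.5, 0.5))       # same bottom/top margins as the scatterplot
boxplot(d10$symptoms, axes = FALSE, ylim = yr)
invisible(dev.off())
```

In `boxplot`, `ylim` refers to the data axis even when `horizontal = TRUE`, which is why the stress range is passed as `ylim` for the top panel.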