The data set used in the first several chapters here comes from an experiment described in the textbook by Hays (1994), section 10.16, pg 399. The study examined whether asynchronous presentation of video and audio recordings of speakers impaired memory of what was viewed/heard. It is a one-factor design, 3 levels. In the synchronous condition (control) audio and visual recordings were synchronized. In the two asynchronous conditions the audio was slightly ahead of the visual image of the speaker’s lips (fast) or slightly behind the visual image (slow). This independent variable is called “factora” in the data set. The dependent variable (called dv in the data set) is number of words recalled from a list of 50 heard by the participant from the presentation of the recording.
2.1 Import and prepare the data set
The data set is found in a .csv file and has equal sample sizes of ten per group. It is thus a “balanced” design, and a “completely randomized” design.
Show/Hide Code
hays <-read.csv("data/hays1.csv",header=TRUE, stringsAsFactors = T)# better yet, implement the file.choose function so that # you don't have to change the default folder# hays <- read.csv(file.choose(),header=TRUE, stringsAsFactors=T)# note that the initial data frame sees the dv as # integer rather than numeric, so....hays$dv <-as.numeric(hays$dv)# show the structure of the data framestr(hays)
#hays# I will use the attach function here to simplify our code even though# current best practices in R recommend against using it.# The downside of using attach is not encountered in this illustrationattach(hays)
2.2 Numerical Summaries
Basic description of the DV as a function of factora is accomplished the the ‘describeBy’ function from psych.
Show/Hide Code
psych::describeBy(dv, group=factora, type=2)
Descriptive statistics by group
group: control
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 10 27.4 5.1 27.5 27.38 3.71 19 36 17 -0.04 -0.09 1.61
------------------------------------------------------------
group: fast
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 10 21.2 5.2 20.5 20.88 4.45 15 30 15 0.66 -0.65 1.65
------------------------------------------------------------
group: slow
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 10 21.8 2.53 22.5 22 2.22 17 25 8 -0.7 -0.29 0.8
2.3 Graphs of Data by Group
There are a great many ways to draw graphs to display data of the type found even in this simple 3-group design. R can produce them with varying degrees of complexity in the code. Several will be demonstrated here. Some of them are quick tactics to “get to know” your data. Others can be seen as close to publication quality.
2.3.1 Simple EDA Plots
First, we can quickly obtain a simple and very basic plot called a stripchart.
Show/Hide Code
stripchart(dv~factora, method="stack")
An alternative to the base system stripchart shown above is a simple dot plot. Although there are many ways to accomplish this in R, the dotplot function from the lattice package is simple to use.
It is always useful to obtain boxplots of the DV for the all groups in a 1-way design. With base system graphics it is possible to display data from multiple conditions on the same plot.
Show/Hide Code
boxplot(dv~factora, data=hays,col="lightgray",ylim=c(0,40),main="1-Way treatment study Box Plot Illustration", xlab="Treatment Group",ylab="Mean Words Recalled")
What if we want to truncate the axis and show an axis break? (can’t do in ‘ggplot2’)? Let’s do it for this boxplot. The axis.break function from the plotrix package accomplishes this.
Show/Hide Code
#library(plotrix) # to obtain the axis.break functionboxplot(dv~factora, data=hays,col="lightgray",ylim=c(8,40),main="1-Way treatment study Box Plot Illustration", xlab="Treatment Group",ylab="Mean Words Recalled - note axis truncation")plotrix::axis.break(2,8,style="slash")
One of Tufte’s axioms is “show the data.” Here, using the beeswarm package, we can draw a box plot with raw data points overlaid. Note that the beeswarm function could have been executed after either of the boxplot graphs drawn above as well. Here it produces the simplest/rudimentary boxplot.
In many situations it is helpful to visualize each group’s DV distribution with a frequency histogram. The data set used in this document has small sample size per group, so the frequency histogram is not likely to be useful beyond what can be seen with the simpler stripchart above. Nonetheless, it is useful to put code in place as a template for use in situations where sample size is larger. The approach taken here uses ‘ggplot2’ to create a plot that employs “small multiples”, where different panels are drawn for each group. It is also possible to create a layout of histograms with the ‘layout’ function for base system graphics, and that approach has already been demonstrated in prior R tutorials.
Here, I kept the ggplot fairly simple so as to be useful for quick EDA.
Show/Hide Code
library(ggplot2)# first, the core xy plot specshbase <- ggplot2::ggplot(hays, aes(x = dv))# now the base plothbase1 <- hbase +geom_histogram(aes(y=..density..), # change y axis to density bins=8, colour="black", fill="white") +geom_density()# now add the specification that creates separate panels for eaach grouphbase2 <- hbase1 +facet_grid(factora ~ .)hbase2
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
2.3.3 Bar Graphs with Error Bars
The applied sciences rely heavily on a type of bar plots of means, with std errors (or CI’s, or SD’s) displayed as “whiskers”. There is a bit of cottage industry on the web that is heavily critical of these types of graphs, focusing on the fact that information is lost when the data are presented this way. Just do a google search on “dynamite plots” and you will find many blog posts and textbook sections that argue for doing away with them.
Perhaps this is why they are not always simple to obtain in R. Judicious use of them as summary graphs for “ANOVA” types of analyses strikes me as OK. But if one finds that the data have issues of skewness, outliers, or heteroscedasticity, then “showing the raw data points” might be a better approach (c.f. Tufte) - adding raw data onto the bar graph as outlined in the “scientific graphing practices” part of the course.
Notwithstanding this criticism, I have generated some examples of how to draw these graphs in R. I still feel that commercial graphing software such as SigmaPlot or Origins would be a better choice for publication quality graphs of these types.
In the ‘psych’ package, the error.bars.by function gives
a good plot with 95% CI errors as the default.
Other % CI’s can be specified - read the help file: ?error.bars.by
colors of the bars are chosen simply to illustrate the capability.
Use of varying colors would probably not a good choice for either presentation or publication - refer to the BCD discussions of scientific graphing practices.
Show/Hide Code
psych::error.bars.by(hays$dv,hays$factora,bars=TRUE, ylim=c(0,35),main="1-Way treatment study \n Means +/- 95%CI",xlab="Treatment Group",ylab="Mean Words Recalled",labels=c("control","fast","slow"),#colors=c("lemonchiffon","forestgreen", "springgreen4"))colors=c("lightgray","lightgray", "lightgray"))
I found, in the asbio package, a function to generate bar graphs +/- either std errors or CI’s. Here is a graph with means +/- std errors.
Show/Hide Code
#require(asbio)asbio::bplot(dv,factora,int="SE",xlab="Treatment Group", ylab="Mean Number of Words", print.summary=F)
Perhaps a nicer way is available from the from ‘sciplot’ package. default gives +/- 1 std error:
Show/Hide Code
#require(sciplot)#win.graph() # or quartz() or x11()sciplot::bargraph.CI(factora,dv,lc=TRUE, uc=TRUE,legend=T,cex.leg=1,bty="n",col="gray75",ylim=c(0,33),ylab="Mean DV Score",main="Base 1-Way Design Illustration",cex.names=1.25,cex.lab=1.25)box()
Show/Hide Code
#axis(4,labels=F)
Next, we can put an axis break on the Y axis to give a visual indicator of the scale break.
Show/Hide Code
# use the axis.break function from plotrix to give the slasheslibrary(plotrix)#win.graph() # or quartz() or x11()sciplot::bargraph.CI(factora,dv,lc=TRUE, uc=TRUE,legend=T,cex.leg=1,bty="n",col="gray75",ylim=c(6,33),ylab="MEAN DV Score",main="Base 1-Way Design Illustration",cex.names=1.25,cex.lab=1.25)box()#axis.break(4,7,style="slash")plotrix::axis.break(2,7,style="slash")
2.3.4 GGPLOT2 can draw boxplots or bar graphs with error bars
The ggplot2 package is a go-to package for visualizations in the data science universe. It is broadly capable and has a different style of programming than base system graphics. It builds on a “grammar of graphics” philosophy, coming from Wilkinson, that is well aligned with object oriented programming. In separate tutorials, basics of ggplot2 programming are taught.
Initally, we will build a boxplot and then a series of bar graphs. Many of the stylistic attributes of a ggplot graph are controlled by the “theme” function. A later section reviews a few alternatives, but for this graph, I prefer a minimal suite of style attributes. This is also a non-standard boxplot in that I added an indicator of the mean of each group since boxplots traditionally only display the median. The blue point indicates the mean. By traditional boxplot definitions, we see no outliers - remembering the relatively small sample sizes, this is not surprising.
Show/Hide Code
pbox <- ggplot2::ggplot(hays, aes(x=factora, y=dv)) +geom_boxplot() +stat_summary(fun = mean, geom ="point",shape =16, size =3.5, color ="blue") +xlab("Treatment Group") +ylab("Number of Words") +ggtitle(" Boxplot of Number of Recalled Words\n Means are added with blue point") +theme_minimal() +theme(text=element_text(size=12))pbox
With the traditional bar graph for displaying group means, a decision has to be made about inclusion of error bars. The primary tradition is display of +/- one standard error of the mean, but this prompts some thought about just what the purpose of error bars on a graph is this type is. Some discussion of that takes place in the scientific graphing practices section of the course and a literature is found in the toolkit bibliography as well.
If we are going to plot means and add error bars, the approach requires some preliminary work to establish the means, std errors, sd’s, and CI’s to be used. By default the CI is a 95% range but the value can be changed.
The summarySE function from Rmisc produces a data frame that has the summary stats by group.
ggplot cannot work directly on the data frame to produce this type of plot. We need the summary statistics first.
Show/Hide Code
# now use summarySE on our datahays_summ <- Rmisc::summarySE(hays, measurevar="dv", groupvars="factora",conf.interval=.95)str(hays_summ)
'data.frame': 3 obs. of 6 variables:
$ factora: Factor w/ 3 levels "control","fast",..: 1 2 3
$ N : num 10 10 10
$ dv : num 27.4 21.2 21.8
$ sd : num 5.1 5.2 2.53
$ se : num 1.61 1.65 0.8
$ ci : num 3.65 3.72 1.81
Show/Hide Code
# rename the column that contains the mean to something more less confusingcolnames(hays_summ) <-c("factora", "N", "mean", "sd", "sem", "CI" )# look at the productknitr::kable(hays_summ)
factora
N
mean
sd
sem
CI
control
10
27.4
5.103376
1.613829
3.650735
fast
10
21.2
5.202564
1.645195
3.721690
slow
10
21.8
2.529822
0.800000
1.809726
Initially, one should consider displaying the raw data points. This is the best way of having a sense of the dispersion of the DV within each group and follows Tufte’s axiom: “Show the Data”. The first code chunk defines the base bar graph and the succeeding one adds on another layer that contains the data points. Note that I changed the theme to one that removes the grid background found above in the boxplot.
Show/Hide Code
p1 <- ggplot2::ggplot(hays_summ, aes(x=as.factor(factora), y=mean)) +geom_bar(position=position_dodge(), stat="identity", fill="gray", width=.5) +xlab("Treatment Group") +ylab("Mean Number of Words") +theme_classic() +theme(text=element_text(size=12))#p1
Show/Hide Code
p2 <- p1 +geom_point(data=hays, aes(x=factora, y=dv))+ggtitle("The Effect of Visual and Auditory\n Stimulus Synchrony on Performance\n Means plus raw data points")p2
Next, the same base plot is used for the bars, but standard errors of the mean are added, +/- one standard error.
Show/Hide Code
# add on std error bars#win.graph() # or quartz() or x11()p3 <- p1 +geom_errorbar(aes(ymin=mean-sem, ymax=mean+sem),width=.2, # Width of the error barsposition=position_dodge(.9)) +ggtitle(" The Effect of Visual and Auditory\n Stimulus Synchrony on Performance\n Means +/- SEM")p3
A more finished ggplot2 graph might look like this:
Show/Hide Code
p3base <- ggplot2::ggplot(hays_summ, aes(x = factora, y = mean)) +geom_bar(position =position_dodge(),stat ="identity",fill ="gray",colour ="black",width=.5 ) +geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem),width = .2,# Width of the error barsposition =position_dodge(.9)) +scale_y_continuous(breaks =0:30*5) +xlab("Treatment Group") +ylab("Mean Number of Words") +ggtitle("The Effect of Visual and Auditory\n Stimulus Synchrony on Performance\n Means +/- SEM")p3base +theme_bw()
The reader might have noticed a call in this last ‘ggplot2’ graph called theme_bw(). ‘ggplot2’ can handle many different themes available in add-on packages. I use one called ggthemes.
Many other themes are available from the ggthemes package. see [https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html]
The reader might try any one of the following:
Show/Hide Code
p3base +theme_tufte()p3base +theme_grey()p3base +theme_dark()p3base +theme_economist()p3base +theme_excel() # very uglyp3base +theme_few() # from Stephen Fewp3base +theme_fivethirtyeight()p3base +theme_gdocs()p3base +theme_hc()p3base +theme_solarized()p3base +theme_stata()p3base +theme_wsj()p3base +theme_pander()
Sometimes, error bars are chosen to summarize the dispersion of the DV rather than as a noise estimator of the sample mean. One option (in addition to showing the raw data) is to use standard deviations from each group to form the error bars. The error bars extend up and down one standard deviation.
Show/Hide Code
# add on std error barsp4 <- p1 +geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd),width=.2, # Width of the error barsposition=position_dodge(.9)) +ggtitle(" The Effect of Visual and Auditory\n Stimulus Synchrony on Performance\n Means +/- SD")p4
A recommended approach is to show Confidence Intervals rather than SEM’s (see the toolkit bibliography section on confidence intervals). This is easy enough to accomplish since the summarySE function also computed the half width of a confidence interval so that it can be used in plotting the “error bars”.
Show/Hide Code
p5 <- p1 +geom_errorbar(aes(ymin=mean-CI, ymax=mean+CI),width=.2, # Width of the error barsposition=position_dodge(.9)) +ggtitle(" The Effect of Visual and Auditory\n Stimulus Synchrony on Performance\n Means plus 95%CI")p5
2.3.5 Miscl: Dotplots, Violin plots
One useful plot that we have seen before combines the violinplot (showing kernel density curves) with either a boxplot or, in this case, a dotplot. The violinplot is best used when sample sizes are a bit larger than the n=10 in this hays data set. Nonetheless, the capability is illustrated. This figure is generated using ggplot2 techniques.
I have become fond of a newer style of graph called a Rain Cloud Plot. It provides multiple views of the data, including a display of the raw data points. There are several ways to obtain Rain Cloud Plots and more are being developed all the time. The ones I prefer use ggplot2 and a few addons to the ggplot ecosystem. Typically they combine dotplots of raw data with boxplots and kernel density displays. Like many of the EDA techniques in this chapter, the small sample size of this textbook data set makes these plots with density functions less than highly useful.
The first is a simple set of code that uses ggplot add-ons from the ggdist and gghalves packages. I left it as black and white and shades of grey to make it nearly publication ready. I added the theme control for stylistic purposes.
Show/Hide Code
ggplot2::ggplot(hays, aes(factora, dv)) + ggdist::stat_halfeye(adjust = .5, width = .3, .width =0, justification =-.3, point_colour =NA) +geom_boxplot(width = .1, outlier.shape =NA) + gghalves::geom_half_point(side ="l", range_scale = .4, alpha = .5) +theme_classic()+theme(text=element_text(size=14), #change font size of all textaxis.text=element_text(size=14), #change font size of axis textaxis.title=element_text(size=14), #change font size of axis titlesplot.title=element_text(size=14), #change font size of plot titlelegend.text=element_text(size=14), #change font size of legend textlegend.title=element_text(size=14)) #change font size of legend title
Another approach involves uses a geom from the ggrain package. The horizontal layout may be preferred and this is the origin of the “raincloud” lable where the data points appear to fall from the raincloud (density function). A relatively non-offensive color scheme is chosen.
Show/Hide Code
ggplot2::ggplot(hays, aes(factora, dv, fill = factora)) + ggrain::geom_rain(alpha = .5, boxplot.args.pos =list(width =0.05, position =position_nudge(x =0.13)),violin.args.pos =list(side ="r",width =0.7, position =position_nudge(x =0.2))) +theme_classic() +scale_fill_brewer(palette ='Dark2') +guides(fill ='none', color ='none') +coord_flip()
Another is that I like is from the sadmr package. It also combines a boxplot, a kernel density plot and a jittered stripchart of the raw data.
2.3.7 Combining multiple ggplot figures into one layout
The next plot is an illustration of combining multiple graphs into one layout. First, we will use the hays data set, for continuity. But it is not very useful to draw histograms or kernel density functions with such a low sample size, so a second illustration uses the “cereals” data set.
Show/Hide Code
# Modeled after https://dmyee.files.wordpress.com/2016/03/advancedggplot.pdf#Histogram of DV, by treatment conditionp1<-ggplot2::ggplot(data = hays, aes(x = dv, fill=factora)) +geom_histogram(binwidth = .1) +scale_colour_grey() +scale_fill_grey() +xlab("Mean Number of Words") +ylab("Count") +ggtitle("Histograms, by treatment group") +theme_minimal() +theme(plot.title =element_text(size=10, face ="bold", hjust =1))# Boxplots of DV, by treatment conditionp2<-ggplot2::ggplot(data = hays, aes(x = factora, y = dv, fill=factora)) +geom_boxplot() +xlab("Treatment Group") +ylab("Number of Words") +scale_colour_grey() +scale_fill_grey() +ggtitle("Boxplots of DV by Treatment Group") +theme_minimal() +theme(plot.title =element_text(size=10, face ="bold", hjust =1))# Violin plots of DV, by treatment conditionp3<-ggplot2::ggplot(data = hays, aes(x = factora, y = dv, fill=factora)) +geom_violin(alpha=.25, color="gray") +geom_jitter(alpha=.5, aes(color=factora), position=position_jitter(width=0.3)) +scale_colour_grey() +scale_fill_grey() +coord_flip() +xlab("Treatment Group") +ylab("Number of Words") +ggtitle("Violin plots of DV by Treatment Group") +theme_minimal() +theme(plot.title =element_text(size=10, face ="bold", hjust =1))# Creating a matrix that defines the layout # (not all graphs need to take up the same space)lay <-rbind(c(1,2),c(3,3))# Plotting the plots on a gridgridExtra::grid.arrange(p1, p2, p3, ncol=2, layout_matrix=lay)
The second illustration uses the “cereals” data set that has been explored for previous course work with SPSS. Here, I chose the dependent variable to be the “Healthiness” rating of each cereal type and the categorical/grouping factor is shelf that the cereal was found on in the supermarket. This second illustration also introduces a way to control color for “fills”, by group, and uses a colorblind friendly palette of colors.
# Modeled after https://dmyee.files.wordpress.com/2016/03/advancedggplot.pdf#Histograms of "Healthiness" Rating" by Shelf#library("RColorBrewer")cbPalette <-c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")p1<-ggplot2::ggplot(data = cereals, aes(x = rating, fill=shelf)) +geom_histogram(binwidth = .1) +scale_fill_manual(values=cbPalette) +#scale_fill_brewer(palette="Paired") +#scale_colour_grey() + scale_fill_grey() +xlab("Rating") +ylab("Count") +ggtitle("Histograms, by Shelf") +theme_minimal() +theme(plot.title =element_text(size=10, face ="bold", hjust =1))# Boxplots of "Healthiness" Rating" by Shelfp2<-ggplot(data = cereals, aes(x = shelf, y = rating, fill=shelf)) +geom_boxplot() +xlab("Shelf") +ylab("Rating") +scale_fill_manual(values=cbPalette) +#scale_fill_brewer(palette="Paired") +#scale_colour_grey() + scale_fill_grey() +ggtitle("Boxplots of Rating by Shelf") +theme_minimal() +theme(plot.title =element_text(size=10, face ="bold", hjust =1))# Violin plots of "Healthiness" Rating" by Shelfp3<-ggplot2::ggplot(data = cereals, aes(x = shelf, y = rating, fill=shelf)) +geom_violin(alpha=.25, color="gray") +geom_jitter(alpha=.5, aes(color=shelf), position=position_jitter(width=0.3)) +scale_fill_manual(values=cbPalette) +scale_colour_manual(values=cbPalette) +#scale_fill_brewer(palette="Paired") +#scale_colour_grey() + scale_fill_grey() +coord_flip() +xlab("Shelf") +ylab("Rating") +ggtitle("Violin plots of Rating by Shelf") +theme_minimal() +theme(plot.title =element_text(size=10, face ="bold", hjust =1))# Creating a matrix that defines the layout # (not all graphs need to take up the same space)lay <-rbind(c(1,2),c(3,3))# Plotting the plots on a gridgridExtra::grid.arrange(p1, p2, p3, ncol=2, layout_matrix=lay)
2.3.8 A Caution about Error Bars and Confidence Intervals for Visualizing Inference
There is a common myth that in group comparison studies such as this that non-overlap of 95% CIs between a pair of groups indicates significance with a traditional two-sample test and that overlap indicates non-significance. The reality is different and a discussion of this is found in other course materials and R tutorials. While the former may be true in most circumstances, the latter is not when using a per comparison alpha of .05.
The issue raises the important point of just what the goal of data display is. If to provide an indicator of dispersion, then raw data points or error bars representing standard deviations might be preferred. If an indicator of sampling variation related to the mean is desired, then error bars as SEM’s or CI’s would be preferred. If the goal is “inference by eye”, then some modification of the CI strategy might be appropriate and that literature has been referred to above. More sophisticated graphs can include information about inference as well. One example of this is an easily produced plot from the ggbetweenstats function that is found in the ggstatsplot package. This graph may be too cluttered to be suitable for publication, but it is an example of how simple use of R functions can accomplish many things to facilitate exploratory data analysis and rudimentary inference.
Show/Hide Code
set.seed(123)p6 <- ggstatsplot::ggbetweenstats(data = hays,x = factora,y = dv,var.equal=T,p.adjust.method="fdr",#ggtheme="classical",title ="Distribution of Words Recalled across Treatment Condition")p6
Allaire, J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., … Iannone, R. (2018). Rmarkdown: Dynamic documents for r. Retrieved from https://CRAN.R-project.org/package=rmarkdown
Hays, W. L. (1994). Statistics (5th ed., pp. xviii, 1112 p.). Book, Fort Worth: Harcourt College Publishers.
Howell, D. C. (2013). Statistical methods for psychology (8th ed., pp. xix, 770 p.). Book, Belmont, CA: Wadsworth Cengage Learning.