This document is always undergoing revision. The content is accurate, but the style and layout are still being refined and new sections added.

1 The R Environment

See the reproducibility section at the end of this document for the versions of R and the packages used to build it.

Functions from several R packages are required:


The kable function from knitr and the gt function from gt are used to produce nicely formatted tables. The dplyr package provides tools for data wrangling.

2 Background

The goal of this document is to provide a fairly comprehensive overview of basic linear modeling in R with a bivariate system of two quantitative variables - with a few extensions to more than two variables. See an accompanying document for linear modeling with larger numbers of IVs in the multiple regression framework. It can serve as a stand-alone document for researchers at an early stage of training in correlation/regression theory who are simultaneously learning R. However, it is intended for use by APSY510 students at the University at Albany who have already worked through the conceptual and computational foundations of bivariate correlation and simple regression. In addition, students will probably have worked through an SPSS implementation of these basics. This introduction to linear modeling in R accomplishes all the basics implemented in SPSS plus a few additional things.

The document contains R code and its output (text, tables, and graphics), along with explanatory text.

3 The Data

Several data sets are used, but one from the Howell textbook (Howell, 2014) is the primary data set. Howell's Table 9.2 (downloaded from the text's website) has an example of the relationship between stress and mental health as reported in a study by Wagner et al. (1988). This Health Psychology study examined a variable that was the subject's perceived degree of social and environmental stress (called "stress" in the data set). There was also a measure of psychological symptoms based on the Hopkins Symptom Checklist (called "symptoms") in the data file.

First, we will import the data file. The data are found in a .csv file called “howell_9_2.csv”.

# read the csv file and create a data frame called data1
data1 <- read.csv(file="data/howell_9_2.csv")
# examine the structure of the data frame; stress and symptoms
# were read as integers, so we change them to numeric below
str(data1)
## 'data.frame':    107 obs. of  3 variables:
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ stress  : int  30 27 9 20 3 15 5 10 23 34 ...
##  $ symptoms: int  99 94 80 70 100 109 62 81 74 121 ...
data1$stress <- as.numeric(data1$stress)
data1$symptoms <- as.numeric(data1$symptoms)
# attach the data file to make variable naming easier
attach(data1)
# show the first and last few lines
id stress symptoms
1 30 99
2 27 94
3 9 80
4 20 70
... ... ...
104 19 109
105 34 104
106 9 86
107 27 97

4 Exploratory Data Analysis

All analyses should begin with EDA of each variable. We will do several things here. First, we will examine univariate characteristics of each of the two variables using the explore function from bcdstats.

4.1 EDA for the stress variable

explore(stress)
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 107 21.29 12.49     20   20.49 11.86   1  58    57 0.64    -0.08 1.21

The stress variable doesn't look very much like it came from a normal distribution. It has a bit of positive skewness. The normal QQ plot from the explore function can also show this departure from normality, as does the kernel density function. But let's introduce another graphing function here to reinforce this perspective. The qqPlot function in the car package gives us a nice normal probability plot that has "confidence bands". This permits evaluation of the magnitude of deviation from normality by asking how many data points fall outside the 95% bands. For the stress variable, the number of such points is actually smaller than the 5% of the 107 cases we might expect. So, the departure from normality is not severe.

car::qqPlot(stress, id=F, distribution="norm")

4.2 EDA for the symptoms variable

explore(symptoms)
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 107 90.33 18.81     88   88.87 17.79  58 147    89 0.77      0.6 1.82

And the qqPlot application to the symptoms variable…

car::qqPlot(symptoms, id=FALSE, distribution="norm")

Symptoms may be a bit more positively skewed, but neither variable is a horrible departure from normality. Nonetheless, we may want to pay careful attention to the assumptions for the tests of correlation and regression below.

5 Bivariate Scatterplots

There are many ways in R to generate bivariate scatterplots. Another document on "Scatterplots in R" has been created to show many of these possibilities; it can be found among the other class materials. Here, we will use one interesting approach with base system graphics that gives the scatterplot, rugplots for each variable, and boxplots in the respective margins for the two variables.

A scatterplot can be drawn with the most basic of R graphing functions, the plot function. Everything else shown here is an add-on and not really necessary. Nonetheless, this additional complexity of the figure creates a nice product that shows a great deal of information. PLEASE REALIZE that you can draw a similar graph of any two variables (your own data) simply by changing the variable names to those of your data set (and the axis and graph title labels).

par(fig=c(0,0.8,0,0.8))
plot(stress, symptoms,
     xlab="Perceived Stress",
     ylab="Mental Health Symptoms")
rug(stress, side=1)
rug(symptoms, side=2)
abline(lm(symptoms~stress, data=data1), col="red") # regression line (y~x)
# place the univariate boxplots in the top and right margins
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(stress, horizontal=TRUE, axes=FALSE)
par(fig=c(0.65,1,0,0.8), new=TRUE)
boxplot(symptoms, axes=FALSE)
mtext("Scatterplot & Linear Regression Plus Univariate Boxplots", side=3, outer=TRUE, line=-3)

6 Bivariate Correlation: Covariance and the Pearson Product-Moment Correlation Coefficient

R provides basic functions for covariance and correlation, as well as a test of the Pearson product-moment correlation using the t-test that we learned.

First, the covariance… We can pass the two variables of interest, or we could submit a data frame with a larger number of variables and a variance-covariance matrix will be returned.

cov(symptoms, stress)
## [1] 123.2817

And the Pearson product-moment correlation coefficient…

cor(symptoms, stress)
## [1] 0.5247429

The cor.test function gives a test of a two-tailed alternative hypothesis. One-sided tests can be specified by changing the "alternative" argument. In addition, tests of non-parametric correlation coefficients (Kendall's tau and Spearman's rho) can also be specified. See the help page (?cor.test) or a later section of this document.

cor.test(symptoms, stress)
##  Pearson's product-moment correlation
## data:  symptoms and stress
## t = 6.3165, df = 105, p-value = 6.559e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3719397 0.6498580
## sample estimates:
##       cor 
## 0.5247429
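As an illustration of those arguments, a one-sided, non-parametric version of the test might look like the sketch below. This uses the built-in mtcars data (heavier cars should have lower mpg, so the directional alternative is "less"); exact=FALSE simply avoids the exact-p warning when ties are present.

```r
# a one-tailed Spearman rank correlation test via cor.test
res <- cor.test(mtcars$mpg, mtcars$wt,
                alternative = "less",    # H1: rho < 0
                method = "spearman", exact = FALSE)
res$estimate   # Spearman's rho (negative here)
res$p.value    # one-tailed p-value
```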

To illustrate using cov and cor to produce Variance-Covariance and Correlation matrices, we use the built-in R data set called mtcars. It is a data set on performance specs of several makes of automobiles.

mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

To keep the illustration simple, let's extract a few of the quantitative variables from this data set - some are more reasonably viewed as factors and we will leave them out. The select function from dplyr permits specification of a subset of columns from mtcars. The commented-out line of code shows how to do the same thing with base R code (see the accompanying tutorial document on subsetting in R for more detail).

mtcars2 <- dplyr::select(mtcars, mpg, disp, hp, drat, wt)
#mtcars2 <- mtcars[,c(1,3:6)]
mpg disp hp drat wt
21 160 110 3.9 2.62
21 160 110 3.9 2.88
22.8 108 93 3.85 2.32
21.4 258 110 3.08 3.21
... ... ... ... ...
15.8 351 264 4.22 3.17
19.7 145 175 3.62 2.77
15 301 335 3.54 3.57
21.4 121 109 4.11 2.78

Now we can see how cov and cor handle larger numbers of variables. The result from the cov function is a variance-covariance matrix with variances of the variables on the leading diagonal and covariances off-diagonal. Note that it is a symmetrical matrix.

I like the gt function for making nicer tables in markdown docs, but it needs to work on tables or data frames, not a matrix - so I converted the covariance matrix to a data frame.

# use the round function to control the number of decimal places displayed.
cov2 <- round(cov(mtcars2),digits=3)
# check that the diagonal contains variances
mpg disp hp drat wt
36.324 -633.097 -320.732 2.195 -5.117
-633.097 15360.800 6721.159 -47.064 107.684
-320.732 6721.159 4700.867 -16.451 44.193
2.195 -47.064 -16.451 0.286 -0.373
-5.117 107.684 44.193 -0.373 0.957

A correlation matrix is a standardized covariance matrix, and is also symmetrical.

# use the round function to control the number of decimal places displayed.
cor2 <- round(cor(mtcars2),digits=2)
mpg disp hp drat wt
1.00 -0.85 -0.78 0.68 -0.87
-0.85 1.00 0.79 -0.71 0.89
-0.78 0.79 1.00 -0.45 0.66
0.68 -0.71 -0.45 1.00 -0.71
-0.87 0.89 0.66 -0.71 1.00
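That "standardized covariance" relationship can be checked directly: each correlation is the corresponding covariance divided by the product of the two standard deviations, and base R's cov2cor function applies exactly that rescaling to a covariance matrix. A quick sketch with the same mtcars variables:

```r
# r_ij = cov_ij / (sd_i * sd_j); cov2cor rescales a covariance
# matrix by its diagonal to produce the correlation matrix
vars <- c("mpg", "disp", "hp", "drat", "wt")
V <- cov(mtcars[, vars])
R <- cor(mtcars[, vars])
all.equal(cov2cor(V), R)  # TRUE
```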

7 Simple Regression

Prediction of symptoms from stress levels uses an R function called lm. Much like the SPSS regression procedure, it is capable of expanding to larger numbers of IVs in multiple regression and other uses. Here, we simply specify symptoms as the DV and stress as the IV, using a model syntax that you have seen before, built around the tilde symbol (~).

Unlike SPSS, the lm function itself only generates the model object. Other functions are required to obtain the detailed summary information. The two additional basic functions are anova and summary. More detailed work will follow below the basics presented here in this section.

The summary function produces a table with the estimates of the regression coefficients and their standard errors, along with the t-tests for both the intercept and the slope. The test of the intercept is a test of the null hypothesis that the intercept is zero. The residual standard error is the square root of MSresidual (see the ANOVA summary table below).

# fit the model and save it as an object that we choose to call fit1
fit1 <- lm(symptoms~stress)
# summarize the model
summary(fit1)
## Call:
## lm(formula = symptoms ~ stress)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.270 -11.222  -0.778   7.037  47.421 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  73.5066     3.0838  23.837  < 2e-16 ***
## stress        0.7901     0.1251   6.317 6.56e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 16.09 on 105 degrees of freedom
## Multiple R-squared:  0.2754, Adjusted R-squared:  0.2685 
## F-statistic:  39.9 on 1 and 105 DF,  p-value: 6.559e-09
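The claim that the residual standard error is the square root of MSresidual can be verified numerically. This small sketch uses the built-in mtcars data rather than the Howell data, so it runs stand-alone:

```r
# summary()'s "Residual standard error" (sigma) equals sqrt(MS_residual)
fit <- lm(mpg ~ wt, data = mtcars)
ms_res <- anova(fit)["Residuals", "Mean Sq"]
all.equal(summary(fit)$sigma, sqrt(ms_res))  # TRUE
```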

We will need to request confidence intervals for the regression coefficients. The confint function provides one for the intercept as well.

#obtain a 95%CI for the regression coefficients
confint(fit1, level=0.95)
##                  2.5 %    97.5 %
## (Intercept) 67.3920818 79.621162
## stress       0.5420636  1.038087

The anova function gives the SS partitioning and the same basic ANOVA summary table we saw in SPSS. Note that the F value is the same as above and is the square of the t for the test of the regression coefficient.

anova(fit1)
## Analysis of Variance Table
## Response: symptoms
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## stress      1  10325 10324.6  39.899 6.559e-09 ***
## Residuals 105  27171   258.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
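The F = t² identity for a one-predictor model is easy to confirm for yourself; here is a sketch with the built-in mtcars data:

```r
# with a single predictor, the model F equals the squared t of the slope
fit <- lm(mpg ~ wt, data = mtcars)
t_slope <- coef(summary(fit))["wt", "t value"]
F_model <- anova(fit)["wt", "F value"]
all.equal(F_model, t_slope^2)  # TRUE
```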

7.1 Diagnostic plots for Residual assumptions

R provides a direct and simple method to obtain several important diagnostic plots for regression objects. The two figures on the left are similar to those we obtained from SPSS. The two on the right involve things we will get to next semester. The plot function is aware of model objects that are lm fits. It “knows” to generate the following four types of graphs simply by passing the lm fit object to plot. The first plot suggests the presence of some heteroscedasticity. With larger yhat values, the spread in the residuals is slightly larger. We will return to this in a later section, and look at tests of homoscedasticity. The normal QQ plot does suggest some non-normality of the residuals and we will also examine this further in the later sections.

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit1)

7.2 We can extract the residuals and yhats from the model object and work with them directly

In SPSS we viewed the scatterplot of residuals against predicted values in standardized form rather than the raw scale form shown just above. We might also want to obtain a frequency histogram of the residuals, and perhaps more sophisticated plots to evaluate them. This section shows how to extract residuals and yhats from a regression model fit object, a method to standardize them, and their use in plots to evaluate the normality and homoscedasticity assumptions.

## work with residuals and predicted scores a bit more
# create new vectors for residuals/yhats, also standardized
# and then add them to the original data frame by
# using the cbind function
fit1.resid <- residuals(fit1)
fit1.pred <- predict(fit1)
# we can use the scale function in R to
# center and standardize variables.  Here, I standardize.
fit1.zresid <- scale(fit1.resid, scale=T)
fit1.zpred <- scale(fit1.pred, scale=T)
detach(data1)  # need to detach it before adding in the new vars
# use the "column bind" function to add the newly created variables to the original data frame.
data2 <- cbind(data1,fit1.resid,fit1.pred,fit1.zresid,fit1.zpred)

Next, examine the new data frame which now contains the raw and standardized residuals and yhats (predicted scores).

# look at the first few lines of the new data frame
# attach it for use below
attach(data2)
head(data2)
## The following objects are masked _by_ .GlobalEnv:
##     fit1.pred, fit1.resid, fit1.zpred, fit1.zresid
##   id stress symptoms  fit1.resid fit1.pred fit1.zresid fit1.zpred
## 1  1     30       99   1.7911200  97.20888  0.11187296  0.6972958
## 2  2     27       94  -0.8386541  94.83865 -0.05238215  0.4571328
## 3  3      9       80  -0.6172992  80.61730 -0.03855637 -0.9838455
## 4  4     20       70 -19.3081272  89.30813 -1.20598139 -0.1032477
## 5  5      3      100  24.1231525  75.87685  1.50672681 -1.4641716
## 6  6     15      109  23.6422492  85.35775  1.47668969 -0.5035194

Now we can use those newly extracted/created variables to draw useful plots and perform other analyses. We have seen each of these types of plots previously. Taken together, they suggest a possible normality issue with the residuals. Some heteroscedasticity might also be present, since the spread of the residuals is a bit higher at higher yhat values.

layout(matrix(c(1,3,2,4,5,6),2,3)) # optional multiple graphs/figure
boxplot(fit1.pred, xlab="Yhat", col="lightgrey")
boxplot(fit1.resid, xlab="Residuals", col="lightgrey")
hist(fit1.zresid, prob=TRUE, main="")
lines(density(fit1.zresid), col="red", lwd=3)
hist(fit1.zpred, prob=TRUE, main="")
lines(density(fit1.zpred), col="red", lwd=3)

The histogram of the residuals shows some slight positive skew (.56, seen from the numerical calculation below). A more interesting plot for examining the fit of the residuals to a normal distribution overlays a normal distribution curve onto the frequency histogram. This plot examines the standardized residuals, but the raw residuals would serve just as well. The simple rcompanion function used here is useful in many other circumstances as well.
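The rcompanion function in question is, presumably, plotNormalHistogram; the same picture can be drawn in base R without the package dependency. A minimal sketch (using simulated standardized residuals so it runs stand-alone):

```r
# histogram with an overlaid normal curve, in the spirit of
# rcompanion::plotNormalHistogram (base-R version)
set.seed(11)
zres <- as.numeric(scale(rnorm(107)))  # stand-in for standardized residuals
hist(zres, prob=TRUE, xlab="Standardized residual", main="")
curve(dnorm(x, mean=0, sd=1), add=TRUE, col="blue", lwd=2)
```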


It is also possible to obtain numerical summary information on these new variables (and all original ones) using describe from the psych package. The residuals are slightly positively skewed. We might consider a scale transformation as a remedy.

kable(round(describe(data2,type=2), digits=2))
vars n mean sd median trimmed mad min max range skew kurtosis se
id 1 107 54.00 31.03 54.00 54.00 40.03 1.00 107.00 106.00 0.00 -1.20 3.00
stress 2 107 21.29 12.49 20.00 20.49 11.86 1.00 58.00 57.00 0.64 -0.08 1.21
symptoms 3 107 90.33 18.81 88.00 88.87 17.79 58.00 147.00 89.00 0.77 0.60 1.82
fit1.resid 4 107 0.00 16.01 -0.78 -0.87 12.72 -38.27 47.42 85.69 0.56 0.49 1.55
fit1.pred 5 107 90.33 9.87 89.31 89.70 9.37 74.30 119.33 45.03 0.64 -0.08 0.95
fit1.zresid 6 107 0.00 1.00 -0.05 -0.05 0.79 -2.39 2.96 5.35 0.56 0.49 0.10
fit1.zpred 7 107 0.00 1.00 -0.10 -0.06 0.95 -1.62 2.94 4.56 0.64 -0.08 0.10

7.3 Visually Examine the yhats

Now that we have extracted the yhats (predicted scores), we can revisit the initial bivariate scatterplot and redraw it with additional information. Displaying the yhats as a rug plot gives a useful visual perspective.

rangex <- max(data2$stress) - min(data2$stress)
rangey <- max(data2$symptoms) - min(data2$symptoms)
lft <- min(data2$stress) - (.08*rangex)
rght <- max(data2$stress) + (.03*rangex)
yhatpos <- max(data2$stress) + (.055*rangex)
plot(stress, symptoms,
     xlim=c(lft, rght),
     xlab="Perceived Stress",
     ylab="Mental Health Symptoms")
abline(lm(symptoms~stress), col="black") # regression line (y~x)
rug(jitter(fit1.pred), side=4, col="red", pos=yhatpos)
text(min(data2$stress) + (.75*rangex),
     min(data2$symptoms) + (.05*rangey),
     "Red Rugplot is Yhat values", cex=.7, col="red")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(stress, horizontal=TRUE, axes=FALSE)
mtext("Scatterplot & Linear Regression Plus Univariate Boxplots", side=3, outer=TRUE, line=-3)

Notice that the yhats are not as dispersed as the Y values. This is to be expected and reflects the "regression to the mean" typical of OLS. When the correlation is 1.0, the yhats will match the Y values. When the xy correlation is zero, there is no variation in the yhats; they will all equal the mean of Y. Also recall that \({{r}_{y\overset{\scriptscriptstyle\frown}{y}}}={{r}_{xy}}\). The reader can also explore yhat distributions interactively in a shiny app.
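This compression of the yhats can be made precise: since the OLS slope is r(sy/sx), the SD of the yhats is exactly |r| times the SD of Y. A quick sketch with simulated data:

```r
# the SD of the fitted values is |r| times the SD of Y:
# sd(yhat) = |b| * sd(x) = |r| * (sd(y)/sd(x)) * sd(x) = |r| * sd(y)
set.seed(7)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
fit <- lm(y ~ x)
all.equal(sd(fitted(fit)), abs(cor(x, y)) * sd(y))  # TRUE
```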

7.4 Further evaluation and tests of Residual Homoscedasticity and Normality

The plot of residuals against yhats was generated above with the plot(fit1) code. With the extracted zresids and zpreds, we can generate that plot quickly. The plot gives somewhat of a hint that the residual variance might be heteroscedastic. For positive yhat values, the residual variation (vertical spread in the graph) seems to be larger. However, below, we will examine a way to test it.

plot(fit1.zpred, fit1.zresid)

Testing the homoscedasticity assumption in R can be accomplished in several ways. The two most common are the Breusch-Pagan test and the NCV (non-constant variance) test.

lmtest::bptest(fit1)
##  studentized Breusch-Pagan test
## data:  fit1
## BP = 6.1457, df = 1, p-value = 0.01317
car::ncvTest(fit1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 7.407101, Df = 1, p = 0.0064967

Testing the assumption of residual normality can be accomplished with many different tests in R. One very commonly used test is the Shapiro-Wilk test. The slight non-normality visualized with the frequency histogram of the residuals and skewness computation above is found to be significant with this test. We can also test the original DV.

shapiro.test(fit1.resid)
##  Shapiro-Wilk normality test
## data:  fit1.resid
## W = 0.97043, p-value = 0.01727
shapiro.test(symptoms)
##  Shapiro-Wilk normality test
## data:  symptoms
## W = 0.95918, p-value = 0.00232

A test of normality that has been highly recommended is the Anderson-Darling test, available in the nortest package.

nortest::ad.test(fit1.resid)
##  Anderson-Darling normality test
## data:  fit1.resid
## A = 1.1929, p-value = 0.003932
nortest::ad.test(symptoms)
##  Anderson-Darling normality test
## data:  symptoms
## A = 0.93263, p-value = 0.01738

8 Conclusions about assumptions and remedies

Even though the degree of heteroscedasticity and residual non-normality was not extreme, there is some indication that these crucial assumptions might not be satisfied in the bivariate population system of these two variables. It is important to be able to evaluate these assumptions both graphically and inferentially. But what options does one have if the assumptions are violated? Several possibilities exist: scale transformation of the DV and perhaps the IV; robust methods; and resampling methods such as bootstrapping. Sections on these topics are found below.
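As a preview of the resampling idea, a percentile bootstrap for a correlation takes only a few lines: resample cases with replacement, recompute r each time, and take quantiles of the resulting distribution. This is a minimal sketch on simulated data (not the Howell data), with the number of resamples chosen arbitrarily:

```r
# percentile bootstrap 95% CI for a correlation coefficient
set.seed(123)
x <- rnorm(107)
y <- 0.5 * x + rnorm(107)
boot_r <- replicate(2000, {
  i <- sample(seq_along(x), replace = TRUE)  # resample whole cases
  cor(x[i], y[i])
})
quantile(boot_r, c(.025, .975))  # percentile CI limits
```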

8.1 Influence Analysis

Foreshadowing things we will do next semester, we can obtain statistics on "influence analysis" for this model. This simply introduces the notion that influence analysis is an integral part of basic multiple regression work, and it is included here as a placeholder to remind us of that next semester. The influence.measures function provides several indices of influence - the output is long, so it is not included here. The influencePlot function from car provides a revision of the residuals vs yhat plot in which the points are scaled to the size of the influence statistic for each case.

# This is a standard plot of residuals against yhats.
# The area of the points is proportional to the degree of influence of each case;
# this topic will be covered later.  Cook's D is used to scale the points.
car::influencePlot(fit1)

9 Scale transformations

Two general options for scale transformations exist. One is to search for scale transformations of Y and X that best produce a sample distribution that approximates a normal. This might implement usage of Tukey’s “ladder of powers” or it might be based on more advanced criteria. A second major approach is to use the Box-Cox transformation algorithm.

Historically, use of transformations in regression is most commonly applied to situations where positive skewness exists in the variables, at times a result of the properties of the measurement instrument. Researchers have most often resorted to using either a square root transformation or a base 10 log transformation. At times other transformations are tried, including natural logs and other fractional exponents, implementing Tukey’s ladder of powers, including fractional and negative powers (e.g., an exponent of .5 gives the square root, but a -.5 could also be considered). These have typically been chosen with a trial and error approach. Some new capabilities in R have attempted to automate this approach and one is outlined here.

The modern data science world has a need for normalizing variables used in machine learning prediction (ultimately an extensive elaboration of simple regression to multiple predictors in large data sets with many potential IVs). There is an R package that implements several approaches to finding good scale transformations for dependent and independent variables (sometimes called covariates in the data science world). The bestNormalize package attempts to put several different approaches together and compare them to produce a recommendation about the best transformation. The goal is a scale change that yields the best fit to a normal distribution. Use of the bestNormalize functions should only be attempted after a careful reading of the package vignette.

These approaches are atheoretical and, as such, subject to criticism. Some caution is also advised since the results of this illustration can change if the random number seed is changed. Simply re-running this same code without setting the seed can produce a different recommendation. Nonetheless, it is an interesting way to save time while considering many possible transformations.

The bestNormalize function works on one variable at a time; the DV is illustrated here. I have set the random number seed to produce a replicable outcome for purposes of this document. Note that the best recommendation for transforming the scale of the DV (symptoms) is actually a square root transformation. Comparative information for a larger class of possible transformations is also provided.

bestNormalize(symptoms, allow_lambert_s = T)
## Best Normalizing transformation with 107 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 1.0232
##  - Box-Cox: 1.1417
##  - Center+scale: 1.0762
##  - Double Reversed Log_b(x+a): 1.7664
##  - Exp(x): 12.379
##  - Lambert's W (type s): 1.1635
##  - Log_b(x+a): 1.0232
##  - orderNorm (ORQ): 1.2064
##  - sqrt(x + a): 0.9366
##  - Yeo-Johnson: 1.1417
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
## Based off these, bestNormalize chose:
## Standardized sqrt(x + a) Transformation with 107 nonmissing obs.:
##  Relevant statistics:
##  - a = 0 
##  - mean (before standardization) = 9.455023 
##  - sd (before standardization) = 0.9687161

The Box-Cox transformation capability is typically not applied to a single variable in isolation. Most commonly, the Box-Cox approach is designed to find a transformation of the DV that best results in residual normality of a regression model. So, an lm fit object is passed to the boxCox function - several R packages permit this approach, and the one used here is from the car package. In regression, the critical normality assumption is normality of the residuals. Although normality of both IV and DV in a bivariate normal distribution is sufficient to produce residual normality, it is not necessary. It is the DV distribution shape that most directly influences the residual distribution shape, so the IV distribution is typically not an issue of interest in regression work.

The function produces a graph that is not interpretable without some explanation. The x axis reflects various possible \(\lambda\) values to be used in the transformation. The actual transformation is

\[{{y}_{BCtransformed}}=\frac{{{y}^{\lambda }}-1}{\lambda }\]

If \(\lambda\) is zero, a log transform is used: the limit of the expression above as \(\lambda\) approaches zero is the natural log of y.
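The formula is simple enough to implement directly. Here is a small sketch (box_cox is a hypothetical helper written for illustration, not a function from any package):

```r
# the Box-Cox transformation: (y^lambda - 1)/lambda, with the
# natural log used at lambda = 0 (the limit of the formula)
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # the basic form assumes positive y
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
box_cox(4, 0.5)     # (sqrt(4) - 1) / 0.5 = 2
box_cox(exp(1), 0)  # log(e) = 1
```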

This approach assumes that all Y values are positive. If some or all y values are negative, the user could simply shift the scores by adding a constant (e.g., the absolute value of the minimum, plus 1). But most software implementations incorporate Box and Cox's alternative formulation for the case where some Y values are negative.

The Y axis of the graph provides an index reflecting how well each \(\lambda\) value produces a satisfactory regression fit with residual normality. In our example, the plot peaks when \(\lambda\) is approximately -.25. The plot also gives a range of values in which confidence is high that the \(\lambda\) value produces reasonable normality. The points at which the parabola intersects the 95% confidence level provide a range of plausible \(\lambda\) values. It is usually recommended to use a "nice" value near some interpretable point such as a simple fraction; -.25 is close to -1/4 or -1/3, so those could be good choices. But zero might be a better choice, and that would result in doing a log transformation. Some subjectivity is obviously involved.

car::boxCox(fit1)

But how can we find the exact recommended \(\lambda\) at the peak of the curve? If we save the boxCox object, it contains the x (\(\lambda\)) and y (log-likelihood) values that were plotted. If we sort by the y values, we can extract the x value associated with the highest y value.

bc1 <- car::boxCox(fit1)
# extract the lambda (x) associated with the maximum y
bc1$x[which.max(bc1$y)]