Chapter 13 Extensions to larger models and to categorical IVs
In this chapter, the goal is to extend the suite of methods to models with additional characteristics. This includes a model with more than two IVs, and toward the end we will also consider using a categorical IV. At the same time, this chapter emphasizes some strong capabilities of rmarkdown and the "tidyverse" for generating nicer tables. Some of that has been seen in other chapters with the kable and gt functions, but it is expanded here.
The capability of rmarkdown, bookdown, and knitr to produce a document such as this one is accompanied by increased options for formatting the output. The typical text output from R functions can be converted to styles/fonts that are more readable. A few brief examples are illustrated in this section. The kable function permits enhancement of tables such as those coming from summary and anova. Capabilities in the broom package also permit enhancement of the standard output coming from summary. The reader should recognize that whole documents/manuscripts can be written in rmarkdown, even in APA style.
13.1 Evaluate a 3-IV model
We will use the same data set that we used above, with the enhanced output and formatting, but extend the analysis to three IVs. The reader should note that adding a third variable is done simply in the model specification argument of the lm function, and that functions such as summary and anova or Anova are used the same way as previously executed:
fit6 <- lm(salary~pubs+cits+degree_yrs, data=cohen1) # add degree_yrs as a 3rd IV to our original pubs+cits model
summary(fit6)
##
## Call:
## lm(formula = salary ~ pubs + cits + degree_yrs, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13907.1 -4313.8 -649.5 4366.1 21165.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38967.85 2394.31 16.275 < 2e-16 ***
## pubs 93.61 85.35 1.097 0.277273
## cits 204.06 56.97 3.582 0.000699 ***
## degree_yrs 874.46 283.89 3.080 0.003160 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7031 on 58 degrees of freedom
## Multiple R-squared: 0.5011, Adjusted R-squared: 0.4753
## F-statistic: 19.42 on 3 and 58 DF, p-value: 7.727e-09
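The two ANOVA tables below come from chunks that are not echoed; presumably the calls were along these lines (Anova is from the car package; fit5 is just an illustrative name for the earlier two-IV model):

library(car)
Anova(fit6, type=3)                            # Type III tests for each IV
fit5 <- lm(salary~pubs+cits, data=cohen1)      # the earlier two-IV model
anova(fit5, fit6)                              # does adding degree_yrs improve the fit?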
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 1.3093e+10 1 264.8822 < 2.2e-16 ***
## pubs 5.9458e+07 1 1.2029 0.2772728
## cits 6.3412e+08 1 12.8291 0.0006989 ***
## degree_yrs 4.6897e+08 1 9.4878 0.0031599 **
## Residuals 2.8669e+09 58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: salary ~ pubs + cits
## Model 2: salary ~ pubs + cits + degree_yrs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 59 3335822387
## 2 58 2866853951 1 468968436 9.4878 0.00316 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is also useful to examine the Variance Inflation Factors in multiple regression models. Unlike the two-IV model, where both IVs necessarily have the same tolerance and thus the same VIF, each IV now has a different VIF since each has a different tolerance (VIF is the reciprocal of tolerance). Note that the IV with the smallest tolerance has the largest VIF.
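The chunk producing the VIF values below is not echoed; a minimal way to obtain them directly is with the vif function (the mrinfo output above credits HH::vif; car::vif returns the same values for a model such as this one):

library(HH)
vif(fit6)  # variance inflation factors; each equals 1/tolerance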
## pubs cits degree_yrs
## 1.763010 1.181128 1.820156
Use the ‘tidy’ function from the ‘broom’ package to produce an alternative to the ‘summary’ table, and format it with ‘kable’:
fit6.tidy <- tidy(fit6) # tidy produces a nicer table than summary
kable(fit6.tidy) # kable converts the table to markdown format that is nicer than the default
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 38967.84674 | 2394.30824 | 16.275201 | 0.0000000 |
pubs | 93.60794 | 85.34837 | 1.096775 | 0.2772728 |
cits | 204.06019 | 56.97179 | 3.581776 | 0.0006989 |
degree_yrs | 874.46140 | 283.89492 | 3.080229 | 0.0031599 |
I actually prefer using the gt function for tables whenever I can, but sometimes gt doesn't play nicely with rmarkdown when rendering to a pdf document. For example, using it in the code chunk here, with a tidy object, creates a problem. I show the code commented out so that you can see how to use it in other output formats.
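A minimal sketch of that commented usage, assuming only that gt is applied to the tidy object created above:

# library(gt)
# gt(fit6.tidy)  # renders nicely to html, but can fail when knitting to pdf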
Use the glance function from the 'broom' package to obtain some additional information on the model, and then use kable to format the output tables. Note that I've relabeled the components of the glance object to fit better with our typically used notation schemes. I also broke the table into two components to better fit on the page. That strategy produced a printing of the p value that lost the exponential notation (it shows as zero in the first table), and I've not yet figured out how to prevent that; it appears to be a rounding issue in kable.
#library(broom)
glancefit6a <- glance(fit6) # additional info on the model
glancefit6at1 <- glancefit6a[1:5]
#glancefit6at1
glancefit6at2 <- glancefit6a[6:11]
#glancefit6at2
kable(glancefit6at1,
col.names=c("R.squared", "adj.R.squared", "RMSE","F value", "p.value"))
R.squared | adj.R.squared | RMSE | F value | p.value |
---|---|---|---|---|
0.5011234 | 0.4753195 | 7030.542 | 19.42041 | 0 |
kable(glancefit6at2,
col.names=c("df regression","logLik", "AIC", "BIC", "deviance", "df residual"),digits=3)
df regression | logLik | AIC | BIC | deviance | df residual |
---|---|---|---|---|---|
3 | -635.104 | 1280.208 | 1290.844 | 2866853951 | 58 |
Notes on using glance:
- Users of the broom package should be aware that what glance previously reported as "df" was not the expected regression df (equal to the number of IVs). Instead it reported one more than that, including the df for the intercept. In the model above with 3 IVs, that apparently erroneous df was reported as 4, which is actually the rank of the design matrix. But the F test was always the standard MSreg/MSresidual, so it was confusing. That has been corrected in the version used here (ver 0.7.0), so the 3 df seen in the table above is the correct DFregression for a three-IV model. Users who installed the package prior to the correction of this "feature" should update their installation. (This document was created with R 4.0.2.)
- The standard glance table is too wide; it has too many columns/values. So I extracted two different subsets of its components and produced two separate tables with kable (the plain, unformatted prints are left commented out in the chunk).
- I also relabeled the columns of the first table, via the col.names argument to kable, to be more in line with our standard terminology (RMSE is root mean square error, or the square root of MS residual).
Next, we use ‘tidy’ and ‘kable’ to obtain a nicely formatted anova summary table:
tidyanova6 <- tidy(anova(fit6))
kable(tidyanova6,
col.names=c("Source", "df", "SS", "MS", "F value", "p value"))
Source | df | SS | MS | F value | p value |
---|---|---|---|---|---|
pubs | 1 | 1472195326 | 1472195326 | 29.784332 | 0.0000010 |
cits | 1 | 938602110 | 938602110 | 18.989081 | 0.0000544 |
degree_yrs | 1 | 468968436 | 468968436 | 9.487811 | 0.0031599 |
Residuals | 58 | 2866853951 | 49428516 | NA | NA |
Or use the ‘Anova’ function:
tidyAnova6 <- tidy(Anova(fit6, type=3))
kable(tidyAnova6,
col.names=c("Source", "SS", "df", "F value", "p value"))
Source | SS | df | F value | p value |
---|---|---|---|---|
(Intercept) | 13092731852 | 1 | 264.882153 | 0.0000000 |
pubs | 59458298 | 1 | 1.202915 | 0.2772728 |
cits | 634124345 | 1 | 12.829119 | 0.0006989 |
degree_yrs | 468968436 | 1 | 9.487811 | 0.0031599 |
Residuals | 2866853951 | 58 | NA | NA |
Now add information from the 'mrinfo' function that was discussed above, which extracts the additional quantities we need for a fuller analysis than summary/anova/Anova provide.
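The call itself is not echoed; it is presumably just the following (assuming mrinfo takes the fitted lm object, as when the function was introduced):

mrinfo(fit6)  # beta weights, structure/partial/semipartial rs, commonality fractions, VIFs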
## [[1]]
## NULL
##
## $`supplemental information`
## beta wt structure r partial r semipartial r tolerances unique
## pubs 0.13506 0.71500 0.14254 0.10172 0.56721 0.01035
## cits 0.36102 0.77662 0.42559 0.33219 0.84665 0.11035
## degree_yrs 0.38541 0.85873 0.37495 0.28567 0.54940 0.08161
## common total
## pubs 0.24584 0.25618
## cits 0.19190 0.30224
## degree_yrs 0.28793 0.36954
##
## $`var infl factors from HH:vif`
## pubs cits degree_yrs
## 1.763010 1.181128 1.820156
##
## [[4]]
## NULL
13.1.1 Conclusions from the three-IV model
Several interesting outcomes emerged with the addition of degree_yrs to the model that already included pubs and cits. The multiple R-squared increased to about .50, so degree_yrs did appear to improve predictability. But an interesting thing happened to the effect of pubs in this three-IV model. The regression coefficient was about 251 in the two-IV model, but it dropped to about 94 in this three-IV model. Pubs was not a significant predictor in the three-IV model, and its unique fraction of variance dropped to about 1% of the DV variation. Most of the overlap of pubs with salary was shared with the other IVs (the common fraction of variance was about .25, nearly all of its zero-order overlap with salary). This suggests that degree_yrs is strongly correlated with pubs. Let's examine that correlation.
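The correlation printed below was presumably obtained with something like:

cor(cohen1$pubs, cohen1$degree_yrs)  # zero-order correlation of pubs with degree_yrs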
## [1] 0.6505472
A better model might be one that simply leaves pubs out since degree_yrs absorbed most of its shared variance.
fit6b <- lm(salary~cits+degree_yrs,data=cohen1)
fit6b.tidy <- tidy(fit6b) # tidy produces a nicer table than summary
kable(fit6b.tidy)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 39073.6747 | 2396.47360 | 16.304655 | 0.0000000 |
cits | 212.1116 | 56.59392 | 3.747958 | 0.0004078 |
degree_yrs | 1061.7642 | 227.17563 | 4.673759 | 0.0000176 |
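The supplemental information shown below presumably comes from the same mrinfo call, now applied to the revised model:

mrinfo(fit6b)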
## [[1]]
## NULL
##
## $`supplemental information`
## beta wt structure r partial r semipartial r tolerances unique
## cits 0.37526 0.78476 0.43852 0.3482 0.86094 0.12124
## degree_yrs 0.46796 0.86773 0.51981 0.4342 0.86094 0.18853
## common total
## cits 0.181 0.30224
## degree_yrs 0.181 0.36954
##
## $`var infl factors from HH:vif`
## cits degree_yrs
## 1.161517 1.161517
##
## [[4]]
## NULL
glancefit6b <- glance(fit6b)
glancefit6bt1 <- glancefit6b[1:5]
#glancefit6bt1
glancefit6bt2 <- glancefit6b[6:11]
#glancefit6bt2
kable(glancefit6bt1,
col.names=c("R.squared", "adj.R.squared", "RMSE","F value", "p.value"))
R.squared | adj.R.squared | RMSE | F value | p.value |
---|---|---|---|---|
0.4907768 | 0.473515 | 7042.621 | 28.43137 | 0 |
kable(glancefit6bt2,
col.names=c("df regression","logLik", "AIC", "BIC", "deviance", "df residual"),digits=3)
df regression | logLik | AIC | BIC | deviance | df residual |
---|---|---|---|---|---|
2 | -635.74 | 1279.481 | 1287.989 | 2926312249 | 59 |
The revised two-IV model with degree_yrs and cits as IVs had an R-squared nearly as large as the three-IV model, and the AIC/BIC indices were slightly smaller, indicating a better model. Both degree_yrs and cits were significant IVs as tested by the t-tests. Thus the most parsimonious model might be this revised two-IV model. Perhaps pubs was only a good IV because it was so highly correlated with degree_yrs, and it is the latter that is the more important predictor.
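A quick side-by-side comparison of the information criteria for the two candidate models (a sketch; AIC and BIC in base R accept multiple fitted models):

AIC(fit6, fit6b)  # fit6b has the lower (better) value
BIC(fit6, fit6b)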
13.2 Categorical IVs
We can do a quick preview of using categorical IVs in linear models. The cohen data set also has a categorical variable, gender, that could potentially serve as an IV. However, it is a factor, and its values in the data set are the strings "female" and "male". How could that be used as an IV in regression? The answer is that factors can be recoded to numeric values with specially constructed codes. For a factor such as gender with only two levels, the code is simply a one and a zero. This is already built into the characteristics of the factor that is stored in the data frame. These numeric codes are called contrasts in R:
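The contrast matrix printed below can be requested directly with the base R contrasts function:

contrasts(cohen1$gender)  # default treatment coding: female = 0, male = 1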
## male
## female 0
## male 1
With that knowledge, we now see that lm can handle such categorical variables by using the numeric contrast code. First I will illustrate with a simple regression using only gender as the IV.
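The chunk is not echoed; it is presumably equivalent to this (fit7 is just an illustrative name):

fit7 <- lm(salary~gender, data=cohen1)  # gender enters via its 0/1 contrast code
summary(fit7)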
##
## Call:
## lm(formula = salary ~ gender, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18660.3 -5819.8 -3.7 4769.2 26903.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52650 1810 29.080 <2e-16 ***
## gendermale 3949 2445 1.615 0.111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9580 on 60 degrees of freedom
## Multiple R-squared: 0.04168, Adjusted R-squared: 0.0257
## F-statistic: 2.609 on 1 and 60 DF, p-value: 0.1115
It appears that gender is not a strong predictor of salary (R-squared is small, and the t-test is NS). But let's focus on two additional points. One is notation. Observe that the name of the gender variable in the coefficients table is not simply gender, the IV name; rather, it appears as "gendermale". That notation reminds us that gender is a factor, and the reason it is called gendermale rather than genderfemale is that male was coded with a 1, as seen above. The zero and one assignments are arbitrary and could be reversed, as the sketch below shows. We will soon extend this conversation with more complex categorical IVs that have more than two categories.
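A quick sketch of reversing the coding (relevel is base R; gender2 is a hypothetical new variable):

cohen1$gender2 <- relevel(cohen1$gender, ref="male")  # male becomes the reference (0) level
contrasts(cohen1$gender2)                             # female is now coded 1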
The second point of emphasis is the thinking that surrounds the question of what it means to say that gender is "correlated" with salary, or that it can "predict" salary. Perhaps a better phrasing is that we have assessed whether gender is "associated" with salary. We have concluded that it is not, but what if it were? That could only mean that average salaries differ between females and males. And that sounds just like a two-independent-samples t-test situation. So let's do that test.
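The output below is labeled "Two Sample t-test" rather than the default Welch test, so the call presumably used var.equal=TRUE:

t.test(salary~gender, data=cohen1, var.equal=TRUE)  # pooled-variance t-test, matching the lm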
##
## Two Sample t-test
##
## data: salary by gender
## t = -1.6153, df = 60, p-value = 0.1115
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8839.888 941.241
## sample estimates:
## mean in group female mean in group male
## 52650.00 56599.32
Note that the male mean is about $4000 higher than the female mean, but the difference was not significant. A closer examination of the output reveals interesting coincidences: the df for the t, the t-value itself, and the p value all match what was reported for the regression coefficient in the lm fit above. This is no coincidence, because the "association" question approached with regression is tantamount to the mean-difference question posed by the independent-samples t-test. Soon enough, we will work through the technical reasons that these are the same inference.
For one final illustration in this chapter, we now build a regression model that adds gender to the earlier model that had degree_yrs and citations as IVs.
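Presumably (fit8 is an illustrative name):

fit8 <- lm(salary~degree_yrs+cits+gender, data=cohen1)
summary(fit8)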
##
## Call:
## lm(formula = salary ~ degree_yrs + cits + gender, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14214.6 -4590.3 -312.3 4014.9 21917.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38787.21 2478.78 15.648 < 2e-16 ***
## degree_yrs 1040.41 232.58 4.473 3.64e-05 ***
## cits 210.14 57.09 3.681 0.000512 ***
## gendermale 931.06 1859.70 0.501 0.618513
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7088 on 58 degrees of freedom
## Multiple R-squared: 0.493, Adjusted R-squared: 0.4667
## F-statistic: 18.8 on 3 and 58 DF, p-value: 1.227e-08
In this model gender is again not a significant IV, and its regression coefficient is smaller than it was in the simple regression, indicating that any variance it shared with salary was at least partially confounded with degree_yrs and cits.
One final comment about this type of modeling with a variable such as gender: it might be an interesting research question whether the predictability that IVs such as degree_yrs and cits have on salary differs across the two levels of a categorical variable such as gender. In other words, does the influence of degree_yrs and cits depend on whether we examine males or females? This is what we will come to call an interaction question, and it can be modeled easily with lm. I will show the code and results for the three-IV model with all interactions, but will save comment until we cover the interaction topic more explicitly.
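A sketch of the call (the * operator expands to all main effects plus all two- and three-way interactions; fit9 is an illustrative name):

fit9 <- lm(salary~degree_yrs*cits*gender, data=cohen1)  # all interactions included
summary(fit9)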
##
## Call:
## lm(formula = salary ~ degree_yrs * cits * gender, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13966.3 -4404.6 -773.2 3657.9 22513.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26877.05 8716.17 3.084 0.00322 **
## degree_yrs 3715.49 1591.34 2.335 0.02330 *
## cits 632.12 243.37 2.597 0.01208 *
## gendermale 11287.69 10484.27 1.077 0.28643
## degree_yrs:cits -89.18 45.11 -1.977 0.05315 .
## degree_yrs:gendermale -2442.09 1793.28 -1.362 0.17892
## cits:gendermale -403.30 275.46 -1.464 0.14897
## degree_yrs:cits:gendermale 86.42 47.53 1.818 0.07457 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7051 on 54 degrees of freedom
## Multiple R-squared: 0.5328, Adjusted R-squared: 0.4722
## F-statistic: 8.798 on 7 and 54 DF, p-value: 3.563e-07