Chapter 13 Extensions to larger models and to categorical IVs

The goal of this chapter is to extend the suite of methods to models with additional characteristics: first a model with more than two IVs, and then, toward the end, a categorical IV. Along the way, this chapter emphasizes capabilities in rmarkdown and the “tidyverse” for generating nicer tables. Some of this has been seen in other chapters with the kable and gt functions, but it is expanded here.

The capability of rmarkdown, bookdown and knitr to produce a document such as this one is accompanied by increased options for formatting the output. The typical text output from R functions can be converted to styles/fonts that are more readable; a few brief examples are illustrated in this section. The kable function permits enhancement of tables such as those coming from summary and anova, and capabilities in the broom package also permit enhancement of the standard output coming from summary. The reader should recognize that whole documents/manuscripts can be written in rmarkdown, even in APA style.
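
The table examples in this chapter rely on a few packages. A minimal setup sketch (package names only; the mrinfo function used below was introduced in an earlier chapter and is assumed to be available):

library(knitr)   # kable()
library(broom)   # tidy() and glance()
library(car)     # Anova(); a vif() is available here and in the HH package
#library(gt)     # gt(); see the note on pdf rendering below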

13.1 Evaluate a 3-IV model

We will use the same data set as above, along with the enhanced output and formatting, but extend the analysis to three IVs. Note that adding a third variable requires only a change to the model formula passed to the lm function; functions such as summary and anova or Anova are then used exactly as previously executed:

fit6 <- lm(salary~pubs+cits+degree_yrs, data=cohen1)  # add degree_yrs as a 3rd IV to our original pubs+cits model
summary(fit6)
## 
## Call:
## lm(formula = salary ~ pubs + cits + degree_yrs, data = cohen1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13907.1  -4313.8   -649.5   4366.1  21165.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38967.85    2394.31  16.275  < 2e-16 ***
## pubs           93.61      85.35   1.097 0.277273    
## cits          204.06      56.97   3.582 0.000699 ***
## degree_yrs    874.46     283.89   3.080 0.003160 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7031 on 58 degrees of freedom
## Multiple R-squared:  0.5011,	Adjusted R-squared:  0.4753 
## F-statistic: 19.42 on 3 and 58 DF,  p-value: 7.727e-09
Anova(fit6, type=3)
## Anova Table (Type III tests)
## 
## Response: salary
##                 Sum Sq Df  F value    Pr(>F)    
## (Intercept) 1.3093e+10  1 264.8822 < 2.2e-16 ***
## pubs        5.9458e+07  1   1.2029 0.2772728    
## cits        6.3412e+08  1  12.8291 0.0006989 ***
## degree_yrs  4.6897e+08  1   9.4878 0.0031599 ** 
## Residuals   2.8669e+09 58                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit3,fit6)
## Analysis of Variance Table
## 
## Model 1: salary ~ pubs + cits
## Model 2: salary ~ pubs + cits + degree_yrs
##   Res.Df        RSS Df Sum of Sq      F  Pr(>F)   
## 1     59 3335822387                               
## 2     58 2866853951  1 468968436 9.4878 0.00316 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
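
Incidentally, because the two models differ by only a single IV (degree_yrs), the model-comparison F of 9.4878 is simply the square of the t statistic for degree_yrs in the full-model summary. A quick check:

(summary(fit6)$coefficients["degree_yrs", "t value"])^2  # 9.4878, matching the F above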

It is also useful to examine the Variance Inflation Factors (VIFs) in multiple regression models. Unlike in the two-IV model, where both IVs necessarily share the same VIF, each IV now has its own VIF since each has a different tolerance. Note that the IV with the smallest tolerance has the largest VIF.

vif(fit6)
##       pubs       cits degree_yrs 
##   1.763010   1.181128   1.820156
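
Each VIF is simply the reciprocal of the corresponding tolerance, where the tolerance is 1 minus the R-squared from regressing that IV on the remaining IVs. A minimal sketch verifying this for pubs:

r2.pubs <- summary(lm(pubs~cits+degree_yrs, data=cohen1))$r.squared  # pubs regressed on the other IVs
1/(1 - r2.pubs)  # reproduces the pubs VIF of 1.763 (its tolerance was .56721 in the mrinfo output)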

Use the ‘tidy’ function from the ‘broom’ package to produce an alternative to the ‘summary’ table, and format it with ‘kable’:

fit6.tidy <- tidy(fit6) # tidy produces a nicer table than summary
kable(fit6.tidy) # kable converts the table to markdown format that is nicer than the default
term          estimate      std.error    statistic   p.value
(Intercept)   38967.84674   2394.30824   16.275201   0.0000000
pubs             93.60794     85.34837    1.096775   0.2772728
cits            204.06019     56.97179    3.581776   0.0006989
degree_yrs      874.46140    283.89492    3.080229   0.0031599

I actually prefer using the gt function for tables whenever I can, but gt sometimes doesn’t play nicely with rmarkdown when rendering to a pdf document. For example, using it in the code chunk here, with a tidy object, creates a problem. I show the code commented out so that you can see how to use it in other output formats, where it would produce a table of the same analysis.

#gt(fit6.tidy) # `gt` produces a formatted table that is nicer than the default
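
When rendering to html, the gt table can also be titled and formatted further. A minimal sketch (tab_header is one of gt’s labeling functions; the code is left commented here for the same pdf-rendering reason):

#library(gt)
#gt(fit6.tidy) %>%
#  tab_header(title="Coefficients for the three-IV salary model")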

Use the glance function from the ‘broom’ package to obtain some additional information about the model, and then use ‘kable’ to format the output tables. Note that I’ve relabeled the components of the glance object to fit better with our typically used notation schemes. I also broke the table into two components to better fit on the page. This latter strategy produced a printing of the p value that lost the exponential notation (it shows as 0 below rather than the 7.727e-09 seen in the summary output), and I’ve not yet figured out how to prevent that; it appears to be an issue with kable.

#library(broom)
glancefit6a <- glance(fit6) # additional info on the model
glancefit6at1 <- glancefit6a[1:5]
#glancefit6at1
glancefit6at2 <- glancefit6a[6:11]
#glancefit6at2
kable(glancefit6at1,
      col.names=c("R.squared", "adj.R.squared", "RMSE","F value", "p.value"))
R.squared   adj.R.squared   RMSE       F value    p.value
0.5011234   0.4753195       7030.542   19.42041   0
kable(glancefit6at2,
      col.names=c("df regression","logLik", "AIC", "BIC", "deviance", "df residual"),digits=3)
df regression   logLik     AIC        BIC        deviance     df residual
3               -635.104   1280.208   1290.844   2866853951   58

Notes on using glance:

  1. Users of the broom package should be aware that what glance previously reported as “df” was not the expected regression df, which would equal the number of IVs. Instead it reported one more than that, including the df for the intercept. In the model above with 3 IVs, that apparently erroneous df was reported as 4, which is actually the rank of the design matrix. But the F test was always the standard MSreg/MSresidual, so it was confusing. That has been corrected in the most recent version used here (ver 0.7.0; this document was created with R 4.0.2), so that the 3 df seen in the table here is the correct DFregression for a three-IV model. Users who installed the package prior to the correction of this “feature” should update their installation. A quick check appears after this list.
  2. The standard glance table is too wide - it has too many columns/values. So, I extracted two different subsets of its components and produced the two kable tables seen above.
  3. I also relabeled the columns of the first table to be more in line with our standard terminology (RMSE is root mean square error, or the square root of MS residual).
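
The quick check referenced in note 1: under broom 0.7.0 or later, the df reported by glance should match the numerator df of the overall F test from summary. A minimal sketch:

glance(fit6)$df           # 3 under broom >= 0.7.0 (4 under older versions)
summary(fit6)$fstatistic  # value, numdf, dendf: 19.42 on 3 and 58 df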

Next, we use ‘tidy’ and ‘kable’ to obtain a nicely formatted anova summary table:

tidyanova6 <- tidy(anova(fit6))
kable(tidyanova6,
      col.names=c("Source", "df", "SS", "MS", "F value", "p value"))
Source       df   SS           MS           F value     p value
pubs          1   1472195326   1472195326   29.784332   0.0000010
cits          1    938602110    938602110   18.989081   0.0000544
degree_yrs    1    468968436    468968436    9.487811   0.0031599
Residuals    58   2866853951     49428516          NA        NA

Or use the ‘Anova’ function:

tidyAnova6 <- tidy(Anova(fit6, type=3))
kable(tidyAnova6,
      col.names=c("Source", "SS", "df", "F value", "p value"))
Source        SS            df   F value      p value
(Intercept)   13092731852    1   264.882153   0.0000000
pubs             59458298    1     1.202915   0.2772728
cits            634124345    1    12.829119   0.0006989
degree_yrs      468968436    1     9.487811   0.0031599
Residuals      2866853951   58           NA          NA

Now add the ‘mrinfo’ function that was discussed above, extracting the additional information that we need for a fuller analysis than summary/anova/Anova gives us.

mrinfo(fit6, minimal=TRUE)
## [[1]]
## NULL
## 
## $`supplemental information`
##            beta wt structure r partial r semipartial r tolerances  unique
## pubs       0.13506     0.71500   0.14254       0.10172    0.56721 0.01035
## cits       0.36102     0.77662   0.42559       0.33219    0.84665 0.11035
## degree_yrs 0.38541     0.85873   0.37495       0.28567    0.54940 0.08161
##             common   total
## pubs       0.24584 0.25618
## cits       0.19190 0.30224
## degree_yrs 0.28793 0.36954
## 
## $`var infl factors from HH:vif`
##       pubs       cits degree_yrs 
##   1.763010   1.181128   1.820156 
## 
## [[4]]
## NULL

13.1.1 Conclusions from the three-IV model

Several interesting outcomes emerged with the addition of degree_yrs to the model that already included pubs and cits. The multiple R-squared increased to about .50, so degree_yrs did appear to improve predictability. But an interesting thing happened to the effect of pubs in this three-IV model: the regression coefficient, about 251 in the two-IV model, dropped to about 94. Pubs was not a significant predictor in the three-IV model, and its unique fraction of variance dropped to about 1% of the DV variation. Most of the overlap of pubs with salary was shared with the other IVs (the common fraction of variance was about .25, or nearly all of its zero-order overlap with salary). This suggests that degree_yrs is strongly correlated with pubs. Let’s examine that correlation.

with(cohen1, cor(pubs,degree_yrs))
## [1] 0.6505472
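
The full correlation matrix among the three IVs makes the overlap pattern explicit. A quick sketch, using the cohen1 column names from the models above:

round(cor(cohen1[, c("pubs", "cits", "degree_yrs")]), 3)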

A better model might be one that simply leaves pubs out since degree_yrs absorbed most of its shared variance.

fit6b <- lm(salary~cits+degree_yrs,data=cohen1)
fit6b.tidy <- tidy(fit6b) # tidy produces a nicer table than summary
kable(fit6b.tidy)
term          estimate     std.error    statistic   p.value
(Intercept)   39073.6747   2396.47360   16.304655   0.0000000
cits            212.1116     56.59392    3.747958   0.0004078
degree_yrs     1061.7642    227.17563    4.673759   0.0000176
mrinfo(fit6b, minimal=TRUE)
## [[1]]
## NULL
## 
## $`supplemental information`
##            beta wt structure r partial r semipartial r tolerances  unique
## cits       0.37526     0.78476   0.43852        0.3482    0.86094 0.12124
## degree_yrs 0.46796     0.86773   0.51981        0.4342    0.86094 0.18853
##            common   total
## cits        0.181 0.30224
## degree_yrs  0.181 0.36954
## 
## $`var infl factors from HH:vif`
##       cits degree_yrs 
##   1.161517   1.161517 
## 
## [[4]]
## NULL
glancefit6b <- glance(fit6b)
glancefit6bt1 <- glancefit6b[1:5]
#glancefit6bt1
glancefit6bt2 <- glancefit6b[6:11]
#glancefit6bt2
kable(glancefit6bt1,
      col.names=c("R.squared", "adj.R.squared", "RMSE","F value", "p.value"))
R.squared   adj.R.squared   RMSE       F value    p.value
0.4907768   0.473515        7042.621   28.43137   0
kable(glancefit6bt2,
      col.names=c("df regression","logLik", "AIC", "BIC", "deviance", "df residual"),digits=3)
df regression   logLik    AIC        BIC        deviance     df residual
2               -635.74   1279.481   1287.989   2926312249   59

The revised two-IV model with degree_yrs and cits as IVs had an R-squared nearly as large as that of the three-IV model, and its AIC/BIC indices were slightly smaller, indicating a better model. Both degree_yrs and cits were significant IVs as tested by the t-tests. Thus the most parsimonious model might be this revised two-IV model. Perhaps pubs appeared to be a good IV only because it was so highly correlated with degree_yrs, and it is the latter that is the more important predictor.
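
A direct side-by-side comparison of the two candidate models is available from the AIC and BIC functions, which accept multiple fitted models (smaller values are better):

AIC(fit6, fit6b)  # 1280.2 vs 1279.5, slightly favoring the revised model
BIC(fit6, fit6b)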

13.2 Categorical IVs

We can do a quick preview of using categorical IVs in linear models. The cohen data set also has a categorical variable, gender, that could potentially serve as an IV. However, it is a factor, and its values in the data set are the string values “female” and “male”. How could that be used as an IV in regression? The answer is that factors can be recoded to numeric values with specially constructed codes. For a factor such as gender, with only two levels, the codes are simply a one and a zero. This is already built into the characteristics of the factor stored in the data frame. These numeric codes are called contrasts in R:

contrasts(cohen1$gender)
##        male
## female    0
## male      1

With that knowledge, we now see that lm can handle such categorical variables by using the numeric contrast code. First I will illustrate with a simple regression using only gender as the IV.

fit7 <- lm(salary~gender, data=cohen1)
summary(fit7)
## 
## Call:
## lm(formula = salary ~ gender, data = cohen1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18660.3  -5819.8     -3.7   4769.2  26903.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    52650       1810  29.080   <2e-16 ***
## gendermale      3949       2445   1.615    0.111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9580 on 60 degrees of freedom
## Multiple R-squared:  0.04168,	Adjusted R-squared:  0.0257 
## F-statistic: 2.609 on 1 and 60 DF,  p-value: 0.1115

It appears that gender is not a strong predictor of salary (R-squared is small, and the t-test is NS). But let’s focus on two additional points. One is notation. Observe that the name of the gender variable in the coefficients table is not simply gender, the IV name; rather, it appears as “gendermale”. That notation reminds us that gender is a factor, and the reason it is called gendermale rather than genderfemale is that male was coded with a 1, as seen above. The zero and one assignments are arbitrary and could be reversed (see the sketch below). We will soon extend this conversation with more complex categorical IVs that have more than two categories.
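
Because the 0/1 assignment follows the order of the factor levels, the relevel function can make male the reference (zero) category instead, in which case the coefficient would be labeled genderfemale. A minimal sketch (gender2 is a hypothetical copy, used to avoid altering the original variable):

cohen1$gender2 <- relevel(cohen1$gender, ref="male")  # male becomes the reference level
contrasts(cohen1$gender2)                             # female is now the level coded 1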

The second point of emphasis is the thinking that surrounds the question of what it means to say that gender is “correlated” with salary, or that it can “predict” salary. Perhaps a better phrasing is that we have assessed whether gender is “associated” with salary. We have concluded that it is not, but what if it were? That could only mean that average salaries differ between females and males. And that sounds just like a two-independent-samples t-test situation, so let’s do that test.

t.test(salary~gender, var.equal=T, data=cohen1)
## 
## 	Two Sample t-test
## 
## data:  salary by gender
## t = -1.6153, df = 60, p-value = 0.1115
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8839.888   941.241
## sample estimates:
## mean in group female   mean in group male 
##             52650.00             56599.32

Note that the male mean is about $4000 higher than the female mean, but the difference was not significant. A closer examination of the output reveals an interesting correspondence: the df for the t-test, the t value itself (apart from sign), and the p value all match what was reported for the gendermale regression coefficient in the lm fit above. This is no coincidence; the “association” question approached with regression is tantamount to the mean-difference question posed by the independent-samples t-test. Soon enough, we will progress through the technical reasons that these are the same inference.
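
As a quick numeric check of this equivalence, the gendermale coefficient is exactly the male-minus-female difference in mean salary (a sketch):

with(cohen1, diff(tapply(salary, gender, mean)))  # 3949.32, the mean difference
coef(fit7)["gendermale"]                          # the same value from the lm fit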

For one final illustration in this chapter, we now build a regression model that adds gender to the earlier model that had degree_yrs and citations as IVs.

fit7b <- lm(salary~degree_yrs + cits + gender, data=cohen1)
summary(fit7b)
## 
## Call:
## lm(formula = salary ~ degree_yrs + cits + gender, data = cohen1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14214.6  -4590.3   -312.3   4014.9  21917.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38787.21    2478.78  15.648  < 2e-16 ***
## degree_yrs   1040.41     232.58   4.473 3.64e-05 ***
## cits          210.14      57.09   3.681 0.000512 ***
## gendermale    931.06    1859.70   0.501 0.618513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7088 on 58 degrees of freedom
## Multiple R-squared:  0.493,	Adjusted R-squared:  0.4667 
## F-statistic:  18.8 on 3 and 58 DF,  p-value: 1.227e-08

In this model, gender is again not a significant IV, and its regression coefficient is smaller than it was in the simple regression, indicating that any variance it shared with salary was at least partially confounded with degree_yrs and cits.

One final comment about this type of modeling with a variable such as gender: it might be an interesting research question whether the predictability that IVs such as degree_yrs and cits have for salary differs across the two levels of a categorical variable such as gender. In other words, does the influence of degree_yrs and cits depend on whether we examine males or females? This is what we will come to call an interaction question, and it can be modeled easily with lm. I will show the code and results for the three-IV model with all interactions, but will save comment until we cover the interaction topic more explicitly.

fit7c <- lm(salary~degree_yrs*cits*gender, data=cohen1)
summary(fit7c)
## 
## Call:
## lm(formula = salary ~ degree_yrs * cits * gender, data = cohen1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13966.3  -4404.6   -773.2   3657.9  22513.1 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                26877.05    8716.17   3.084  0.00322 **
## degree_yrs                  3715.49    1591.34   2.335  0.02330 * 
## cits                         632.12     243.37   2.597  0.01208 * 
## gendermale                 11287.69   10484.27   1.077  0.28643   
## degree_yrs:cits              -89.18      45.11  -1.977  0.05315 . 
## degree_yrs:gendermale      -2442.09    1793.28  -1.362  0.17892   
## cits:gendermale             -403.30     275.46  -1.464  0.14897   
## degree_yrs:cits:gendermale    86.42      47.53   1.818  0.07457 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7051 on 54 degrees of freedom
## Multiple R-squared:  0.5328,	Adjusted R-squared:  0.4722 
## F-statistic: 8.798 on 7 and 54 DF,  p-value: 3.563e-07