Chapter 13 Extensions to larger models and to categorical IVs
In this chapter, the goal is to extend the suite of methods to models with additional characteristics. This includes a model with more than two IVs, and toward the end we will also consider using a categorical IV. At the same time, this chapter emphasizes some strong capabilities of rmarkdown and the "tidyverse" for generating nicer tables. Some of that has been seen in other chapters with the kable and gt functions, but it is expanded here.
The capability of rmarkdown, bookdown, and knitr to produce a document such as this one is accompanied by increased options for formatting the output. The typical text output from R functions can be converted to styles/fonts that are more readable. A few brief examples are illustrated in this section. The kable function permits enhancement of tables such as those coming from summary and anova. Capabilities in the broom package also permit enhancement of the standard output coming from summary. The reader should recognize that whole documents/manuscripts can be written in rmarkdown, even in APA style.
13.1 Evaluate a 3-IV model
We will use the same data set that we used above, with the enhanced output and formatting, but extend the analysis to three IVs. The reader should note that adding a third variable is done simply in the model specification argument of the lm function, and that functions such as summary and anova or Anova are used the same way as previously executed:
fit6 <- lm(salary~pubs+cits+degree_yrs, data=cohen1) # add degree_yrs as a 3rd IV to our original pubs+cits model
summary(fit6)
##
## Call:
## lm(formula = salary ~ pubs + cits + degree_yrs, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13907.1 -4313.8 -649.5 4366.1 21165.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38967.85 2394.31 16.275 < 2e-16 ***
## pubs 93.61 85.35 1.097 0.277273
## cits 204.06 56.97 3.582 0.000699 ***
## degree_yrs 874.46 283.89 3.080 0.003160 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7031 on 58 degrees of freedom
## Multiple R-squared: 0.5011, Adjusted R-squared: 0.4753
## F-statistic: 19.42 on 3 and 58 DF, p-value: 7.727e-09
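The two ANOVA tables below come from chunks that are not echoed; presumably the calls were along these lines (Anova is from the car package; fit5 is just an illustrative name for the earlier two-IV model):

library(car)
Anova(fit6, type=3)                            # Type III tests for each IV
fit5 <- lm(salary~pubs+cits, data=cohen1)      # the earlier two-IV model
anova(fit5, fit6)                              # does adding degree_yrs improve the fit?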
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 1.3093e+10 1 264.8822 < 2.2e-16 ***
## pubs 5.9458e+07 1 1.2029 0.2772728
## cits 6.3412e+08 1 12.8291 0.0006989 ***
## degree_yrs 4.6897e+08 1 9.4878 0.0031599 **
## Residuals 2.8669e+09 58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: salary ~ pubs + cits
## Model 2: salary ~ pubs + cits + degree_yrs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 59 3335822387
## 2 58 2866853951 1 468968436 9.4878 0.00316 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is also useful to examine the Variance Inflation Factors in multiple regression models. Unlike the two-IV model, where both IVs necessarily have the same tolerance and thus the same VIF, each IV now has a different VIF since each has a different tolerance (VIF is the reciprocal of tolerance). Note that the IV with the smallest tolerance has the largest VIF.
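The chunk producing the VIF values below is not echoed; a minimal way to obtain them directly is with the vif function (the mrinfo output above credits HH::vif; car::vif returns the same values for a model such as this one):

library(HH)
vif(fit6)  # variance inflation factors; each equals 1/tolerance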
## pubs cits degree_yrs
## 1.763010 1.181128 1.820156
Use the ‘tidy’ function from the ‘broom’ package to produce an alternative to the ‘summary’ table, and format it with ‘kable’:
fit6.tidy <- tidy(fit6) # tidy produces a nicer table than summary
kable(fit6.tidy) # kable converts the table to markdown format that is nicer than the default
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 38967.84674 | 2394.30824 | 16.275201 | 0.0000000 |
pubs | 93.60794 | 85.34837 | 1.096775 | 0.2772728 |
cits | 204.06019 | 56.97179 | 3.581776 | 0.0006989 |
degree_yrs | 874.46140 | 283.89492 | 3.080229 | 0.0031599 |
I actually prefer using the gt function for tables whenever I can, but sometimes gt doesn't play nicely with rmarkdown when rendering to a pdf document. For example, using it in the code chunk here, with a tidy object, creates a problem. I show the code commented out so that you can see how to use it in other output formats.
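A minimal sketch of that commented usage, assuming only that gt is applied to the tidy object created above:

# library(gt)
# gt(fit6.tidy)  # renders nicely to html, but can fail when knitting to pdf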
Use the glance function from the 'broom' package to obtain some additional information on the model, and then use kable to format the output tables. Note that I've relabeled the components of the glance object to fit better with our typically used notation schemes. I also broke the table into two components to better fit on the page. That strategy produced a printing of the p value that lost the exponential notation (it shows as zero in the first table), and I've not yet figured out how to prevent that; it appears to be a rounding issue in kable.
#library(broom)
glancefit6a <- glance(fit6) # additional info on the model
glancefit6at1 <- glancefit6a[1:5]
#glancefit6at1
glancefit6at2 <- glancefit6a[6:11]
#glancefit6at2
kable(glancefit6at1,
col.names=c("R.squared", "adj.R.squared", "RMSE","F value", "p.value"))
R.squared | adj.R.squared | RMSE | F value | p.value |
---|---|---|---|---|
0.5011234 | 0.4753195 | 7030.542 | 19.42041 | 0 |
kable(glancefit6at2,
col.names=c("df regression","logLik", "AIC", "BIC", "deviance", "df residual"),digits=3)
df regression | logLik | AIC | BIC | deviance | df residual |
---|---|---|---|---|---|
3 | -635.104 | 1280.208 | 1290.844 | 2866853951 | 58 |
Notes on using glance:
- Users of the broom package should be aware that what glance previously reported as "df" was not the expected regression df (equal to the number of IVs). Instead it reported one more than that, including the df for the intercept. In the model above with 3 IVs, that apparently erroneous df was reported as 4, which is actually the rank of the design matrix. But the F test was always the standard MSreg/MSresidual, so it was confusing. That has been corrected in the version used here (ver 0.7.0), so the 3 df seen in the table above is the correct DFregression for a three-IV model. Users who installed the package prior to the correction of this "feature" should update their installation. (This document was created with R 4.0.2.)
- The standard glance table is too wide; it has too many columns/values. So I extracted two different subsets of its components and produced two separate tables with kable (the plain, unformatted prints are left commented out in the chunk).
- I also relabeled the columns of the first table, via the col.names argument to kable, to be more in line with our standard terminology (RMSE is root mean square error, or the square root of MS residual).
Next, we use ‘tidy’ and ‘kable’ to obtain a nicely formatted anova summary table:
tidyanova6 <- tidy(anova(fit6))
kable(tidyanova6,
col.names=c("Source", "df", "SS", "MS", "F value", "p value"))
Source | df | SS | MS | F value | p value |
---|---|---|---|---|---|
pubs | 1 | 1472195326 | 1472195326 | 29.784332 | 0.0000010 |
cits | 1 | 938602110 | 938602110 | 18.989081 | 0.0000544 |
degree_yrs | 1 | 468968436 | 468968436 | 9.487811 | 0.0031599 |
Residuals | 58 | 2866853951 | 49428516 | NA | NA |
Or use the ‘Anova’ function:
tidyAnova6 <- tidy(Anova(fit6, type=3))
kable(tidyAnova6,
col.names=c("Source", "SS", "df", "F value", "p value"))
Source | SS | df | F value | p value |
---|---|---|---|---|
(Intercept) | 13092731852 | 1 | 264.882153 | 0.0000000 |
pubs | 59458298 | 1 | 1.202915 | 0.2772728 |
cits | 634124345 | 1 | 12.829119 | 0.0006989 |
degree_yrs | 468968436 | 1 | 9.487811 | 0.0031599 |
Residuals | 2866853951 | 58 | NA | NA |
Now add information from the 'mrinfo' function that was discussed above, which extracts the additional quantities we need for a fuller analysis than summary/anova/Anova provide.
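The call itself is not echoed; it is presumably just the following (assuming mrinfo takes the fitted lm object, as when the function was introduced):

mrinfo(fit6)  # beta weights, structure/partial/semipartial rs, commonality fractions, VIFs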
## [[1]]
## NULL
##
## $`supplemental information`
## beta wt structure r partial r semipartial r tolerances unique
## pubs 0.13506 0.71500 0.14254 0.10172 0.56721 0.01035
## cits 0.36102 0.77662 0.42559 0.33219 0.84665 0.11035
## degree_yrs 0.38541 0.85873 0.37495 0.28567 0.54940 0.08161
## common total
## pubs 0.24584 0.25618
## cits 0.19190 0.30224
## degree_yrs 0.28793 0.36954
##
## $`var infl factors from HH:vif`
## pubs cits degree_yrs
## 1.763010 1.181128 1.820156
##
## [[4]]
## NULL
13.1.1 Conclusions from the three-IV model
Several interesting outcomes emerged with the addition of degree_yrs to the model that already included pubs and cits. The multiple R-squared increased to about .50, so degree_yrs did appear to improve predictability. But an interesting thing happened to the effect of pubs in this three-IV model. The regression coefficient was about 251 in the two-IV model, but it dropped to about 94 in this three-IV model. Pubs was not a significant predictor in the three-IV model, and its unique fraction of variance dropped to about 1% of the DV variation. Most of the overlap of pubs with salary was shared with the other IVs (the common fraction of variance was about .25, nearly all of its zero-order overlap with salary). This suggests that degree_yrs is strongly correlated with pubs. Let's examine that correlation.
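The correlation printed below was presumably obtained with something like:

cor(cohen1$pubs, cohen1$degree_yrs)  # zero-order correlation of pubs with degree_yrs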
## [1] 0.6505472
A better model might be one that simply leaves pubs out since degree_yrs absorbed most of its shared variance.
fit6b <- lm(salary~cits+degree_yrs,data=cohen1)
fit6b.tidy <- tidy(fit6b) # tidy produces a nicer table than summary
kable(fit6b.tidy)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 39073.6747 | 2396.47360 | 16.304655 | 0.0000000 |
cits | 212.1116 | 56.59392 | 3.747958 | 0.0004078 |
degree_yrs | 1061.7642 | 227.17563 | 4.673759 | 0.0000176 |
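The supplemental information shown below presumably comes from the same mrinfo call, now applied to the revised model:

mrinfo(fit6b)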
## [[1]]
## NULL
##
## $`supplemental information`
## beta wt structure r partial r semipartial r tolerances unique
## cits 0.37526 0.78476 0.43852 0.3482 0.86094 0.12124
## degree_yrs 0.46796 0.86773 0.51981 0.4342 0.86094 0.18853
## common total
## cits 0.181 0.30224
## degree_yrs 0.181 0.36954
##
## $`var infl factors from HH:vif`
## cits degree_yrs
## 1.161517 1.161517
##
## [[4]]
## NULL
glancefit6b <- glance(fit6b)
glancefit6bt1 <- glancefit6b[1:5]
#glancefit6bt1
glancefit6bt2 <- glancefit6b[6:11]
#glancefit6bt2
kable(glancefit6bt1,
col.names=c("R.squared", "adj.R.squared", "RMSE","F value", "p.value"))
R.squared | adj.R.squared | RMSE | F value | p.value |
---|---|---|---|---|
0.4907768 | 0.473515 | 7042.621 | 28.43137 | 0 |
kable(glancefit6bt2,
col.names=c("df regression","logLik", "AIC", "BIC", "deviance", "df residual"),digits=3)
df regression | logLik | AIC | BIC | deviance | df residual |
---|---|---|---|---|---|
2 | -635.74 | 1279.481 | 1287.989 | 2926312249 | 59 |
The revised two-IV model with degree_yrs and cits as IVs had an R-squared nearly as large as the three-IV model, and the AIC/BIC indices were slightly smaller, indicating a better model. Both degree_yrs and cits were significant IVs as tested by the t-tests. Thus the most parsimonious model might be this revised two-IV model. Perhaps pubs was only a good IV because it was so highly correlated with degree_yrs, and it is the latter that is the more important predictor.
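A quick side-by-side comparison of the information criteria for the two candidate models (a sketch; AIC and BIC in base R accept multiple fitted models):

AIC(fit6, fit6b)  # fit6b has the lower (better) value
BIC(fit6, fit6b)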
13.2 Categorical IVs
We can do a quick preview of using categorical IVs in linear models. The cohen data set also has a categorical variable, gender, that could potentially serve as an IV. However, it is a factor, and its values in the data set are the strings "female" and "male". How could that be used as an IV in regression? The answer is that factors can be recoded to numeric values with specially constructed codes. For a factor such as gender with only two levels, the code is simply a one and a zero. This is already built into the characteristics of the factor that is stored in the data frame. These numeric codes are called contrasts in R:
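The contrast matrix printed below can be requested directly with the base R contrasts function:

contrasts(cohen1$gender)  # default treatment coding: female = 0, male = 1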
## male
## female 0
## male 1
With that knowledge, we now see that lm can handle such categorical variables by using the numeric contrast code. First I will illustrate with a simple regression using only gender as the IV.
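The chunk is not echoed; it is presumably equivalent to this (fit7 is just an illustrative name):

fit7 <- lm(salary~gender, data=cohen1)  # gender enters via its 0/1 contrast code
summary(fit7)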
##
## Call:
## lm(formula = salary ~ gender, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18660.3 -5819.8 -3.7 4769.2 26903.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52650 1810 29.080 <2e-16 ***
## gendermale 3949 2445 1.615 0.111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9580 on 60 degrees of freedom
## Multiple R-squared: 0.04168, Adjusted R-squared: 0.0257
## F-statistic: 2.609 on 1 and 60 DF, p-value: 0.1115
It appears that gender is not a strong predictor of salary (R-squared is small, and the t-test is NS). But let's focus on two additional points. One is notation. Observe that the name of the gender variable in the coefficients table is not simply gender, the IV name; rather, it appears as "gendermale". That notation reminds us that gender is a factor, and the reason it is called gendermale rather than genderfemale is that male was coded with a 1, as seen above. The zero and one assignments are arbitrary and could be reversed, as the sketch below shows. We will soon extend this conversation with more complex categorical IVs that have more than two categories.
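A quick sketch of reversing the coding (relevel is base R; gender2 is a hypothetical new variable):

cohen1$gender2 <- relevel(cohen1$gender, ref="male")  # male becomes the reference (0) level
contrasts(cohen1$gender2)                             # female is now coded 1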
The second point of emphasis is the thinking that surrounds the question of what it means to say that gender is "correlated" with salary, or that it can "predict" salary. Perhaps a better phrasing is that we have assessed whether gender is "associated" with salary. We have concluded that it is not, but what if it were? That could only mean that average salaries differ between females and males. And that sounds just like a two-independent-samples t-test situation. So let's do that test.
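The output below is labeled "Two Sample t-test" rather than the default Welch test, so the call presumably used var.equal=TRUE:

t.test(salary~gender, data=cohen1, var.equal=TRUE)  # pooled-variance t-test, matching the lm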
##
## Two Sample t-test
##
## data: salary by gender
## t = -1.6153, df = 60, p-value = 0.1115
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8839.888 941.241
## sample estimates:
## mean in group female mean in group male
## 52650.00 56599.32
Note that the male mean is about $4000 higher than the female mean, but the difference was not significant. A closer examination of the output reveals interesting coincidences: the df for the t, the t-value itself, and the p value all match what was reported for the regression coefficient in the lm fit above. This is no coincidence, because the "association" question approached with regression is tantamount to the mean-difference question posed by the independent-samples t-test. Soon enough, we will work through the technical reasons that these are the same inference.
For one final illustration in this chapter, we now build a regression model that adds gender to the earlier model that had degree_yrs and citations as IVs.
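Presumably (fit8 is an illustrative name):

fit8 <- lm(salary~degree_yrs+cits+gender, data=cohen1)
summary(fit8)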
##
## Call:
## lm(formula = salary ~ degree_yrs + cits + gender, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14214.6 -4590.3 -312.3 4014.9 21917.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38787.21 2478.78 15.648 < 2e-16 ***
## degree_yrs 1040.41 232.58 4.473 3.64e-05 ***
## cits 210.14 57.09 3.681 0.000512 ***
## gendermale 931.06 1859.70 0.501 0.618513
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7088 on 58 degrees of freedom
## Multiple R-squared: 0.493, Adjusted R-squared: 0.4667
## F-statistic: 18.8 on 3 and 58 DF, p-value: 1.227e-08
In this model gender is again not a significant IV, and its regression coefficient is smaller than it was in the simple regression, indicating that any variance it shared with salary was at least partially confounded with degree_yrs and cits.
One final comment about this type of modeling with a variable such as gender: it might be an interesting research question whether the predictability that IVs such as degree_yrs and cits have on salary differs across the two levels of a categorical variable such as gender. In other words, does the influence of degree_yrs and cits depend on whether we examine males or females? This is what we will come to call an interaction question, and it can be modeled easily with lm. I will show the code and results for the three-IV model with all interactions, but will save comment until we cover the interaction topic more explicitly.
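A sketch of the call (the * operator expands to all main effects plus all two- and three-way interactions; fit9 is an illustrative name):

fit9 <- lm(salary~degree_yrs*cits*gender, data=cohen1)  # all interactions included
summary(fit9)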
##
## Call:
## lm(formula = salary ~ degree_yrs * cits * gender, data = cohen1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13966.3 -4404.6 -773.2 3657.9 22513.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26877.05 8716.17 3.084 0.00322 **
## degree_yrs 3715.49 1591.34 2.335 0.02330 *
## cits 632.12 243.37 2.597 0.01208 *
## gendermale 11287.69 10484.27 1.077 0.28643
## degree_yrs:cits -89.18 45.11 -1.977 0.05315 .
## degree_yrs:gendermale -2442.09 1793.28 -1.362 0.17892
## cits:gendermale -403.30 275.46 -1.464 0.14897
## degree_yrs:cits:gendermale 86.42 47.53 1.818 0.07457 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7051 on 54 degrees of freedom
## Multiple R-squared: 0.5328, Adjusted R-squared: 0.4722
## F-statistic: 8.798 on 7 and 54 DF, p-value: 3.563e-07