3  Bivariate Correlation and Linear Regression

These bivariate analyses are done largely to reiterate the approaches we established earlier for two-variable systems (i.e., simple regression and bivariate correlation). The R implementations are identical to what was covered earlier, except that with more variables there are more pairs of variables and more than one possible simple regression.

This review establishes benchmark values to compare with the approaches we took in SPSS, and reminds us of the meaning of a simple regression coefficient so that its value can be compared to the coefficients found for the same variables in the multiple regression section that follows this bivariate section.

The sequence of our approach here is to:

  1. obtain the variance-covariance matrix for our three variables.
  2. obtain the zero-order bivariate correlation matrix (Pearson product-moment correlations).
  3. test the null hypotheses that the three population correlations are zero.
  4. obtain the two different simple regressions of salary on each of the two IVs.
  5. graphically evaluate the assumptions of inference in simple regression.

3.1 Bivariate computations: Covariances and Pearson product-moment correlations

The analyses requested here repeat and extend the R implementation approach we learned for simple regression.

First, obtain variance-covariance and correlation matrices for the three variables of interest. Recall that we can submit a whole data frame or matrix of data to cov and cor. That is done here along with subsetting the data frame to use only the three variables for our defined analyses.

cov produces a matrix with variances on the leading diagonal and covariances in the upper and lower triangles:

cov(cohen1[,3:5])
             pubs       cits      salary
pubs     196.1155    80.1724    68797.68
cits      80.1724   294.8662    91628.81
salary 68797.6830 91628.8096 94206882.35

cor produces a symmetrical matrix of Pearson correlation coefficients. Here the coefficients are rounded to two places:

corr1 <- cor(cohen1[,3:5], use="complete.obs", method="pearson")
round(corr1,2)
       pubs cits salary
pubs   1.00 0.33   0.51
cits   0.33 1.00   0.55
salary 0.51 0.55   1.00
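
The correlation matrix is just the covariance matrix rescaled: each covariance is divided by the product of the two variables' standard deviations (the square roots of the diagonal variances). As a minimal optional check (assuming no missing cases), base R's cov2cor function applies that rescaling to the whole covariance matrix and should reproduce the matrix above.

# optional check: rescale the covariance matrix into the correlation matrix
# r = cov(x,y) / (sd(x)*sd(y)); cov2cor applies this to every pair at once
round(cov2cor(cov(cohen1[,3:5])), 2)
# e.g., for pubs and salary: 68797.68/(sqrt(196.1155)*sqrt(94206882.35)) is about .51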

Now we will test each of the three zero-order correlations with the standard t test on n-2 df:

cor.test(pubs,salary, method = "pearson", alternative = "two.sided",
         exact = FALSE) 

    Pearson's product-moment correlation

data:  pubs and salary
t = 4.5459, df = 60, p-value = 2.706e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2934802 0.6710778
sample estimates:
      cor 
0.5061468 
cor.test(pubs,cits, method = "pearson", alternative = "two.sided",
         exact = FALSE) 

    Pearson's product-moment correlation

data:  pubs and cits
t = 2.7392, df = 60, p-value = 0.008098
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.09122073 0.53833356
sample estimates:
      cor 
0.3333929 
cor.test(cits,salary, method = "pearson", alternative = "two.sided",
         exact = FALSE) 

    Pearson's product-moment correlation

data:  cits and salary
t = 5.098, df = 60, p-value = 3.69e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3477491 0.7030024
sample estimates:
      cor 
0.5497664 
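
Each of these t statistics is just the sample correlation converted with the familiar formula t = r*sqrt(n-2)/sqrt(1-r^2), tested on n-2 df. As a minimal sketch, a hand computation for the pubs-salary pair (using the r and n reported above) reproduces the t and p values given by cor.test.

# hand computation of the correlation t test, df = n - 2
r <- 0.5061468                     # pubs-salary correlation from the output above
n <- 62
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                             # approximately 4.546
2 * pt(-abs(t_stat), df = n - 2)   # two-tailed p value, approximately 2.7e-05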

3.2 Test correlations with the corr.test function

A more efficient way of testing all of the correlations in a correlation matrix is to use the corr.test and corr.p functions from the psych package. This method also provides the ability to adjust p values for multiple testing (using the Holm method) and to produce confidence intervals. The downside of these functions is that p values are rounded inappropriately, sometimes to zero, and without exponential notation. An alternative is the rcorr function from the Hmisc package, but it has the same p value rounding issue.

The commented adjust.corr function line from bcdstats will produce more interpretable output, especially with p values not rounded to zero.

ct1 <- psych::corr.test(cohen1[,3:5])
print(psych::corr.p(ct1$r, n=62), short=FALSE)
Call:psych::corr.p(r = ct1$r, n = 62)
Correlation matrix 
       pubs cits salary
pubs   1.00 0.33   0.51
cits   0.33 1.00   0.55
salary 0.51 0.55   1.00
Sample Size 
[1] 62
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
       pubs cits salary
pubs   0.00 0.01      0
cits   0.01 0.00      0
salary 0.00 0.00      0

 Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
           lower    r upper    p
pubs-cits   0.09 0.33  0.54 0.01
pubs-salry  0.29 0.51  0.67 0.00
cits-salry  0.35 0.55  0.70 0.00
#bcdstats::adjust.corr(cohen1[,3:5])
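
The p values above the diagonal in the corr.test output were adjusted with the Holm step-down method. As a minimal sketch, the same adjustment can be applied with base R's p.adjust to the three unadjusted p values reported by cor.test earlier; the results match the (rounded) adjusted entries above the diagonal.

# Holm adjustment of the unadjusted p values from the cor.test runs above
p_raw <- c(pubs_cits = 0.008098, pubs_salary = 2.706e-05, cits_salary = 3.69e-06)
p.adjust(p_raw, method = "holm")
# the smallest p is multiplied by 3, the next by 2, and the largest by 1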

3.3 Simple regressions of Salary on each of the two IVs

Next, use the ‘lm’ function from R’s base installation to do linear modeling. Initially, let’s do the two simple regressions of salary (the DV) on each of the IVs separately. We do this partly for comparability to how we approached the multiple regression problem in its conceptual development, but mostly to have the simple regression coefficients in hand so that we can compare them to the coefficients for the same IVs when both are included in the two-IV model.
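
It may also help to recall that a simple regression slope is the covariance of the IV and DV divided by the variance of the IV. As a minimal sketch using the quantities in the covariance matrix above, the hand-computed slopes should match the coefficients reported by summary(fit1) and summary(fit2) below.

# simple regression slope by hand: b1 = cov(x, y) / var(x)
with(cohen1, cov(pubs, salary) / var(pubs))   # about 350.8, the pubs slope below
with(cohen1, cov(cits, salary) / var(cits))   # about 310.7, the cits slope below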

First, we detach the cohen1 data frame so that we can use the “data” argument in the lm function, a better practice.

detach(cohen1)
#produce the two single-IV models
fit1 <- lm(salary~pubs, data=cohen1)
fit2 <- lm(salary~cits, data=cohen1)
# obtain relevant inferences and CI's for fit1 with pubs as IV
summary(fit1)

Call:
lm(formula = salary ~ pubs, data = cohen1)

Residuals:
     Min       1Q   Median       3Q      Max 
-20660.0  -7397.5    333.7   5313.9  19238.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 48439.09    1765.42  27.438  < 2e-16 ***
pubs          350.80      77.17   4.546 2.71e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8440 on 60 degrees of freedom
Multiple R-squared:  0.2562,    Adjusted R-squared:  0.2438 
F-statistic: 20.67 on 1 and 60 DF,  p-value: 2.706e-05
confint(fit1, level=0.95) # CIs for model parameters 
                2.5 %     97.5 %
(Intercept) 44907.729 51970.4450
pubs          196.441   505.1625
anova(fit1)
Analysis of Variance Table

Response: salary
          Df     Sum Sq    Mean Sq F value    Pr(>F)    
pubs       1 1472195326 1472195326  20.665 2.706e-05 ***
Residuals 60 4274424497   71240408                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(fit1,type="III") # note this is Anova, not anova
Anova Table (Type III tests)

Response: salary
                Sum Sq Df F value    Pr(>F)    
(Intercept) 5.3632e+10  1 752.831 < 2.2e-16 ***
pubs        1.4722e+09  1  20.665 2.706e-05 ***
Residuals   4.2744e+09 60                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# obtain relevant inferences and CI's for fit2 with cits as IV

summary(fit2)

Call:
lm(formula = salary ~ cits, data = cohen1)

Residuals:
     Min       1Q   Median       3Q      Max 
-16462.0  -5822.4   -884.1   5461.7  24096.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42315.71    2662.69  15.892  < 2e-16 ***
cits          310.75      60.95   5.098 3.69e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8175 on 60 degrees of freedom
Multiple R-squared:  0.3022,    Adjusted R-squared:  0.2906 
F-statistic: 25.99 on 1 and 60 DF,  p-value: 3.69e-06
confint(fit2, level=0.95) # CIs for model parameters 
                 2.5 %     97.5 %
(Intercept) 36989.5395 47641.8738
cits          188.8201   432.6741
anova(fit2)
Analysis of Variance Table

Response: salary
          Df     Sum Sq    Mean Sq F value   Pr(>F)    
cits       1 1736876419 1736876419   25.99 3.69e-06 ***
Residuals 60 4009743405   66829057                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(fit2,type="III")# note this is Anova, not anova
Anova Table (Type III tests)

Response: salary
                Sum Sq Df F value    Pr(>F)    
(Intercept) 1.6878e+10  1  252.56 < 2.2e-16 ***
cits        1.7369e+09  1   25.99  3.69e-06 ***
Residuals   4.0097e+09 60                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary of findings from bivariate analyses/inferences

  1. Notice in the above analyses that each regression coefficient has a t value equivalent to the one found for the analogous bivariate correlation test. This is not a coincidence; make sure you still recall why this is the case.
  2. In addition, the R-squareds reported by the ‘summary’ function are the squares of the zero-order correlations, and the F tests are the squares of the t values for testing the regression coefficients (see the short check after this list).
  3. The ANOVA tables produced by anova and Anova do not give SS total. SS total is simply the SS for salary, the DV, and can be found in other ways (e.g., var(salary)*(n-1)). This approach is used below in the multiple regression section.
  4. The SS regression for each of the fits is labeled as the SS for that IV rather than being called SS regression as we have done previously. Make sure that you understand this equivalence and the labeling choice made in R.
  5. Compare the SS for each IV in the anova table with the comparable value in the Anova table (Anova, with an upper-case A, comes from the car package). Ignoring the fact that Anova uses exponential notation, these SS regression values are the same for the two different ANOVA functions. This is the expected outcome in simple regression, but they will not always be equivalent for a particular IV in multiple regression. We explore this below, where the differences between anova and Anova become important in multiple regression.
  6. The SS residual value for each IV is the same in the anova and Anova output for each of the two fits. Again, this is expected in simple regression because the SS regression and SS residual must sum to SS total. This point is emphasized here because it is one area where numerical and conceptual differences occur in multiple regression (see below), compared to these simple regressions.
  7. Finally, notice that the F test values obtained by either the anova function or the Anova function (upper/lower case A) are identical. This is established now for comparison to what happens when we have two IVs below.
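
As a minimal check on points 2 and 3 (assuming cohen1 contains the full n = 62 cases used above), squaring the zero-order correlation and its t statistic reproduces the R-squared and F for fit1, and var(salary)*(n-1) reproduces SS total:

# point 2: R-squared is the squared zero-order correlation; F is the squared t
cor(cohen1$pubs, cohen1$salary)^2                  # about .256, the R-squared for fit1
cor.test(cohen1$pubs, cohen1$salary)$statistic^2   # about 20.67, the F for fit1
# point 3: SS total is the SS for salary, var(salary)*(n-1)
var(cohen1$salary) * (nrow(cohen1) - 1)            # equals SS pubs + SS Residuals in the fit1 tables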

3.3.1 Diagnostic plots for simple regressions

Next, we want to obtain the base set of diagnostic plots that R makes available for ‘lm’ objects. This is the same set that we reviewed in the simple regression section of the course plus an additional normal probability plot of the residuals.

First for fit1:

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(fit1)

I prefer to draw the Q-Q plot with the qqPlot function from the car package since it provides confidence bounds. See Fox and Weisberg (2011).

car::qqPlot(fit1, distribution="norm", id=FALSE)

Next, the same plots for fit2:

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(fit2)

And the preferred qqplot for fit2:

qqPlot(fit2, distribution="norm", id=FALSE)

Since the tests of these two simple regression fits would not typically be used, the diagnostic plots would also not typically be of interest. The analyses seen above are largely here to put in place the approaches and numeric values needed for comparison with the more interesting multiple regression results found below.