16  Regression

Regression is a common statistical technique used to make generalizations about data as well as predictions from it.

Note

The purpose of this chapter is to give readers a high-level overview of how to apply regression techniques in R rather than a full introduction to regression itself. Interested readers will find several comprehensive references in the resources section at the end of the chapter.

16.1 Linear Regression

The first kind of regression we’ll cover is linear regression. Linear regression uses your data to produce a linear model that describes the general trend of the data. Generally speaking, a linear model consists of a dependent variable (y), at least one independent variable (x), a coefficient for each independent variable, and an intercept. Here’s one common linear model you may remember:

\[ y = mx + b \]

This is the simple linear model many people begin with, where x and y are the independent and dependent variables, respectively, m is the slope (the coefficient of x), and b is the intercept.

To perform linear regression in R, you’ll use the “lm” function. Let’s try it out on the “faithful” dataset.

head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

The “lm” function accepts a formula as its first argument, which relates “y” to “x” in this format:

lm(y ~ x)

Let’s try this out by setting the y variable to eruptions and the x variable to waiting. We can then view the output of our linear model by using the “summary” function.

lm <- lm(faithful$eruptions ~ faithful$waiting)
summary(lm)

Call:
lm(formula = faithful$eruptions ~ faithful$waiting)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.874016   0.160143  -11.70   <2e-16 ***
faithful$waiting  0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

This summary shows us the statistical significance of our model along with the statistics needed to interpret that significance. Additionally, we now have our model coefficients. From this summary we can see that our model looks something like this:

\[ eruptions = waiting * 0.075628 - 1.874016 \]
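Rather than reading these numbers off the summary, you can pull them straight from the fitted model with the “coef” function. A minimal sketch, reusing the lm object fitted above (the 80-minute waiting time and the cf name are made up for illustration):

# coefficients of the fitted model (intercept first, then the slope for waiting)
cf <- coef(lm)
cf

# predicted eruption length (in minutes) for an 80-minute wait, computed by hand
cf[1] + cf[2] * 80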

Let’s break down everything that the model summary returns.

  • The “Call” section shows the call that created the model.
  • The “Residuals” section gives you a summary of all of your model residuals. Simply put, a residual is how far a given observed value falls from the value the model predicts.
  • The “Coefficients” section gives us our model coefficients, our intercept, and statistical values to determine their significance.
    • For each coefficient, we are given the respective standard error. The standard error measures the precision of the coefficient’s estimate.
    • Next, we have a t value for each coefficient. The t value is calculated by dividing the coefficient by its standard error.
    • Finally, you have the p-value accompanied by symbols that denote the corresponding significance level.
  • The residual standard error measures the standard deviation of the residuals. It is calculated by dividing the residual sum of squares by the residual degrees of freedom and taking the square root, where the residual degrees of freedom equal the number of observations minus the number of predictors minus one (here, 272 − 1 − 1 = 270).
  • R-squared gives you the proportion of variance that can be explained by your model. The adjusted R-squared tells you the same thing but adjusts for the number of variables you’ve included in your model.
  • The F-statistic tests whether all of your model coefficients (other than the intercept) are actually equal to zero; its very small p-value here suggests they are not.
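Once you’re comfortable with the summary, you can use the model to make predictions with the “predict” function. Because we fit the model with faithful$waiting directly in the formula, predict can’t easily accept new data, so this minimal sketch refits the model using the data argument; the fit and new_waits names, and the waiting times themselves, are made up for illustration:

# refit the same model using the data argument so predict() can take new data
fit <- lm(eruptions ~ waiting, data = faithful)

# hypothetical waiting times we would like predictions for
new_waits <- data.frame(waiting = c(55, 70, 85))

# predicted eruption lengths, one per row of new_waits
predict(fit, newdata = new_waits)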

16.2 Multiple Regression

If you have more x variables you want to add to your linear model, you can add them to the formula just as you would add terms to any other equation. Here’s an example:

lm(data$y ~ data$x1 + data$x2 + data$x3 + data$x4)

Additionally, you can use the “data” parameter rather than putting the name of the dataset before every variable.

lm(y ~ x1 + x2 + x3 + x4, data = data)

Let’s try a real example with the mtcars dataset.

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now, let’s try to predict mpg using every other column as a variable and see what the results look like.

lm <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
         data = mtcars)
summary(lm)

Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
    am + gear + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

From here, you would likely tweak your model further based on these significance statistics; however, that’s outside the scope of this book. Take a look at the resources section at the end of this chapter to dive deeper into developing regression models.
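As one taste of what that tweaking can look like, base R’s “step” function will drop (or add) variables one at a time based on AIC. A minimal sketch using the lm object fitted above; reduced is a made-up name, and stepwise selection is only one of many possible approaches:

# backward stepwise selection: drop predictors one at a time while AIC improves
reduced <- step(lm, direction = "backward", trace = 0)
summary(reduced)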

16.3 Logistic Regression

Logistic regression is commonly used when your dependent variable (y) is binary (0 or 1). Instead of the “lm” function, you will use the “glm” function with the family set to binomial. Let’s try this out on the mtcars dataset again, but this time with “am” (transmission type; 0 = automatic, 1 = manual) as the dependent variable.

glm <- glm(am ~ cyl + hp + wt, family = binomial, data = mtcars)
summary(glm)

Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.17272  -0.14907  -0.01464   0.14116   1.27641  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491  
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.2297  on 31  degrees of freedom
Residual deviance:  9.8415  on 28  degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8
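By default, the coefficients of a logistic model are on the log-odds scale. To turn the fitted model into predicted probabilities, and then into 0/1 predictions, you can use “predict” with type = "response". A minimal sketch using the glm object above; the 0.5 cutoff and the probs and preds names are arbitrary choices for illustration:

# predicted probability that each car has a manual transmission (am = 1)
probs <- predict(glm, type = "response")

# convert probabilities to 0/1 predictions using a 0.5 cutoff
preds <- ifelse(probs > 0.5, 1, 0)

# compare the model's predictions to the actual am values
table(predicted = preds, actual = mtcars$am)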

16.4 Resources