16 Regression
Regression is a common statistical technique used to make generalizations about, and predictions from, data.
16.1 Linear Regression
The first kind of regression we’ll cover is linear regression. Linear regression will use your data to come up with a linear model that describes the general trend of your data. Generally speaking, a linear model will consist of a dependent variable (y), at least one independent variable (x), coefficients to go along with each independent variable, and an intercept. Here’s one common linear model you may remember:
\[ y = mx + b \]
This is a simple linear model many people begin with where x and y are the independent and dependent variables, respectively, m is the slope (or coefficient of x), and b is the intercept.
To perform linear regression in R, you’ll use the “lm” function. Let’s try it out on the “faithful” dataset.
head(faithful)

eruptions | waiting |
---|---|
3.600 | 79 |
1.800 | 54 |
3.333 | 74 |
2.283 | 62 |
4.533 | 85 |
2.883 | 55 |
The “lm” function accepts a formula that relates “y” to “x” (along with optional arguments such as “data”) in this format:
lm(y ~ x)
Let’s try this out by setting the y variable to eruptions and the x variable to waiting. We can then view the output of our linear model by using the “summary” function.
lm <- lm(faithful$eruptions ~ faithful$waiting)
summary(lm)
Call:
lm(formula = faithful$eruptions ~ faithful$waiting)
Residuals:
Min 1Q Median 3Q Max
-1.29917 -0.37689 0.03508 0.34909 1.19329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
faithful$waiting 0.075628 0.002219 34.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
This summary shows us the statistical significance of our model along with the statistics we need to interpret that significance correctly. It also gives us our model coefficients, so we can see that our fitted model looks like this:
\[ eruptions = 0.075628 \times waiting - 1.874016 \]
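As a quick check, you can pull the fitted intercept and slope out of the model object with the “coef” function and plug in a waiting time by hand, or draw the fitted line over the data with “abline”. Here is a minimal sketch; the waiting time of 80 minutes is just an illustrative value:

b <- coef(lm)[1]  # intercept (-1.874016)
m <- coef(lm)[2]  # coefficient of waiting (0.075628)

# Predicted eruption length (in minutes) for an 80-minute wait
m * 80 + b

# Scatterplot of the data with the fitted regression line drawn on top
plot(faithful$waiting, faithful$eruptions)
abline(lm)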
Let’s break down everything that the model summary returns (a short code sketch after this list shows how to pull each of these values out of the summary object).
- The “Call” section echoes the call you used to create the model.
- The “Residuals” section gives you a summary of all of your model residuals. Simply put, a residual is the difference between an observed value and the value the model predicts for it.
- The “Coefficients” section gives us our model coefficients, our intercept, and statistical values to determine their significance.
- For each coefficient, we are given the respective standard error. The standard error measures the precision of each coefficient’s estimate.
- Next, we have a t value for each coefficient. The t value is calculated by dividing the coefficient by the standard error.
- Finally, you have the p value accompanied by symbols to denote the corresponding significance level.
- The residual standard error measures the standard deviation of the residuals. It is calculated by dividing the residual sum of squares by the residual degrees of freedom and taking the square root of the result, where the residual degrees of freedom equals the total number of observations minus the number of model parameters minus 1.
- R-squared gives you the proportion of variance that can be explained by your model. Your adjusted R-squared statistic will tell you the same thing but will adjust for the number of variables you’ve included in your model.
- The F-statistic tests the null hypothesis that all of your model coefficients (other than the intercept) are equal to zero; its p-value gives the probability of seeing an F-statistic at least this large if that hypothesis were true.
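If you would rather work with these numbers directly than read them off the printout, everything above is stored in the object that “summary” returns. Here is a minimal sketch (the variable name “s” is just for illustration):

s <- summary(lm)

s$coefficients      # estimates, standard errors, t values, and p values
s$sigma             # residual standard error
sqrt(sum(resid(lm)^2) / df.residual(lm))  # the same value computed from the residuals
s$r.squared         # multiple R-squared
s$adj.r.squared     # adjusted R-squared
s$fstatistic        # F-statistic with its degrees of freedom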
16.2 Multiple Regression
If you had more x variables you wanted to add to your linear model, you could add them just like you would in any other math equation. Here’s an example:
lm(data$y ~ data$x1 + data$x2 + data$x3 + data$x4)
Additionally, you can use the “data” parameter rather than putting the name of the dataset before every variable.
lm(y ~ x1 + x2 + x3 + x4, data = data)
Let’s try a real example with the mtcars dataset.
head(mtcars)
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Now, let’s try to predict mpg using every other column as a predictor and see what the results look like.
lm <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
         data = mtcars)
summary(lm)
Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
am + gear + carb, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337 18.71788 0.657 0.5181
cyl -0.11144 1.04502 -0.107 0.9161
disp 0.01334 0.01786 0.747 0.4635
hp -0.02148 0.02177 -0.987 0.3350
drat 0.78711 1.63537 0.481 0.6353
wt -3.71530 1.89441 -1.961 0.0633 .
qsec 0.82104 0.73084 1.123 0.2739
vs 0.31776 2.10451 0.151 0.8814
am 2.52023 2.05665 1.225 0.2340
gear 0.65541 1.49326 0.439 0.6652
carb -0.19942 0.82875 -0.241 0.8122
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
From here, you would likely tweak your model further based on the significance statistics we see here; however, that’s outside the scope of what we’re doing in this book. Take a look in the resources section at the end of this chapter to dive deeper into developing regression models.
16.3 Logistic Regression
Logistic regression is commonly used when your dependent variable (y) is binomial (0 or 1). Instead of the “lm” function, though, you will use the “glm” function. Let’s try this out on the mtcars dataset again, but this time with “am” as the dependent variable.
glm <- glm(am ~ cyl + hp + wt, family = binomial, data = mtcars)
summary(glm)
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
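The coefficients reported by “glm” with a binomial family are on the log-odds scale, so they are harder to read directly than the ones “lm” gives you. A common next step is to ask the model for predicted probabilities using the “predict” function with type = "response"; the 0.5 cutoff below is just a conventional choice for illustration:

probs <- predict(glm, type = "response")  # predicted probability that am = 1 for each car
head(round(probs, 3))

preds <- ifelse(probs > 0.5, 1, 0)        # convert probabilities into 0/1 predictions
table(predicted = preds, actual = mtcars$am)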
16.4 Resources
“Lecture 9 - Linear regression in R” by Professor Alexandra Chouldechova at Carnegie Mellon University: https://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture09/lecture09-94842.html
“Logistic Regression” by Erin Bugbee and Jared Wilber: https://mlu-explain.github.io/logistic-regression/
“Visualizing OLS Linear Regression Assumptions in R” by Trevor French: https://medium.com/trevor-french/visualizing-ols-linear-regression-assumptions-in-r-e762ba7afaff