16  Regression

Regression is a common statistical technique used to make generalizations about data as well as predictions from it.

Note

The purpose of this chapter is to give readers a high-level overview of how to apply regression techniques in R rather than a full introduction to regression itself. Interested readers will find several comprehensive references in the resources section at the end of the chapter.

16.1 Linear Regression

The first kind of regression we’ll cover is linear regression. Linear regression uses your data to produce a linear model that describes the general trend of the data. Generally speaking, a linear model consists of a dependent variable (y), at least one independent variable (x), a coefficient for each independent variable, and an intercept. Here’s one common linear model you may remember:

\[ y = mx + b \]

This is the simple linear model many people begin with, where x and y are the independent and dependent variables, respectively, m is the slope (the coefficient of x), and b is the intercept.

To perform linear regression in R, you’ll use the “lm” function. Let’s try it out on the “faithful” dataset.

head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

The “lm” function accepts a formula as its first argument, which relates “y” to “x” in this format:

lm(y ~ x)

Let’s try this out by setting the y variable to eruptions and the x variable to waiting. We can then view the output of our linear model by using the “summary” function.

lm <- lm(faithful$eruptions ~ faithful$waiting)
summary(lm)

Call:
lm(formula = faithful$eruptions ~ faithful$waiting)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.874016   0.160143  -11.70   <2e-16 ***
faithful$waiting  0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

This summary shows us the statistical significance of our model along with the statistics needed to interpret that significance. Additionally, we now have our model coefficients. From this summary we can see that our model looks something like this:

\[ eruptions = waiting * 0.075628 - 1.874016 \]
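Rather than reading these numbers off the summary, you can pull them straight from the fitted model with the “coef” function. A minimal sketch, reusing the lm object fitted above (the 80-minute waiting time and the cf name are made up for illustration):

# coefficients of the fitted model (intercept first, then the slope for waiting)
cf <- coef(lm)
cf

# predicted eruption length (in minutes) for an 80-minute wait, computed by hand
cf[1] + cf[2] * 80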

Let’s break down everything that the model summary returns.

  • The “Call” section shows the call that created the model.
  • The “Residuals” section gives you a summary of all of your model residuals. Simply put, a residual is how far a given observed value falls from the value the model predicts.
  • The “Coefficients” section gives us our model coefficients, our intercept, and statistical values to determine their significance.
    • For each coefficient, we are given the respective standard error. The standard error measures the precision of the coefficient’s estimate.
    • Next, we have a t value for each coefficient. The t value is calculated by dividing the coefficient by its standard error.
    • Finally, you have the p-value accompanied by symbols that denote the corresponding significance level.
  • The residual standard error measures the standard deviation of the residuals. It is calculated by dividing the residual sum of squares by the residual degrees of freedom and taking the square root, where the residual degrees of freedom equal the number of observations minus the number of predictors minus one (here, 272 − 1 − 1 = 270).
  • R-squared gives you the proportion of variance that can be explained by your model. The adjusted R-squared tells you the same thing but adjusts for the number of variables you’ve included in your model.
  • The F-statistic tests whether all of your model coefficients (other than the intercept) are actually equal to zero; its very small p-value here suggests they are not.
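Once you’re comfortable with the summary, you can use the model to make predictions with the “predict” function. Because we fit the model with faithful$waiting directly in the formula, predict can’t easily accept new data, so this minimal sketch refits the model using the data argument; the fit and new_waits names, and the waiting times themselves, are made up for illustration:

# refit the same model using the data argument so predict() can take new data
fit <- lm(eruptions ~ waiting, data = faithful)

# hypothetical waiting times we would like predictions for
new_waits <- data.frame(waiting = c(55, 70, 85))

# predicted eruption lengths, one per row of new_waits
predict(fit, newdata = new_waits)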

16.2 Multiple Regression

If you have more x variables you want to add to your linear model, you can add them to the formula just as you would add terms to any other equation. Here’s an example:

lm(data$y ~ data$x1 + data$x2 + data$x3 + data$x4)

Additionally, you can use the “data” parameter rather than putting the name of the dataset before every variable.

lm(y ~ x1 + x2 + x3 + x4, data = data)

Let’s try a real example with the mtcars dataset.

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now, let’s try to predict mpg using every other column as a variable and see what the results look like.

lm <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
         data = mtcars)
summary(lm)

Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
    am + gear + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

From here, you would likely tweak your model further based on these significance statistics; however, that’s outside the scope of this book. Take a look at the resources section at the end of this chapter to dive deeper into developing regression models.
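As one taste of what that tweaking can look like, base R’s “step” function will drop (or add) variables one at a time based on AIC. A minimal sketch using the lm object fitted above; reduced is a made-up name, and stepwise selection is only one of many possible approaches:

# backward stepwise selection: drop predictors one at a time while AIC improves
reduced <- step(lm, direction = "backward", trace = 0)
summary(reduced)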

16.3 Logistic Regression

Logistic regression is commonly used when your dependent variable (y) is binary (0 or 1). Instead of the “lm” function, you will use the “glm” function with the family set to binomial. Let’s try this out on the mtcars dataset again, but this time with “am” (transmission type; 0 = automatic, 1 = manual) as the dependent variable.

glm <- glm(am ~ cyl + hp + wt, family = binomial, data = mtcars)
summary(glm)

Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.17272  -0.14907  -0.01464   0.14116   1.27641  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491  
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.2297  on 31  degrees of freedom
Residual deviance:  9.8415  on 28  degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8
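By default, the coefficients of a logistic model are on the log-odds scale. To turn the fitted model into predicted probabilities, and then into 0/1 predictions, you can use “predict” with type = "response". A minimal sketch using the glm object above; the 0.5 cutoff and the probs and preds names are arbitrary choices for illustration:

# predicted probability that each car has a manual transmission (am = 1)
probs <- predict(glm, type = "response")

# convert probabilities to 0/1 predictions using a 0.5 cutoff
preds <- ifelse(probs > 0.5, 1, 0)

# compare the model's predictions to the actual am values
table(predicted = preds, actual = mtcars$am)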

16.4 Resources