Unit 10. Simple Linear Regression in R
Leyre Castro
Summary. In this unit, we will explain how to conduct a simple linear regression analysis with R, and how to read and interpret the output that R provides.
Prerequisite Units
Unit 5. Statistics with R: Introduction and Descriptive Statistics
Unit 6. Brief Introduction to Statistical Significance
Unit 9. Simple Linear Regression
Simple linear regression analysis in R
Whenever we ask statistical software to perform a linear regression, it uses the equations that we described in Unit 9 to find the best-fitting line, and then shows us the parameter estimates obtained. Let's see here how to conduct a simple linear regression analysis with R.
We will use the dataset of 50 participants with different amounts of experience (from 1 to 16 weeks) performing a computer task, along with their accuracy (from 0 to 100% correct) on this task (the same dataset that we used in Unit 8). Thus, Experience is the predictor variable, and Accuracy is the outcome variable.
In the script below, the first line imports our dataset using the read.table() function and assigns it to the object MyData.
#read data file
MyData <- read.table("ExperienceAccuracy.txt", header = TRUE, sep = ",")
#fit linear model
model <- lm(Accuracy ~ Experience, data = MyData)
summary(model)
#see the residuals plot
plot(model)
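Before fitting the model, it is good practice to confirm that the data were imported correctly. A minimal check, using the MyData object created in the script above:
#inspect the first rows and the structure of the imported data
head(MyData)
str(MyData)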
To conduct the linear regression in which we will model Accuracy as a function of Experience, we use the lm() function in R, as you can see in the second line of code in the script. As usual in R, we assign our linear regression model to an object that we will call model. Remember that R works with objects, and that this name is arbitrary (you could call the object to which the linear regression model is assigned seewhatIgot or mybestanalysis), although it is convenient to use a relatively simple name that is relevant to your task at hand. Within the parentheses, you first include your Y variable (Accuracy, in this case) and then your X variable (Experience, in this case), connected by the ~ symbol. This reads as “Accuracy as a function of Experience.” Then, you indicate where your data are; here, you include the object that you created for the dataset that you are working with, MyData.
In the next line, you ask to see the output of the linear regression analysis by applying the summary() function to the model. And this is what you will obtain:
Call:
lm(formula = Accuracy ~ Experience, data = MyData)

Residuals:
Min 1Q Median 3Q Max
-8.7014 -2.4943 0.6176 2.7505 6.6814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.7872 1.1115 51.99 <2e-16 ***
Experience 2.1276 0.1124 18.93 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.705 on 48 degrees of freedom
Multiple R-squared: 0.8818, Adjusted R-squared: 0.8794
F-statistic: 358.2 on 1 and 48 DF, p-value: < 2.2e-16
* Call: The first item shown in the output is the formula R used to fit the data (that is, the formula that you typed to request the linear regression analysis).
* Residuals: Here, the errors, or residuals, are summarized. The smaller the residuals, the smaller the difference between the data that you obtained and the predictions of the model. The Min and Max tell you the largest negative (below the regression line) and largest positive (above the regression line) errors. Given that our accuracy scale goes from 0 to 100, a largest positive error of 6.68 and a largest negative error of -8.70 do not seem terribly large. Importantly, we can see that the median, 0.62, is very close to 0, and that the first and third quartiles (1Q and 3Q) are approximately the same distance from the center. Therefore, the distribution of residuals seems fairly symmetrical.
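You can reproduce this five-number summary directly: the residuals() function returns the 50 individual errors, and quantile() gives their minimum, quartiles, and maximum (a minimal sketch, using the model object fitted above):
#five-number summary of the residuals (Min, 1Q, Median, 3Q, Max)
quantile(residuals(model))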
* Coefficients: This is the critical section in the output. For the intercept and for the predictor variable, you get an estimate that comes along with a standard error, a t-value, and the significance level.
The Estimate for the intercept, or b0, is the model's estimate of the value of Y when X is 0; that is, the predicted accuracy level (57.79% here) of someone who has 0 weeks of experience with the task. Remember that you should only interpret the intercept if zero is within or very close to the range of values of your predictor variable, and if talking about a zero value for your predictor variable makes sense. This may be a bit tricky in our example. The minimum amount of experience with the computer task among our participants is 1 week, so the zero value seems close enough. However, realize that someone with zero experience with the task may not know what to do and may not be able to perform the task at all. Thus, the value of the intercept may not make sense. This is something that you have to evaluate when you know your methods and procedures well.
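You can confirm this interpretation with the predict() function: asking the model for its prediction when Experience is 0 returns the intercept (a quick sketch, using the model object from above):
#predicted accuracy for 0 weeks of experience; this equals the intercept
predict(model, newdata = data.frame(Experience = 0))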
The Estimate for our predictor variable, Experience, appears below: 2.13. This is b1, the estimated slope coefficient or, simply, the slope of the linear regression. Remember that the slope tells us the amount of expected change in Y for each one-unit change in X. Here, we would say that for each additional week of experience with the task, accuracy is expected to improve by 2.13 points.
Pay attention to the sign of the estimate for the predictor. If it is positive, it represents the expected increase in the outcome variable for each one-unit increase in the predictor variable; if it is negative, it represents the expected decrease in the outcome variable for each one-unit increase in the predictor variable.
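You can see what the slope means by comparing the model's predictions for two values of Experience that are one unit apart; their difference is exactly the slope estimate (the values 5 and 6 below are an arbitrary illustration):
#predicted accuracy for 5 and 6 weeks of experience
preds <- predict(model, newdata = data.frame(Experience = c(5, 6)))
#the difference between the two predictions equals the slope, about 2.13
diff(preds)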
The Std. Error (standard error) tells you how precisely each of the estimates was measured. Ideally, we want a standard error that is small relative to its coefficient. The standard error is also important for calculating the t-value: the t-value is obtained by taking the estimate for the coefficient and dividing it by its standard error (for example, the t-value of 18.93 in the Experience line is 2.1276 / 0.1124). This t-value is then used to test whether or not the estimate for the coefficient is significantly different from zero. If the coefficient b1 is not significantly different from zero, it means that the slope of the regression line is close to flat, so changes in the predictor variable are not related to changes in the outcome variable.
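You can verify this computation by extracting the coefficient table from the model summary and dividing the estimates by their standard errors (a minimal sketch, using the model object from above):
#extract the coefficient table (Estimate, Std. Error, t value, Pr(>|t|))
coefs <- summary(model)$coefficients
#t-values: each estimate divided by its standard error
coefs[, "Estimate"] / coefs[, "Std. Error"]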
Pr(>|t|) refers to the significance level of the t-test. Remember that the cut-off value for a hypothesis test to be statistically significant is 0.05 (see Unit 6), so if the p-value is less than 0.05, the result is statistically significant. Here, the p-value is very small; R uses scientific notation for very small quantities, and that is why you see the e in the number. You just need to know that this is a tiny value. For values larger than .001, the exact number will appear. The asterisks next to the p-values indicate the magnitude of the p-value (* for < .05, ** for < .01, and *** for < .001, as described in the Signif. codes line).
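If the scientific notation is hard to read, you can ask R to write such a number out in full; a quick illustration:
#2e-16 is a 2 in the 16th decimal place, preceded by 15 zeros
format(2e-16, scientific = FALSE)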
* At the bottom of the output, you have some measures that also help to evaluate the linear regression model. We have not yet seen what many of those elements mean so, for the time being, just note that R2 (see Unit 7) is included here, indicating the amount of variance in Accuracy that is explained by Experience. R2 always lies between 0 and 1. An R2 of 0 means that the predictor variable provides no information about the outcome variable, whereas an R2 of 1 means that the predictor variable allows perfect prediction of the outcome variable, with every point of the scatterplot exactly on the regression line. Anything in between represents different levels of closeness of the scattered points around the regression line. Here, we obtained an R2 of 0.88, so you could say that Experience explains 88% of the variance in Accuracy. The difference between multiple and adjusted R2 is negligible in this case, given that we only have one predictor variable. Adjusted R2 takes into account the number of predictor variables and is more useful in multiple regression analyses.
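Both versions of R2 can be extracted directly from the model summary (a minimal sketch, using the model object from above):
#multiple R-squared and adjusted R-squared
summary(model)$r.squared
summary(model)$adj.r.squared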
See video with a simple linear regression analysis being conducted in R