JASC

Reputation: 196

Linear regression: calculate confidence and prediction intervals from the standard errors of the fitted parameters and the correlation coefficient

In many fields of the natural sciences it is common practice to report the results of a linear regression analysis as y = (a1 +- u(a1)) + (a2 +- u(a2)) * x, together with R2 and p, but without the original data; u(a1) and u(a2) are the uncertainties (standard errors) of a1 and a2. How can I use this information to calculate the confidence and prediction intervals, or at least obtain a reasonable estimate of them?

Let me clarify with an example. This is a dummy dataset: a line of slope 1 plus Gaussian noise with standard deviation 10:

set.seed(1961)
npoints <- 1e2
(x <- 1:npoints)
(y <- 1:npoints + rnorm(npoints, 0, npoints/10))

Now I perform the linear regression:

par(mar = c(4, 4, 1, 1))
xy.model <- lm(y ~ x)
plot(x, y, pch = 16)
abline(xy.model, col = "orange", lwd = 2)
(xy.sum   <- summary(xy.model))
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -1.28106    1.94918  -0.657    0.513    
# x            1.00484    0.03351  29.987   <2e-16 ***
# Residual standard error: 9.673 on 98 degrees of freedom
# Multiple R-squared:  0.9017,  Adjusted R-squared:  0.9007 
# F-statistic: 899.2 on 1 and 98 DF,  p-value: < 2.2e-16

I calculate the confidence and prediction intervals:

x.new   <- data.frame(x = 1:npoints)
xy.conf <- predict.lm(xy.model, se.fit = TRUE, interval = "confidence", newdata = x.new)
xy.pred <- predict.lm(xy.model, se.fit = TRUE, interval = "prediction", newdata = x.new)

For example, the confidence and prediction intervals of the first point are:

xy.conf$fit[1, ] 
#        fit        lwr        upr 
# -0.2762127 -4.0867009  3.5342755
xy.pred$fit[1, ]
#       fit         lwr         upr 
# -0.2762127 -19.8462821  19.2938568

If the regression equation is reported as y = (-1.28106 +- 1.94918) + (1.00484 +- 0.03351) * x, R2 = 0.9017, p < 0.05, but the original data are not provided, how can I reproduce, at least approximately, the values of the confidence and prediction intervals?

Upvotes: 0

Views: 588

Answers (1)

dcarlson

Reputation: 11056

Without the original data you need one more piece of information: the means of the two variables. The statistics you provide are enough to reconstruct the regression line, but the confidence and prediction bands are narrowest at (mean(x), mean(y)), so without the means you cannot locate the bands and therefore cannot compute them.
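To make this concrete: given the reported statistics plus mean(x), everything the interval formulas need can be recovered. The sample size n follows from the residual degrees of freedom (df = n - 2), the residual standard error s is reported, and Sxx = sum((x - mean(x))^2) follows from SE(slope) = s / sqrt(Sxx). A sketch using the numbers from the question, where xbar = 50.5 is the extra piece of information assumed to be known:

```r
# Reported statistics (from summary(xy.model) in the question)
a1 <- -1.28106; a2 <- 1.00484   # intercept and slope
s  <- 9.673                     # residual standard error
df <- 98                        # residual degrees of freedom
se_a2 <- 0.03351                # standard error of the slope

n    <- df + 2                  # simple linear regression: df = n - 2
Sxx  <- (s / se_a2)^2           # because SE(slope) = s / sqrt(Sxx)
xbar <- 50.5                    # mean(x): the extra piece of information

x0  <- 1                        # evaluate the intervals at the first point
fit <- a1 + a2 * x0
tq  <- qt(0.975, df)

# 95% confidence interval for the mean response at x0
se_conf  <- s * sqrt(1/n + (x0 - xbar)^2 / Sxx)
conf_int <- fit + c(-1, 1) * tq * se_conf

# 95% prediction interval for a new observation at x0
se_pred  <- s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx)
pred_int <- fit + c(-1, 1) * tq * se_pred

round(conf_int, 2)   # approx. -4.09  3.53, matching xy.conf$fit[1, ]
round(pred_int, 2)   # approx. -19.85 19.29, matching xy.pred$fit[1, ]
```

The small discrepancies come only from the rounding of the reported coefficients; with the exact values the reconstruction is exact.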

A simple example may make this clearer. Start with some data:

z <- structure(list(x = c(5, 5.1, 5.4, 5.8, 4.7, 5.7, 4.8, 5.1, 4.6, 
5.4, 5.2, 5, 5, 5.5, 5.2, 5.1, 4.7, 5.2, 4.8, 5.4, 4.8, 5.1, 
5, 4.6, 4.8), y = c(3.4, 3.7, 3.4, 4, 3.2, 3.8, 3, 3.5, 3.1, 
3.7, 4.1, 3.4, 3.6, 4.2, 3.5, 3.3, 3.2, 3.4, 3, 3.9, 3.1, 3.5, 
3.5, 3.4, 3.4)), row.names = c(NA, -25L), class = "data.frame")

Compute the regression line and plot it along with the data:

z.lm <- lm(y ~ x, z)
z.lm
# 
# Call:
# lm(formula = y ~ x, data = z)
# 
# Coefficients:
# (Intercept)            x  
#     -0.4510       0.7762  
# 
plot(y ~ x, z, xlim = c(0, 20), ylim = c(0, 20))
abline(z.lm)

Now create a new data set from the original data and compute the regression:

x2 <- z$x + 10
y2 <- z$y + 10 * coef(z.lm)[2]
z2 <- data.frame(x = x2, y = y2)
points(y ~ x, z2, col = "red")
z2.lm <- lm(y ~ x, z2)
z2.lm
# 
# Call:
# lm(formula = y ~ x, data = z2)
# 
# Coefficients:
# (Intercept)            x  
#     -0.4510       0.7762  

Notice that the regression coefficients are identical to those from the original data. In fact, changing 10 to any other value produces yet another data set with the same regression results, even though the data, and hence the means, differ.
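One way to see the dependence on the means, continuing the example above (the objects are rebuilt here so the snippet runs on its own): ask both models for a confidence interval at the same absolute x. The fitted values agree, since the two lines coincide, but the interval widths do not.

```r
# Rebuild the example: original data z, and a copy shifted by 10
z <- data.frame(
  x = c(5, 5.1, 5.4, 5.8, 4.7, 5.7, 4.8, 5.1, 4.6, 5.4, 5.2, 5, 5,
        5.5, 5.2, 5.1, 4.7, 5.2, 4.8, 5.4, 4.8, 5.1, 5, 4.6, 4.8),
  y = c(3.4, 3.7, 3.4, 4, 3.2, 3.8, 3, 3.5, 3.1, 3.7, 4.1, 3.4, 3.6,
        4.2, 3.5, 3.3, 3.2, 3.4, 3, 3.9, 3.1, 3.5, 3.5, 3.4, 3.4))
z.lm  <- lm(y ~ x, z)
z2    <- data.frame(x = z$x + 10, y = z$y + 10 * coef(z.lm)[2])
z2.lm <- lm(y ~ x, z2)

# Same line, same fitted value at x = 15 ...
x.new <- data.frame(x = 15)
p1 <- predict(z.lm,  newdata = x.new, interval = "confidence")
p2 <- predict(z2.lm, newdata = x.new, interval = "confidence")
p1
p2
# ... but the interval from z.lm is far wider: x = 15 sits about 10 units
# from mean(z$x) = 5.08, while it is almost exactly at mean(z2$x) = 15.08.
```

So identical coefficients, R2 and p are compatible with very different confidence and prediction bands, which is why the means are indispensable.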

[Plot: regression line with the original data (black) and the shifted data (red)]

Upvotes: 1
