In many fields of the natural sciences it is common practice to report the results of a linear regression analysis as y = (a1 ± u(a1)) + (a2 ± u(a2)) * x, together with R² and p, but without the original data. Here u(a1) and u(a2) are the uncertainties (standard errors) of a1 and a2. How can I use this information to calculate the confidence and prediction intervals, or at least obtain a reasonable estimate of them?
Let me clarify with an example. This is a dummy dataset: a line of slope 1 plus Gaussian noise with standard deviation 10:
set.seed(1961)
npoints <- 1e2
x <- 1:npoints
y <- x + rnorm(npoints, mean = 0, sd = npoints / 10)  # true slope 1, noise sd 10
Now I perform the linear regression:
par(mar = c(4, 4, 1, 1))
xy.model <- lm(y ~ x)
plot(x, y, pch = 16)
abline(xy.model, col = "orange", lwd = 2)
(xy.sum <- summary(xy.model))
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.28106 1.94918 -0.657 0.513
# x 1.00484 0.03351 29.987 <2e-16 ***
# Residual standard error: 9.673 on 98 degrees of freedom
# Multiple R-squared: 0.9017, Adjusted R-squared: 0.9007
# F-statistic: 899.2 on 1 and 98 DF, p-value: < 2.2e-16
I calculate the confidence and prediction intervals:
x.new <- data.frame(x = 1:npoints)
xy.conf <- predict(xy.model, newdata = x.new, se.fit = TRUE, interval = "confidence")
xy.pred <- predict(xy.model, newdata = x.new, se.fit = TRUE, interval = "prediction")
For example, the confidence and prediction intervals of the first point are:
xy.conf$fit[1, ]
# fit lwr upr
# -0.2762127 -4.0867009 3.5342755
xy.pred$fit[1, ]
# fit lwr upr
# -0.2762127 -19.8462821 19.2938568
If the regression equation is reported as y = (-1.28106 ± 1.94918) + (1.00484 ± 0.03351) * x, with R² = 0.9017 and p < 0.05, but the original data are not provided, how can I reproduce, at least approximately, the values of the confidence and prediction intervals?
Without the original data you need one more piece of information: the means of the two variables. The statistics you report are enough to reconstruct the regression line itself, but the confidence and prediction bands are narrowest at (mean(x), mean(y)), so without those means you cannot compute the bands.
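To make this concrete, here is a minimal sketch (my reconstruction, not code from the question) of how the intervals in the question's example could be rebuilt from the reported statistics once mean(x) is supplied. It uses the textbook formulas fit ± t * s * sqrt(1/n + (x0 - mean(x))^2 / Sxx) for the confidence interval and fit ± t * s * sqrt(1 + 1/n + (x0 - mean(x))^2 / Sxx) for the prediction interval; Sxx is recovered from the reported slope standard error, se(b1) = s / sqrt(Sxx), and mean(y) is not needed separately because it lies on the fitted line.
s      <- 9.673     # residual standard error (reported)
df.res <- 98        # residual degrees of freedom (reported), so n = df.res + 2
n      <- df.res + 2
b0     <- -1.28106  # intercept (reported)
b1     <- 1.00484   # slope (reported)
se.b1  <- 0.03351   # standard error of the slope (reported)
x.bar  <- 50.5      # mean(x): the extra piece of information that must be known

Sxx    <- (s / se.b1)^2  # recovered from se(b1) = s / sqrt(Sxx)
x0     <- 1              # evaluate the intervals at the first point
fit    <- b0 + b1 * x0
t.crit <- qt(0.975, df.res)

fit + c(-1, 1) * t.crit * s * sqrt(1/n + (x0 - x.bar)^2 / Sxx)      # about -4.09  3.53
fit + c(-1, 1) * t.crit * s * sqrt(1 + 1/n + (x0 - x.bar)^2 / Sxx)  # about -19.85 19.29
These match xy.conf$fit[1, ] and xy.pred$fit[1, ] from the question up to the rounding of the reported coefficients.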
A simple example may make this clearer. Start with some data:
z <- structure(list(x = c(5, 5.1, 5.4, 5.8, 4.7, 5.7, 4.8, 5.1, 4.6,
5.4, 5.2, 5, 5, 5.5, 5.2, 5.1, 4.7, 5.2, 4.8, 5.4, 4.8, 5.1,
5, 4.6, 4.8), y = c(3.4, 3.7, 3.4, 4, 3.2, 3.8, 3, 3.5, 3.1,
3.7, 4.1, 3.4, 3.6, 4.2, 3.5, 3.3, 3.2, 3.4, 3, 3.9, 3.1, 3.5,
3.5, 3.4, 3.4)), row.names = c(NA, -25L), class = "data.frame")
Compute the regression line and plot it along with the data:
z.lm <- lm(y ~ x, data = z)
z.lm
#
# Call:
# lm(formula = y ~ x, data = z)
#
# Coefficients:
# (Intercept) x
# -0.4510 0.7762
#
plot(y ~ x, data = z, xlim = c(0, 20), ylim = c(0, 20))
abline(z.lm)
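To see where the band is narrowest (this overlay is an addition of mine, not part of the original answer), draw the confidence band on the same plot; it pinches in near mean(z$x), about 5.08, and flares out away from it:
x.grid <- data.frame(x = seq(0, 20, length.out = 200))
band   <- predict(z.lm, newdata = x.grid, interval = "confidence")
matlines(x.grid$x, band[, c("lwr", "upr")], lty = 2, col = "blue")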
Now create a new data set from the original data and compute the regression:
x2 <- z$x + 10
y2 <- z$y + 10 * coef(z.lm)[2]  # shift the points parallel to the fitted line
z2 <- data.frame(x = x2, y = y2)
points(y ~ x, data = z2, col = "red")
z2.lm <- lm(y ~ x, data = z2)
z2.lm
#
# Call:
# lm(formula = y ~ x, data = z2)
#
# Coefficients:
# (Intercept) x
# -0.4510 0.7762
Notice that the regression coefficients are identical to those for the original data; in fact, replacing 10 with any other value produces yet another data set with the same regression results. The two fits cannot be told apart from their reported statistics alone, yet their confidence and prediction bands are centered on different means, which is why those statistics by themselves cannot reproduce the intervals.
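A quick check (again my addition) makes the consequence explicit: at the same x, both models return the same fitted value, because the two lines coincide, but the interval widths differ, since each band is narrowest at its own mean(x):
# x = 5 is close to mean(z$x), about 5.08, but far from mean(z2$x), about 15.08,
# so the first interval is much narrower than the second
predict(z.lm,  newdata = data.frame(x = 5), interval = "confidence")
predict(z2.lm, newdata = data.frame(x = 5), interval = "confidence")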