Reputation: 5
I used a linear model to obtain the best fit to my data, lm() function. From literature I know that the optimal fit would be a linear regression with the slope = 1 and the intercept = 0. I would like to see how good this equation (y=x) fits my data? How do I proceed in order to find an R^2 as well as a p-value?
This is my data (y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how good the equation y=x fits the data presented above (R^2 and p-value)?
I am very grateful if somebody can help me with this (basic) problem, as I found no answers to my question on stackoverflow?
Best regards Cyril
Upvotes: 0
Views: 984
Reputation: 270195
If measured
and modelled
are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, then if fm
is the lm
object for that undisclosed model then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured
and modelled
but the formula is different if there is or is not an intercept in the undisclosed model. The signs are that there is no intercept since if there were an intercept sum(modelled - measured, an.rm = TRUE)
should be 0 but in fact it is far from it.
In any case R^2 and the p value are shown in the output of the summary(fm) where fm is the undisclosed linear model so there is no point in restricting the discussion to measured
and modelled
if you have the lm
object of the undisclosed model.
For example, if the undisclosed model is the following then using the builtin CO2
data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we have the this output where the last two lines show R squared and p value.
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16
Upvotes: 1
Reputation: 46978
As mentioned in the comments, you can use the lm()
function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and the intercept = 0, essentially you have a fit and your modelled
is already the predicted value. You need the r-square from this fit. R squared is defined as:
R2 = MSS/TSS = (TSS − RSS)/TSS
See this link for definition of RSS and TSS.
We can only work with observations that are complete (non NA). So we calculate each of them:
TSS = nonNA = !is.na(modelled) & !is.na(measured)
# residuals from your prediction
RSS = sum((modelled[nonNA] - measured[nonNA])^2,na.rm=T)
# total residuals from data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2,na.rm=T)
1 - RSS/TSS
[1] 0.7116585
Upvotes: 2
Reputation: 174476
Let's be clear what you are asking here. You have an existing model, which is "the modelled
values are the expected value of the measured
values", or in other words, measured = modelled + e
, where e
are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e
?"
Trying to coerce lm
into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e
, you just need to know the proportion of the variance of measured
that is explained by modelled
. In other words:
1 - var(measured - modelled) / var(measured)
#> [1] 0.7192672
This is pretty close to the r squared from the lm
call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled
, in that the slope in the lm
model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
Upvotes: 2