j.cube

Reputation: 15

R: Finding the coefficients of an expression which produce the largest R-squared value?

Let's say I've got data entered into a data frame like so:

df = data.frame(x = c(1,2,3,4,5,10,15,25,50), 
                y = c(.57,.75,.82,0.87,.89,.95,.97,.98,.99))
df

and I wish to fit the expression:

y = ((x/a)^b)/(1+(x/a)^b)

where a and b are unknown parameters.

I have plotted the points and drawn a fitted line by guessing the values of a and b:

library(ggplot2)
graph <- ggplot(df, aes(x=x, y=y))
graph <- graph + geom_point()

a = 0.50
b = 1.00
guesstimate <- function(x){((x/a)^b)/(1+(x/a)^b)}
graph <- graph + stat_function(fun = guesstimate)
graph

However, I'd like to find the values of a and b that produce the highest R-squared value; i.e. the best possible mathematical fit for the data.

Question: Short of guessing values of a and b manually and checking with the naked eye which fit is best, is there a way to get R to find the 'best' a and b values, along with an R-squared value confirming that the chosen a and b are indeed the best possible fit?

Upvotes: 1

Views: 296

Answers (2)

nicola

Reputation: 24510

Even if it's not obvious, a linear model can be applied here using just basic algebra. Indeed, starting from 1/y = (1+(x/a)^b)/(x/a)^b and a little manipulation, you can arrive at:

log(1/y - 1) = -b*log(x) + b*log(a)

which is basically a linear model in the variables Y = log(1/y - 1) and X = log(x). From here, you can use lm:

df2 <- data.frame(Y = log(1/df$y - 1), X = log(df$x))
model <- lm(Y ~ X, data = df2)$coefficients
a <- exp(-model[1]/model[2])
# 0.7491387
b <- -model[2]
# 1.116111

which are similar to those obtained with nls.
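To see the whole trick end to end, here is a self-contained sketch (using the same df as in the question) that fits the linearized model and back-transforms the coefficients onto the original scale. Recall that the slope of the linear fit is -b and the intercept is b*log(a):

```r
# Linearization: log(1/y - 1) = -b*log(x) + b*log(a)
df <- data.frame(x = c(1,2,3,4,5,10,15,25,50),
                 y = c(.57,.75,.82,.87,.89,.95,.97,.98,.99))
df2 <- data.frame(Y = log(1/df$y - 1), X = log(df$x))
model <- lm(Y ~ X, data = df2)$coefficients

b <- unname(-model[2])                 # slope is -b
a <- unname(exp(-model[1]/model[2]))   # intercept is b*log(a), so a = exp(intercept/b)

# Fitted values back on the original scale
y_hat <- (df$x/a)^b / (1 + (df$x/a)^b)
```

Checking y_hat against df$y shows the back-transformed curve tracks the data closely, with a ≈ 0.749 and b ≈ 1.116.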

Upvotes: 1

eipi10

Reputation: 93871

You can use the nls (non-linear least squares) function:

m1 = nls(y ~ (x/a)^b/(1+(x/a)^b), start = list(a=1, b=1), data=df)

summary(m1)
Formula: y ~ (x/a)^b/(1 + (x/a)^b)

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a 0.779291   0.009444   82.51 1.01e-11 ***
b 1.145174   0.012733   89.94 5.53e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003086 on 7 degrees of freedom

Number of iterations to convergence: 4 
Achieved convergence tolerance: 5.949e-08

ggplot(df, aes(x,y)) + 
  geom_point() +
  geom_line(data=data.frame(x=seq(1,50,0.1), y=predict(m1, newdata=list(x=seq(1,50,0.1)))), 
            aes(x,y), colour="red")

[Plot: data points with the fitted nls curve in red]

nls does not provide an r-squared value, because, as discussed in this thread on R-help, r-squared is not necessarily meaningful for a non-linear model. nls does, however, find the parameter values that minimize the residual sum-of-squares, so in that sense these parameters provide the best fit for the given data and model. That doesn't mean that there isn't another model specification that gives a better fit, though in this case the model fit is virtually perfect.
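That said, if you still want a single summary number, a pseudo-R-squared can be computed by hand from the residual sum-of-squares (a sketch, with the caveats from that R-help thread in mind; this refits the same nls model so it stands alone):

```r
# Pseudo R-squared for the nls fit: 1 - RSS/TSS
df <- data.frame(x = c(1,2,3,4,5,10,15,25,50),
                 y = c(.57,.75,.82,.87,.89,.95,.97,.98,.99))
m1 <- nls(y ~ (x/a)^b/(1 + (x/a)^b), start = list(a = 1, b = 1), data = df)

rss <- sum(residuals(m1)^2)              # residual sum-of-squares
tss <- sum((df$y - mean(df$y))^2)        # total sum-of-squares about the mean
pseudo_r2 <- 1 - rss/tss
pseudo_r2  # very close to 1 here, but interpret with care for non-linear models
```

For this data the value is essentially 1, which matches the visually perfect fit, but it should not be compared across non-linear models the way R-squared is for linear ones.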

Upvotes: 2
