pyring

Reputation: 357

Nonlinear regression with a discrete independent variable

I have two variables whose relationship does not satisfy the assumption of linearity. The dependent variable is continuous and the independent variable is numeric and discrete. Here are the residual plot and a box-and-whisker plot:

[Images: residual plot and box-and-whisker plot]

Therefore, I cannot use a linear regression. I tried to linearize the relationship between the variables by transforming the data (log(y), log(x), sqrt(y), etc.), but I did not find any transformation that satisfactorily increased the linearity. So I ended up trying to fit nonlinear functions to my data (this is a totally unknown field for me, and I have not been able to find much information on the internet). I would therefore like to know whether I am taking the correct steps for my nonlinear analysis:

1) First, I chose a quadratic polynomial:

y = a + (b*x) + c*(x^2)

Here is my first doubt (which function should I use?), since I know there are infinitely many different functions that could describe the same data.

2) Second, I estimated the parameters with nonlinear least squares (function nls in R), which basically approximates the nonlinear function with a linear one and iteratively searches for the best parameter values:

m <- nls(y ~ a + (b*x) + c*(x^2), start= list(a = 2, b = 1, c=1))

For the initial values (start), I tried to pick values close to the expected final solution, although I found that the final result stays the same even when I change them. Here are the estimated parameter values from the nonlinear least squares fit:

Nonlinear regression model
  model: y ~ a + (b * x) + c * (x^2)
   data: parent.frame()
      a       b       c 
 2.1296 -0.9395 -1.1754 
 residual sum-of-squares: 27615

Number of iterations to convergence: 1 
Achieved convergence tolerance: 2.58e-08

3) Finally, I check the results by doing summary(m):

Formula: y ~ a + (b * x) + c * (x^2)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a  2.129555   0.003976  535.56   <2e-16 ***
b -0.939467   0.018500  -50.78   <2e-16 ***
c -1.175413   0.017818  -65.97   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5114 on 105597 degrees of freedom

Number of iterations to convergence: 1 
Achieved convergence tolerance: 2.58e-08

I also ran a small piece of code to evaluate the model:

RSS <- sum(residuals(m)^2)
TSS <- sum((y - mean(y))^2)
R.square <- 1 - (RSS/TSS)

Result:

R.square 
0.6365729
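
As a cross-check on the hand calculation above, here is a minimal sketch on simulated data (the real data are not available here, so the data frame d, the coefficients and the noise level are all made up for illustration). Since the quadratic model is linear in a, b and c, the same model can be fitted with lm, and the R-squared from summary() should match the value computed from RSS and TSS:

```r
# Simulated stand-in data (hypothetical; not the data from the question)
set.seed(1)
d <- data.frame(x = sample(seq(-1, 1, by = 0.25), 500, replace = TRUE))
d$y <- 2 - 1 * d$x - 1.2 * d$x^2 + rnorm(500, sd = 0.5)

# Same model form and starting values as in the question, fitted with nls
m <- nls(y ~ a + (b * x) + c * (x^2), data = d,
         start = list(a = 2, b = 1, c = 1))

# R-squared computed by hand from the nls residuals
RSS <- sum(residuals(m)^2)
TSS <- sum((d$y - mean(d$y))^2)
r2_manual <- 1 - (RSS / TSS)

# The same quadratic fitted as an ordinary linear model
fit_lm <- lm(y ~ x + I(x^2), data = d)
r2_lm <- summary(fit_lm)$r.squared

# TRUE if both values agree up to numerical tolerance
all.equal(r2_manual, r2_lm, tolerance = 1e-6)
```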

I do not know whether my procedure was orthodox, and I would like to know if I have taken the right steps. I would also appreciate some pointers on the best way to interpret the statistics and report these results.

Upvotes: 2

Views: 1430

Answers (2)

Dan Sp.

Reputation: 1447

The point of linear regression is to approximate your data with a line; the data do not have to be linear to begin with. Judging by the red dots in your plot, it looks like you need a quadratic regression. The equations you can program can be found over at math.stackexchange: Quadratic Regression equations
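
To give an idea of what programming those equations might look like in R, here is a rough sketch using the least-squares normal equations (X'X) beta = X'y for a quadratic. The data are simulated and the names X and beta_hat are just illustrative. I have not checked this against the exact formulas in the linked post. The result is compared with lm, which fits the same model:

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(42)
x <- sample(0:10, 200, replace = TRUE)          # discrete numeric predictor
y <- 3 + 0.5 * x - 0.2 * x^2 + rnorm(200, sd = 1)

# Design matrix with columns for the intercept, x and x^2
X <- cbind(1, x, x^2)

# Solve the normal equations (X'X) beta = X'y for beta = (a, b, c)
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Same quadratic fitted with lm for comparison
fit <- lm(y ~ x + I(x^2))

# Side-by-side coefficients: the two columns should agree
cbind(normal_eq = drop(beta_hat), lm = unname(coef(fit)))
```

In practice lm is the safer choice, since it uses a QR decomposition instead of forming X'X explicitly.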

Upvotes: 2

ndrearu

Reputation: 164

Actually, fitting a polynomial is a case of linear regression, since "linear" refers to how the model depends on the fit parameters, not on the independent variable. The functional form is up to you. However, your data seem to tend to zero as x grows, so I wouldn't use a polynomial but something like a long-tailed distribution. It is fine that your final parameter estimates do not depend on the initial guess (maybe I'm wrong, but I remember that this should always be true for linear regression, while in nonlinear cases the sum of squared residuals may have many minima). Finally, be careful with R^2: it should be used only for a linear regression (which is your case); it is completely useless and meaningless if you perform a nonlinear fit.
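
A small sketch of the point about starting values, on simulated data (the data frame d and all numbers are made up, not the poster's data): for the quadratic model, nls started from two very different guesses should land on the same estimates, and those should match what lm gives for the same model.

```r
# Simulated data (hypothetical; for illustration only)
set.seed(7)
d <- data.frame(x = sample(seq(-1, 1, by = 0.25), 400, replace = TRUE))
d$y <- 2 - 1 * d$x - 1.2 * d$x^2 + rnorm(400, sd = 0.5)

# Same quadratic model, two very different starting points
m1 <- nls(y ~ a + b * x + c * x^2, data = d,
          start = list(a = 2, b = 1, c = 1))
m2 <- nls(y ~ a + b * x + c * x^2, data = d,
          start = list(a = -50, b = 30, c = 100))

# Ordinary linear regression on x and x^2
fit_lm <- lm(y ~ x + I(x^2), data = d)

# The three rows should be identical up to numerical tolerance
rbind(nls_start1 = coef(m1),
      nls_start2 = coef(m2),
      lm         = unname(coef(fit_lm)))
```

With a model that is genuinely nonlinear in its parameters (for example an exponential decay), this agreement is not guaranteed, and different starting values can lead to different local minima or to a failure to converge.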

Upvotes: 1
