Reputation: 75585
Suppose I have an input feature vector containing 10 input features, each with order of magnitude around 1E-7.

When I run linear regression on the log of these input features, I get an R^2 value of around 0.98. However, if I add 1E-2 to each of my input features before running them through the same fit, I get an R^2 value of only 0.5616.

The problem is that I will not know a priori that the constant added to my input features was 1E-2, so I cannot simply subtract that quantity off every time.
Is there a general way to correct for a large, unknown constant added to my input feature set?
Here is a sample input file: http://stanford.edu/~hq6/13
Here is a corresponding output file: http://stanford.edu/~hq6/15
Here is some code that is used for training:
input_features = read.csv('InputFeatures.csv', header=F)
# Adding a constant error term to all input features
input_features = input_features + 1E-2
# How can we correct for this constant if we do not know what it is beforehand?
# Clamp non-positive values so the log below is defined
input_features[input_features <= 0] = 1E-10
input_features = log(input_features)
output = read.csv('Output.csv', header=F)
# Combine features and output; data.frame renames the output column to V1.1
full_data = data.frame(input_features, output)
summary(lm(V1.1 ~ ., data=full_data))
When this code is run without the line input_features = input_features + 1E-2, I get an R-squared of approximately 0.98 from the summary output. When that line is included, the R-squared drops below 0.5.
Upvotes: 0
Views: 217
Reputation: 21532
So you're suggesting your dataset fits y = A + B*exp(C*x). Why not do a direct fit using nls or other nonlinear fitting tools?
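For example, a minimal nls sketch for that model with a single feature; the data, variable names, and starting values below are made up for illustration:

# Simulate data following y = A + B*exp(C*x) with known parameters
set.seed(42)
x = runif(100, 0, 2)
y = 1.5 + 0.8 * exp(-2 * x) + rnorm(100, sd = 0.01)
# Direct nonlinear least-squares fit; nls needs starting values
fit = nls(y ~ A + B * exp(C * x), start = list(A = 1, B = 1, C = -1))
summary(fit)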
If you wish to do a linear fit to the log of both sides, it should be obvious from the rules of logarithms that while log(ab) = log(a) + log(b), there is no corresponding rule for log(a + b), so you cannot separate out the effect of two summed terms.
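A quick numeric check of that rule in R:

# log of a product splits into a sum of logs...
isTRUE(all.equal(log(2 * 3), log(2) + log(3)))   # TRUE
# ...but there is no such rule for the log of a sum
isTRUE(all.equal(log(2 + 3), log(2) + log(3)))   # FALSE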
Upvotes: 1
Reputation: 66850
Linear regression on data in R^10 produces 11 real numbers: the 10 coefficients of the fitted hyperplane plus an intercept. From your post it seems that you have one ("value of ...") or at most two ("R^2"), which still seems wrong.

Or maybe by R^2 you meant the residual error?

Linear regression itself is invariant to adding a constant, as long as it does not introduce numerical imprecision and the constant is added to all of your features. If you add it to just one feature, it is quite obvious that the results will change, as that dimension may become more or less important (depending on the sign of the constant). To make the fit invariant to such operations you can normalize your data, either by linearly scaling each feature to the [0, 1] interval or by standardizing to mean = 0 and std = 1.
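For instance, a minimal sketch of both normalizations in R, reusing the input_features data frame from the question's code:

# Min-max scaling: map each column linearly onto the [0, 1] interval
minmax = as.data.frame(lapply(input_features, function(col) (col - min(col)) / (max(col) - min(col))))
# Z-score standardization: rescale each column to mean = 0, sd = 1
standardized = as.data.frame(scale(input_features))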
Upvotes: 0