Reputation: 79
I have a random forest regression model predicting plant height from a set of variables.
library(randomForest)
training <- read.csv('/sers/me/Desktop/training_data.csv')
rf_model <- randomForest(height ~ EVI + NDVI + Annual_Mean_Temperature + Annual_Precipitation + Precipitation_of_Wettest_Month, data = training, importance = TRUE, na.action = na.roughfix)
But when I look at the predicted values I see some negative numbers, even though the dependent variable has no negative values in my training dataset. As I'm predicting plant height, a negative value is physically impossible.
> min(rf_model$predicted)
-4.433786671143025159836e-12
I've checked my training set and it contains no negative values, so how can this happen, and what should I do?
> min(training$height)
0
Upvotes: 1
Views: 2048
Reputation: 537
First, the negative number you listed is extremely small: about -4.4e-12, which is 0 when rounded to 11 decimal places. You can probably just treat that fitted value as 0.
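If the tiny negative values get in the way downstream, one option is to clamp the predictions at zero. This is only a minimal sketch: `preds` is a hypothetical stand-in for `rf_model$predicted`, and the 1e-8 tolerance is an arbitrary choice, not something from the question.

```r
# Hypothetical stand-in for rf_model$predicted; the first value mimics the
# tiny negative number reported in the question.
preds <- c(-4.433786671143025e-12, 0.8, 2.5)

# Check that the minimum is negligible at a chosen tolerance (assumed: 1e-8).
abs(min(preds)) < 1e-8

# Clamp anything below zero to exactly zero; other values are unchanged.
preds_clamped <- pmax(preds, 0)
min(preds_clamped)
```

Clamping only makes sense when the negatives are this small. Large negative predictions would point to a modelling problem rather than rounding.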
Second, without some sort of transformation, the fitted values of a linear regression can fall anywhere on the real line. The coefficients are chosen to minimize the loss function (the sum of squared residuals in the basic case), so nothing constrains the fitted values to stay within the range of the original response.
Take this misspecified model for example. We know the data-generating process requires Y to be positive, but a simple linear model will create negative fitted values in an effort to draw the best line through the data:
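library(dplyr)    # provides %>%
library(ggplot2)

set.seed(0)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))  # Y is strictly positive by construction
data.frame(x, y) %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = 'lm')  # straight-line fit through the points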
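To check the same thing numerically rather than by eye, here is a rough sketch that refits the simulated data with `lm()` and counts the fitted values below zero. It assumes that the default `geom_smooth(method = 'lm')` fit corresponds to `lm(y ~ x)`; the name `fit` is just illustrative.

```r
set.seed(0)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))

# The same straight-line fit, done directly with lm().
fit <- lm(y ~ x)

all(y > 0)             # every simulated response is positive
sum(fitted(fit) < 0)   # number of fitted values that fall below zero
range(fitted(fit))     # the fitted values extend past the lower bound of y
```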
In order to restrict the range of your response, you can transform it, which is related to the idea behind link functions in GLMs. For example, if you take the logarithm of your response variable and then fit the model, you will have to exponentiate the resulting fitted values to get them back on the original scale, which guarantees that they are positive.
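As a sketch of the log-transform approach, here it is on the simulated data from the example above, using a plain `lm()` rather than your random forest. The names `fit_log` and `fitted_original_scale` are illustrative only.

```r
set.seed(0)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))   # strictly positive response

# Fit on the log scale.
fit_log <- lm(log(y) ~ x)

# Back-transform to the original scale; exp() of any finite number is positive.
fitted_original_scale <- exp(fitted(fit_log))

all(fitted_original_scale > 0)
```

One caveat for your data: `min(training$height)` is 0, and `log(0)` is `-Inf`, so the zeros would need handling before a log transform could be applied.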
Upvotes: 1