GeoCat333

Reputation: 79

Why is my random forest regression predicting values not found in my training set?

I have a random forest regression model predicting plant height from a set of variables.

library(randomForest)

training <- read.csv('/sers/me/Desktop/training_data.csv')

rf_model <- randomForest(
  height ~ EVI + NDVI + Annual_Mean_Temperature +
    Annual_Precipitation + Precipitation_of_Wettest_Month,
  data = training, importance = TRUE, na.action = na.roughfix
)

But when I look at the predicted values I see some negative numbers, even though there are no negative values for the dependent variable in my training dataset. Since I'm predicting plant height, a negative value is physically impossible.

> min(rf_model$predicted)
  -4.433786671143025159836e-12

I've checked my training set and there are no negative values there, so how can this happen, and what should I do?

> min(training$height)
  0

Upvotes: 1

Views: 2048

Answers (1)

Walker Harrison

Reputation: 537

First, the negative number you listed is extremely small: it is equal to 0 out to 11 or 12 decimal places, so you can safely treat that fitted value as 0.
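In practice you can floor the predictions at zero. A minimal sketch, using a stand-in vector since I don't have your `rf_model` object:

```r
# Stand-in for rf_model$predicted; replace with your own fitted values
preds <- c(-4.4e-12, 0.5, 1.2)

# pmax() clamps any negligible negative prediction up to zero
preds_clamped <- pmax(preds, 0)

min(preds_clamped)  # no longer negative
```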

Second, without some sort of transformation, the range for a response variable in linear regression is the entire real line. The coefficients are chosen based on what minimizes the loss function (sum of squares in the basic case), so the model doesn't really care if it produces fitted values that aren't in the exact same range as the original response.

Take this misspecified model for example. We know the data-generating process requires Y to be positive, but a simple linear model will create negative fitted values in an effort to draw the best line through the data:


library(dplyr)    # provides the %>% pipe
library(ggplot2)

set.seed(0)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))  # y is strictly positive by construction

data.frame(x, y) %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = 'lm')  # the fitted line still dips below zero

[Plot: scatterplot of y against x, with the fitted regression line dipping below y = 0 at the left edge]

In order to restrict the range of your response, you can transform it, which is the idea behind GLMs. For example, if you take the logarithm of your response variable and then fit the model, you will have to exponentiate the resulting fitted values to get them back on the original scale, which guarantees that they are positive.
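To make this concrete, here is a sketch of that log-transform approach applied to the simulated data above (the variable names `log_fit` and `fitted_orig` are illustrative, not from the question):

```r
set.seed(0)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))

# Fit on the log scale, where the linear model is correctly specified
log_fit <- lm(log(y) ~ x)

# Exponentiate the fitted values to return to the original scale;
# exp() of any real number is positive, so no fitted value can be negative
fitted_orig <- exp(fitted(log_fit))

min(fitted_orig) > 0
```

The same idea applies to your height data: model `log(height)` (after handling any exact zeros, e.g. with a small offset) and exponentiate the predictions.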

Upvotes: 1

Related Questions