Using predict() with RcppArmadillo/RcppEigen when a factor level has only one level

Question

I have a question regarding the use of the predict() function with the RcppArmadillo and RcppEigen packages, when a factor variable has only one level. I built a MWE below using the iris dataset.

I want to first estimate a linear regression model using RcppArmadillo, and then use it to predict values. The data I use for estimation contains factor variables (with more than one level and no NA). The prediction I want to make is slightly unusual in one respect: I want to predict values using the same factor level for all observations (this level being in levels used in the estimation). In the example below, it means that I want to predict Sepal.Length as if all observations were from the "versicolor" Species.

This works well when I estimate the model using the lm() function, but does not work with the RcppArmadillo::fastLm() or RcppEigen::fastLm() functions. I get the following error: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels. The same error happens again if one of the factor levels is missing. I understand well why at least two levels are necessary for estimation, but I don't understand why having only one level is a problem for prediction once the model has been properly estimated.

The obvious solution would be to use lm() instead of fastLm(), but this is unfortunately impossible because my data is quite big. After some trial-and-error, I found this dirty work-around:

I stack two versions of the data: the first one is the original data (with all factor levels), the second one is the modified data (with the same factor level for all observations);
I predict values on this dataset (it works because all factor levels are present in this dataset);
I keep only the modified subset of data.

Does anyone have a better solution than this, or at least an explanation as to why this error exists?

library(data.table)

# Loading iris data
iris <- as.data.table(iris)

# Estimating the model
model <-
  RcppArmadillo::fastLm(Sepal.Length ~ 
                 factor(Species)
               + Sepal.Width 
               + Petal.Length 
               + Petal.Width,
               data=iris)

summary(model)

#### 
#### Here is the error I don't understand
#### 

# This is the standard use of the predict function
iris2 <- copy(iris)
iris2[, predict := predict(model, iris2)]

# This is the way I want to use the predict function
# This does not work for some reason
iris2 <- copy(iris)
iris2[, Species := "versicolor"]
iris2[, predict2 := predict(model, iris2)]

#### 
#### This is a dirty work-around
#### 

# Creating a modified dataframe
iris3 <- copy(iris)
iris3[, `:=`(Species = "versicolor",
             data = "Modified data")]

# copying the original dataframe
iris4 <- copy(iris)
iris4[, data := "Original data"]

# Stacking the original data and the modified data
iris5 <- rbind(iris3, iris4)
iris5[, predict := predict(model, iris5)]

# Keeping only the modified data
iris_final <- iris5[data == "Modified data"]

duckmayr · Accepted Answer

Not a solution, but an explanation for why it happens.

If we inspect the source code of RcppAramdillo:::predict.fastLm(), we find that it constructs the design matrix for the prediction points via

x <- model.matrix(object$formula, newdata)

On the other hand, if we inspect the source for stats::predict.lm(), we find

tt <- terms(object)
## Some source omitted here
Terms <- delete.response(tt)
m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses")))  .checkMFClasses(cl, m)
X <- model.matrix(Terms, m, contrasts.arg = object$contrasts)

which reveals that lm() stores in its result information about the factor levels and contrasts for predictors, while fastLm() reconstructs that information in the predict() call:

names(model)
# [1] "coefficients"  "stderr"        "df.residual"   "fitted.values"
# [5] "residuals"     "call"          "intercept"     "formula"      
names(lm_mod) ## Constructed with `lm()` call with same formula
#  [1] "coefficients"  "residuals"     "effects"       "rank"         
#  [5] "fitted.values" "assign"        "qr"            "df.residual"  
#  [9] "contrasts"     "xlevels"       "call"          "terms"        
# [13] "model"

Note the "xlevels" and "contrasts" elements in an lm object that are not present in a fastLm object. This goes to a larger point, though, from help("fastLM"):

Linear models should be estimated using the lm function. In some cases, lm.fit may be appropriate.

Dirk can correct me if I'm wrong, but I think the point of fastLm() is not to provide a rich OLS implementation that covers all the use cases that stats::lm() does; I think it's more illustrative.

If your problem is big data and that's why you don't want to use stats::lm(), might I suggest something like biglm::biglm()? (See, for example, here). If you are really set on using RcppArmadillo::fastLm(), you could do a smaller version of your workaround; instead of copying your whole data, just append one row to your prediction set for each unused factor level.

Using predict() with RcppArmadillo/RcppEigen when a factor level has only one level

Answers (1)

Related Questions