Reputation: 33
I have a question regarding the use of the predict()
function with the RcppArmadillo
and RcppEigen
packages, when a factor variable has only one level. I built a MWE below using the iris
dataset.
I want to first estimate a linear regression model using RcppArmadillo
, and then use it to predict values. The data I use for estimation contains factor variables (with more than one level and no NA
). The prediction I want to make is slightly unusual in one respect: I want to predict values using the same factor level for all observations (this level being in levels used in the estimation). In the example below, it means that I want to predict Sepal.Length
as if all observations were from the "versicolor" Species.
This works well when I estimate the model using the lm()
function, but does not work with the RcppArmadillo::fastLm()
or RcppEigen::fastLm()
functions. I get the following error: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
. The same error happens again if one of the factor levels is missing. I understand well why at least two levels are necessary for estimation, but I don't understand why having only one level is a problem for prediction once the model has been properly estimated.
The obvious solution would be to use lm()
instead of fastLm()
, but this is unfortunately impossible because my data is quite big. After some trial-and-error, I found this dirty work-around:
Does anyone have a better solution than this, or at least an explanation as to why this error exists?
library(data.table)
# Loading iris data
iris <- as.data.table(iris)
# Estimating the model
model <-
RcppArmadillo::fastLm(Sepal.Length ~
factor(Species)
+ Sepal.Width
+ Petal.Length
+ Petal.Width,
data=iris)
summary(model)
####
#### Here is the error I don't understand
####
# This is the standard use of the predict function
iris2 <- copy(iris)
iris2[, predict := predict(model, iris2)]
# This is the way I want to use the predict function
# This does not work for some reason
iris2 <- copy(iris)
iris2[, Species := "versicolor"]
iris2[, predict2 := predict(model, iris2)]
####
#### This is a dirty work-around
####
# Creating a modified dataframe
iris3 <- copy(iris)
iris3[, `:=`(Species = "versicolor",
data = "Modified data")]
# copying the original dataframe
iris4 <- copy(iris)
iris4[, data := "Original data"]
# Stacking the original data and the modified data
iris5 <- rbind(iris3, iris4)
iris5[, predict := predict(model, iris5)]
# Keeping only the modified data
iris_final <- iris5[data == "Modified data"]
Upvotes: 3
Views: 213
Reputation: 16930
Not a solution, but an explanation for why it happens.
If we inspect the source code of RcppAramdillo:::predict.fastLm()
, we find that it constructs the design matrix for the prediction points via
x <- model.matrix(object$formula, newdata)
On the other hand, if we inspect the source for stats::predict.lm()
, we find
tt <- terms(object)
## Some source omitted here
Terms <- delete.response(tt)
m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses"))) .checkMFClasses(cl, m)
X <- model.matrix(Terms, m, contrasts.arg = object$contrasts)
which reveals that lm()
stores in its result information about the factor levels and contrasts for predictors, while fastLm()
reconstructs that information in the predict()
call:
names(model)
# [1] "coefficients" "stderr" "df.residual" "fitted.values"
# [5] "residuals" "call" "intercept" "formula"
names(lm_mod) ## Constructed with `lm()` call with same formula
# [1] "coefficients" "residuals" "effects" "rank"
# [5] "fitted.values" "assign" "qr" "df.residual"
# [9] "contrasts" "xlevels" "call" "terms"
# [13] "model"
Note the "xlevels"
and "contrasts"
elements in an lm
object that are not present in a fastLm
object. This goes to a larger point, though, from help("fastLM")
:
Linear models should be estimated using the lm function. In some cases, lm.fit may be appropriate.
Dirk can correct me if I'm wrong, but I think the point of fastLm()
is not to provide a rich OLS implementation that covers all the use cases that stats::lm()
does; I think it's more illustrative.
If your problem is big data and that's why you don't want to use stats::lm()
, might I suggest something like biglm::biglm()
? (See, for example, here). If you are really set on using RcppArmadillo::fastLm()
, you could do a smaller version of your workaround; instead of copying your whole data, just append one row to your prediction set for each unused factor level.
Upvotes: 4