CVlm with categorical variables: factor has new levels

Question

I am using lm for multiple linear regression (MLR) and CVlm for cross-validation. My data contains two categorical variables (one with 10 levels and the other with only 2). Everything works fine with lm; the problem arises when I try to use CVlm, which fails with errors about the factor levels. I have read some posts about this, but I don't understand them very well (I pass CVlm the same data I used for lm, so I don't know why this error occurs or how to handle it). Here is a sample of my data:

      dput(head(data))
      structure(list(LagO3 = c(35.0092884462795, 37.7681232441784, 
      31.9993881550014, 32.5950690475087, 37.2233826323784, 42.531864470374
      ), Z = c(165.252173124639, 166.145467346544, 161.857655081398, 
      177.043656853793, 200.269306623339, 207.772978087346), RH = c(86.4605102539062, 
      93.2499008178711, 87.1677398681641, 81.0183639526367, 74.1963653564453, 
      78.7728729248047), SR = c(310.165555555556, 343.304444444444, 
      329.844444444444, 299.145555555556, 319.321111111111, 327.731111111111
      ), ST = c(320.032313368056, 286.879364149306, 295.939059244792, 
      319.065705295139, 316.955619574653, 297.229990234375), TC = c(0.0362091064453125, 
      0.171852111816406, 0.607879638671875, 0.770919799804688, 0.553321838378906, 
      0.04547119140625), Tmx = c(289.281782049361, 289.283827735997, 
      289.913899219804, 288.649664878918, 289.756381348852, 290.302579680594
      ), Wd = c(11.0027627927081, 2.83403791472211, 3.69153840122015, 
      6.65367358341413, 4.17920155713043, 5.35254406830185), CWT = structure(c(1L, 
      9L, 5L, 4L, 4L, 4L), .Label = c("A", "C", "E", "N", "NE", "NW", 
      "S", "SW", "U", "W"), class = "factor"), LW = structure(c(1L, 
      2L, 2L, 2L, 2L, 1L), .Label = c("0", "LW"), class = "factor"), 
      o3 = c(37.7681232441784, 31.9993881550014, 32.5950690475087, 
      37.2233826323784, 42.531864470374, 48.3496367346306)), .Names = c("LagO3", 
      "Z", "RH", "SR", "ST", "TC", "Tmx", "Wd", "CWT", "LW", "o3"), row.names = c(NA, 
      6L), class = "data.frame")

This would be my model:

   model<-  lm(formula = o3 ~ LagO3 + Z + RH + ST + TC + Tmx + Wd + CWT, 
       data = data, na.action = na.exclude)

When I try to do CV:

      cvlm.mod <- CVlm(na.omit(data),model,m=10)

I have the error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor CWT has new levels S

data$CWT has the following levels:

      levels(data$CWT)
      [1] "A"  "C"  "E"  "N"  "NE" "NW" "S"  "SW" "U"  "W"

I figured out that the error probably happens because data$CWT == "S" occurs only once among the 920 observations: if I add one more "S" value to data$CWT, CVlm works fine. But I am still stuck, as I don't know how to handle this kind of case.

Thanks again!!!

LyzandeR · Accepted Answer

This is the typical problem of a factor variable having different levels in different folds of the cross-validation. The algorithm creates dummy variables from the levels present in the training set, but the test set contains a level that the training set lacks, hence the error. The solution is to create the dummy variables yourself and then call CVlm on the resulting data.
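To see where the error comes from, here is a minimal sketch in base R that imitates a single fold by hand. The data frame `toy` and its columns are made up for illustration (they are not the asker's data), and CVlm itself is not called; the point is only that a level seen once ends up in the test rows and nowhere in the training rows.

```r
# Toy data (hypothetical): level "S" occurs exactly once, as in the question.
set.seed(1)
toy <- data.frame(
  y   = rnorm(7),
  CWT = factor(c("A", "A", "A", "N", "N", "N", "S"))
)

# Imitate one CV fold: the single "S" row is held out for testing.
train <- toy[toy$CWT != "S", ]
test  <- toy[toy$CWT == "S", ]

fit <- lm(y ~ CWT, data = train)

# lm() records only the levels actually present in the training rows.
print(fit$xlevels$CWT)

# Predicting the held-out row therefore fails; capture the error message
# instead of stopping the script.
res <- tryCatch(
  predict(fit, newdata = test),
  error = function(e) conditionMessage(e)
)
print(res)
```

The captured message is of the same kind as the one reported in the question (a new level of CWT in the prediction data).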

Solution

# df is the full data frame (called `data` in the question)
dummy_LW  <- model.matrix(~LW,  data = df)[, -1]   # dummy for LW
dummy_CWT <- model.matrix(~CWT, data = df)[, -1]   # dummies for CWT
df <- Filter(is.numeric, df)                       # drop the factors LW and CWT from the dataset
df <- cbind(df, dummy_LW, dummy_CWT)               # add the dummies instead

Then run the model as you did (make sure you add the new variable names):

model<-  lm(formula = o3 ~ LagO3 + Z + RH + ST + TC + Tmx + dummy_LW + 
                           CWTC + CWTE + CWTN + CWTNE + CWTNW + CWTS + 
                           CWTSW + CWTU + CWTW, 
            data = df, na.action = na.exclude)
cvlm.mod <- CVlm(na.omit(df), model, m = 10)   # use df, which now holds the dummy columns

Unfortunately, I cannot test the above because your sample data has too few rows (6 rows are not enough), but it should work.
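As a rough check of the idea on simulated data, the sketch below builds the dummies with model.matrix and then holds out the single "S" row by hand, again in base R and without calling CVlm. The data frame `sim`, its columns, and the hold-out split are assumptions made for illustration only. In this setup the prediction no longer errors, but the coefficient for CWTS cannot be estimated in that fold (it is NA), so the held-out prediction ignores the "S" effect and R may warn about a rank-deficient fit.

```r
# Simulated data (hypothetical): "S" appears only in the last row.
set.seed(42)
n <- 40
sim <- data.frame(
  o3  = rnorm(n),
  Z   = rnorm(n),
  CWT = factor(c(rep(c("A", "N", "W"), length.out = n - 1), "S"))
)

# Build the dummy columns once, on the full data, then drop the factor.
dummies <- model.matrix(~ CWT, data = sim)[, -1]   # columns CWTN, CWTS, CWTW
sim_d   <- cbind(sim[, c("o3", "Z")], dummies)

# Manual hold-out: train on everything except the "S" row.
train <- sim_d[-n, ]
test  <- sim_d[n, , drop = FALSE]

fit <- lm(o3 ~ Z + CWTN + CWTS + CWTW, data = train)

# CWTS is all zeros in the training rows, so its coefficient is NA.
print(coef(fit)["CWTS"])

# Prediction succeeds (possibly with a rank-deficiency warning).
pred <- suppressWarnings(predict(fit, newdata = test))
print(pred)
```

Whether an NA coefficient in some folds is acceptable depends on the analysis; merging very rare levels before fitting is another option.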

A few words about model.matrix:

It creates dummy variables for categorical data. By default it leaves one level out as the reference level (as it should), because otherwise the dummies would add up to the intercept column and be perfectly collinear with it. The [,-1] in the above code just removes the intercept, which is an unneeded column of 1s.
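A small illustration of that behaviour, using a made-up three-level factor (the data frame `f` is hypothetical):

```r
# Hypothetical factor with levels A, C, E; "A" becomes the reference level.
f <- data.frame(CWT = factor(c("A", "C", "E", "A")))

mm <- model.matrix(~ CWT, data = f)
print(colnames(mm))        # "(Intercept)" "CWTC" "CWTE"

# Dropping the first column removes the intercept and keeps the dummies.
dummies <- mm[, -1]
print(colnames(dummies))   # "CWTC" "CWTE"

# Without a reference level (~ CWT - 1) there is one indicator per level,
# and the indicators in each row sum to the intercept column of 1s.
full <- model.matrix(~ CWT - 1, data = f)
print(colnames(full))      # "CWTA" "CWTC" "CWTE"
print(all(rowSums(full) == mm[, "(Intercept)"]))
```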
