Laura

Reputation: 139

randomForest using factor variables as continuous?

I am using the package randomForest to produce habitat suitability models for species. I thought everything was working as it should until I started looking at individual trees with getTree(). The documentation (see page 4 of the randomForest vignette) states that for categorical variables, the split point will be an integer, which makes sense. However, in the trees I have looked at for my results, this is not the case.

The data frame I used to build the model was formatted with categorical variables as factors:

> str(df.full)
'data.frame':   27087 obs. of  23 variables:
 $ sciname   : Factor w/ 2 levels "Laterallus jamaicensis",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ estid     : Factor w/ 2 levels "7694","psabs": 1 1 1 1 1 1 1 1 1 1 ...
 $ pres      : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
 $ stratum   : Factor w/ 89 levels "poly_0","poly_1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ra        : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 3 3 3 3 3 ...
 $ eoid      : Factor w/ 2 levels "0","psabs": 1 1 1 1 1 1 1 1 1 1 ...
 $ avd3200   : num  0.1167 0.0953 0.349 0.1024 0.3765 ...
 $ biocl05   : num  330 330 330 330 330 ...
 $ biocl06   : num  66 65.8 66 65.8 66 ...
 $ biocl08   : num  277 277 277 277 277 ...
 $ biocl09   : num  170 170 170 170 170 ...
 $ biocl13   : num  186 186 185 186 185 ...
 $ cti       : num  19.7 19 10.4 16.4 14.7 ...
 $ dtnhdwat  : num  168 240 39 206 309 ...
 $ dtwtlnd   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ e2em1n99  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ems30_53  : Factor w/ 53 levels "0","602","2206",..: 19 4 17 4 19 19 4 4 19 19 ...
 $ ems5607_46: num  0 0 1 0 0.4 ...
 $ ksat      : num  0.21 0.21 0.21 0.21 0.21 ...
 $ lfevh_53  : Factor w/ 53 levels "0","11","16",..: 38 38 38 38 38 38 38 38 38 38 ...
 $ ned       : num  1.46 1.48 1.54 1.48 1.47 ...
 $ soilec    : num  14.8 14.8 19.7 14.8 14.8 ...
 $ wtlnd_53  : Factor w/ 50 levels "0","3","7","11",..: 4 31 7 31 7 31 7 7 31 31 ...

This was the function call:

# rfStratum and sampSizeVec were previously defined
> rf.full$call
randomForest(x = df.full[, c(7:23)], y = df.full[, 3], 
ntree = 2000, mtry = 7, replace = TRUE, strata = rfStratum, 
sampsize = sampSizeVec, importance = TRUE, norm.votes = TRUE)

Here are the first 15 rows of an example tree (note that the split variables in rows 1, 5, and 15 are categorical and should therefore have integer split points):

> tree100
   left daughter right daughter split var  split point status prediction
1              2              3  ems30_53 9.007198e+15      1       <NA>
2              4              5   biocl08 2.753206e+02      1       <NA>
3              6              7   biocl06 6.110518e+01      1       <NA>
4              8              9   biocl06 1.002722e+02      1       <NA>
5             10             11  lfevh_53 9.006718e+15      1       <NA>
6              0              0      <NA> 0.000000e+00     -1          0
7             12             13   biocl05 3.310025e+02      1       <NA>
8             14             15       ned 2.814818e+00      1       <NA>
9              0              0      <NA> 0.000000e+00     -1          1
10            16             17   avd3200 4.199712e-01      1       <NA>
11            18             19  e2em1n99 1.724138e-02      1       <NA>
12            20             21   biocl09 1.738916e+02      1       <NA>
13            22             23       ned 8.837864e-01      1       <NA>
14            24             25   biocl05 3.442437e+02      1       <NA>
15            26             27  lfevh_53 9.007199e+15      1       <NA>
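
For reference, a tree like the one above can be extracted roughly as follows (a sketch; labelVar = TRUE, which returns variable names instead of column indices, and the tree index of 100 are assumptions about how tree100 was produced):

# Pull a single tree out of the forest for inspection
library(randomForest)
tree100 <- getTree(rf.full, k = 100, labelVar = TRUE)
head(tree100, 15)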

Additional information: I encountered this while investigating an error I was getting when predicting the results back onto the study area, which stated that the types of predictors in the new data did not match those of the training data. I have done 6 other iterations of this model using the same data frame and scripts (just with different subsets of predictors) and never got this message before. The only difference I could find between the randomForest object in this run and those from the other runs is that the rf.full$forest$ncat components are stored as double instead of integer:

> for(i in 1:length(rf.full$forest$ncat)){
+   cat(names(rf.full$forest$ncat)[[i]], ": ", class(rf.full$forest$ncat[[i]]), "\n")
+ }
avd12800 :  numeric 
cti :  numeric 
dtnhdwat :  numeric 
dtwtlnd :  numeric 
ems2207_99 :  numeric 
ems30_53 :  numeric 
ems5807_99 :  numeric 
hydgrp :  numeric 
ksat :  numeric 
lfevh_53 :  numeric 
ned :  numeric 
soilec :  numeric 
wtlnd_53 :  numeric 
> 
> rf.full$forest$ncat
  avd12800        cti   dtnhdwat    dtwtlnd ems2207_99   ems30_53 ems5807_99     hydgrp       ksat   lfevh_53 
         1          1          1          1          1         53          1          1          1         53 
   ned     soilec   wtlnd_53 
     1          1         50
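
A more direct way to check how ncat is stored is typeof() or storage.mode(); coercing it back to integer does not change the values, only the storage type. A sketch:

# How ncat is stored in this run ("double" here; the earlier runs showed integer)
typeof(rf.full$forest$ncat)
# Coercion back to integer keeps the same values (1, 53, 50, ...)
storage.mode(rf.full$forest$ncat) <- "integer"
typeof(rf.full$forest$ncat)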

However, xlevels (which appears to be a list of the predictor variables used and their types) shows the correct data type for each predictor:

> for(i in 1:length(rf.full$forest$xlevels)){
+   cat(names(rf.full$forest$xlevels)[[i]], ": ", class(rf.full$forest$xlevels[[i]]),"\n")
+ }
avd12800 :  numeric 
cti :  numeric 
dtnhdwat :  numeric 
dtwtlnd :  numeric 
ems2207_99 :  numeric 
ems30_53 :  character 
ems5807_99 :  numeric 
hydgrp :  character 
ksat :  numeric 
lfevh_53 :  character 
ned :  numeric 
soilec :  numeric 
wtlnd_53 :  character 

# example continuous predictor
> rf.full$forest$xlevels$avd12800
[1] 0
# example categorical predictor
> rf.full$forest$xlevels$ems30_53
 [1] "0"    "602"  "2206" "2207" "4504" "4507" "4702" "4704" "4705" "4706" "4707" "4717" "5207" "5307" "5600"
[16] "5605" "5607" "5616" "5617" "5707" "5717" "5807" "5907" "6306" "6307" "6507" "6600" "7002" "7004" "9107"
[31] "9116" "9214" "9307" "9410" "9411" "9600" "4607" "4703" "6402" "6405" "6407" "6610" "7005" "7102" "7104"
[46] "7107" "9000" "9104" "9106" "9124" "9187" "9301" "9505"

The ncat component is simply a vector giving the number of categories per variable, with 1 for continuous variables (as noted here), so it doesn't seem like it should matter whether it is stored as an integer or a double, but it does seem like all of this might be related.
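
As an aside, one way to chase down the "types of predictors do not match" error is to compare the factor status and level counts of the prediction data against what the forest stored at training time. A sketch, where df.predict is a hypothetical stand-in for the new data being predicted onto:

# Number of categories per predictor in the new data (1 for non-factors),
# taken in the same order as the stored ncat vector
ncat.new <- sapply(df.predict[, names(rf.full$forest$ncat)],
                   function(v) if (is.factor(v)) nlevels(v) else 1)
# A mismatch here (factor vs. numeric, or a different number of levels)
# is a likely source of that prediction error
cbind(train = rf.full$forest$ncat, new = ncat.new)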

Questions

1) Shouldn't the split point for categorical predictors in any given tree of a randomForest forest be an integer? If so, any thoughts as to why the factors in the data frame passed to the randomForest call here are not being treated as such?

2) Does the storage type (double vs. integer) of the ncat component of a randomForest object matter in any way for model building? And any thoughts as to what could cause it to switch from integer in the first 6 runs to double in this last run (each run using a different subset of the same data)?

Upvotes: 2

Views: 1427

Answers (1)

user1808924

Reputation: 4926

The randomForest::randomForest algorithm encodes low-cardinality (up to 32 categories) and high-cardinality (more than 32 categories) categorical splits differently. Note that all of your "problematic" features belong to the latter class and are encoded using 64-bit floating-point values.

While the console output doesn't make sense to a human observer, the randomForest model object/algorithm itself is correct (i.e., it treats those variables as categorical) and makes correct predictions.
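
If you want to check by hand which levels a categorical split sends to the left daughter, the binary-expansion rule from the randomForest documentation can be applied directly. Below is a minimal sketch: decode_split is a hypothetical helper, it assumes the same binary-expansion convention carries over to the double-encoded high-cardinality splits (consistent with the 53-category limit, since doubles represent integers exactly only up to 2^53), and it reads the exact split value from getTree() rather than the rounded console printout.

# Decode a categorical split point: levels whose bit is set go to the left daughter
decode_split <- function(split.point, n.levels) {
  bits <- integer(n.levels)
  for (i in seq_len(n.levels)) {
    bits[i] <- split.point %% 2
    split.point <- split.point %/% 2
  }
  which(bits == 1)
}

gt <- getTree(rf.full, k = 100, labelVar = TRUE)
lvls <- rf.full$forest$xlevels$ems30_53
# Levels of ems30_53 sent to the left daughter at the root split (row 1)
lvls[decode_split(gt[1, "split point"], length(lvls))]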

If you want to investigate the structure of decision tree and decision tree ensemble models, you might consider exporting them to the PMML data format. For example, you can use the r2pmml package for this:

library("r2pmml")
r2pmml(rf.full, "MyRandomForest.pmml")

Then open MyRandomForest.pmml in a text editor, and you will get a nice overview of the internals of your model (branches, split conditions, leaf values, etc.).

Upvotes: 2
