Reputation: 139
I am using the package randomForest to produce habitat suitability models for species. I thought everything was working as it should until I started looking at individual trees with getTree(). The documentation (see page 4 of the randomForest vignette) states that for categorical variables, the split point will be an integer, which makes sense. However, in the trees I have looked at for my results, this is not the case.
The data frame I used to build the model was formatted with categorical variables as factors:
> str(df.full)
'data.frame': 27087 obs. of 23 variables:
$ sciname : Factor w/ 2 levels "Laterallus jamaicensis",..: 1 1 1 1 1 1 1 1 1 1 ...
$ estid : Factor w/ 2 levels "7694","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ pres : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
$ stratum : Factor w/ 89 levels "poly_0","poly_1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ra : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 3 3 3 3 3 ...
$ eoid : Factor w/ 2 levels "0","psabs": 1 1 1 1 1 1 1 1 1 1 ...
$ avd3200 : num 0.1167 0.0953 0.349 0.1024 0.3765 ...
$ biocl05 : num 330 330 330 330 330 ...
$ biocl06 : num 66 65.8 66 65.8 66 ...
$ biocl08 : num 277 277 277 277 277 ...
$ biocl09 : num 170 170 170 170 170 ...
$ biocl13 : num 186 186 185 186 185 ...
$ cti : num 19.7 19 10.4 16.4 14.7 ...
$ dtnhdwat : num 168 240 39 206 309 ...
$ dtwtlnd : num 0 0 0 0 0 0 0 0 0 0 ...
$ e2em1n99 : num 0 0 0 0 0 0 0 0 0 0 ...
$ ems30_53 : Factor w/ 53 levels "0","602","2206",..: 19 4 17 4 19 19 4 4 19 19 ...
$ ems5607_46: num 0 0 1 0 0.4 ...
$ ksat : num 0.21 0.21 0.21 0.21 0.21 ...
$ lfevh_53 : Factor w/ 53 levels "0","11","16",..: 38 38 38 38 38 38 38 38 38 38 ...
$ ned : num 1.46 1.48 1.54 1.48 1.47 ...
$ soilec : num 14.8 14.8 19.7 14.8 14.8 ...
$ wtlnd_53 : Factor w/ 50 levels "0","3","7","11",..: 4 31 7 31 7 31 7 7 31 31 ...
This was the function call:
# rfStratum and sampSizeVec were previously defined
> rf.full$call
randomForest(x = df.full[, c(7:23)], y = df.full[, 3],
ntree = 2000, mtry = 7, replace = TRUE, strata = rfStratum,
sampsize = sampSizeVec, importance = TRUE, norm.votes = TRUE)
Here are the first 15 lines of an example tree (note that the variables in lines 1, 5, and 15 should be categorical, i.e., they should have integer split values):
> tree100
left daughter right daughter split var split point status prediction
1 2 3 ems30_53 9.007198e+15 1 <NA>
2 4 5 biocl08 2.753206e+02 1 <NA>
3 6 7 biocl06 6.110518e+01 1 <NA>
4 8 9 biocl06 1.002722e+02 1 <NA>
5 10 11 lfevh_53 9.006718e+15 1 <NA>
6 0 0 <NA> 0.000000e+00 -1 0
7 12 13 biocl05 3.310025e+02 1 <NA>
8 14 15 ned 2.814818e+00 1 <NA>
9 0 0 <NA> 0.000000e+00 -1 1
10 16 17 avd3200 4.199712e-01 1 <NA>
11 18 19 e2em1n99 1.724138e-02 1 <NA>
12 20 21 biocl09 1.738916e+02 1 <NA>
13 22 23 ned 8.837864e-01 1 <NA>
14 24 25 biocl05 3.442437e+02 1 <NA>
15 26 27 lfevh_53 9.007199e+15 1 <NA>
Additional information: I encountered this while investigating an error I was getting when predicting the results back onto the study area. The error stated that the types of predictors in the new data did not match those of the training data. I have done 6 other iterations of this model using the same data frame and scripts (just with different subsets of predictors) and have never gotten this message before. The only difference I could find between the randomForest object in this run and those in the other runs is that the rf.full$forest$ncat components are stored as double instead of integer:
> for(i in 1:length(rf.full$forest$ncat)){
+ cat(names(rf.full$forest$ncat)[[i]], ": ", class(rf.full$forest$ncat[[i]]), "\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : numeric
ems5807_99 : numeric
hydgrp : numeric
ksat : numeric
lfevh_53 : numeric
ned : numeric
soilec : numeric
wtlnd_53 : numeric
>
> rf.full$forest$ncat
avd12800 cti dtnhdwat dtwtlnd ems2207_99 ems30_53 ems5807_99 hydgrp ksat lfevh_53
1 1 1 1 1 53 1 1 1 53
ned soilec wtlnd_53
1 1 50
However, xlevels (which appears to be a list of the predictor variables used and their types) are all showing the correct datatype for each predictor.
> for(i in 1:length(rf.full$forest$xlevels)){
+ cat(names(rf.full$forest$xlevels)[[i]], ": ", class(rf.full$forest$xlevels[[i]]),"\n")
+ }
avd12800 : numeric
cti : numeric
dtnhdwat : numeric
dtwtlnd : numeric
ems2207_99 : numeric
ems30_53 : character
ems5807_99 : numeric
hydgrp : character
ksat : numeric
lfevh_53 : character
ned : numeric
soilec : numeric
wtlnd_53 : character
# example continuous predictor
> rf.full$forest$xlevels$avd12800
[1] 0
# example categorical predictor
> rf.full$forest$xlevels$ems30_53
[1] "0" "602" "2206" "2207" "4504" "4507" "4702" "4704" "4705" "4706" "4707" "4717" "5207" "5307" "5600"
[16] "5605" "5607" "5616" "5617" "5707" "5717" "5807" "5907" "6306" "6307" "6507" "6600" "7002" "7004" "9107"
[31] "9116" "9214" "9307" "9410" "9411" "9600" "4607" "4703" "6402" "6405" "6407" "6610" "7005" "7102" "7104"
[46] "7107" "9000" "9104" "9106" "9124" "9187" "9301" "9505"
The ncat component is simply a vector of the number of categories per variable with 1 for continuous variables (as noted here), so it doesn't seem like it should matter if that is stored as an integer or a double, but it seems like this might all be related.
Questions
1) Shouldn't the split point for categorical predictors in any given tree of a randomForest forest be an integer, and if yes, any thoughts as to why factors in the data frame used as input to the randomForest call here are not being used as such?
2) Does the number type (double vs integer) of the ncat component of a randomForest object matter in any way related to model building, and any thoughts as to what could cause this to switch from integer in the first 6 runs to double in this last run (with each run containing different subsets of the same data)?
Upvotes: 2
Views: 1427
Reputation: 4926
The randomForest::randomForest algorithm encodes low-cardinality (up to 32 categories) and high-cardinality (more than 32 categories, up to the package's limit of 53) categorical splits differently. Note that all of your "problematic" features belong to the latter class and are encoded using 64-bit floating point values.
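As a rough illustration of why such split points are still integers: the getTree() documentation describes a categorical split point as an integer whose binary expansion marks which categories go to the left daughter node. Below is a minimal sketch of decoding such a value. Note that decode_split is a hypothetical helper (not part of randomForest), and the level names in the usage line are made up.

```r
# Hypothetical helper: decode a categorical split point (as shown by getTree())
# into the factor levels that are sent to the left daughter node.
# Assumes bit i (counting from the lowest bit) corresponds to level i.
decode_split <- function(split_point, levels) {
  n <- length(levels)
  go_left <- logical(n)
  x <- split_point
  for (i in seq_len(n)) {
    half <- floor(x / 2)
    go_left[i] <- (x - 2 * half) == 1  # lowest remaining bit of the code
    x <- half                          # shift right by one bit
  }
  levels[go_left]
}

# 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3, so levels 1, 3 and 4 go left
decode_split(13, c("a", "b", "c", "d"))
```

With 53 levels the code can be as large as 2^53 - 1 (about 9.007199e+15), which would explain why the values printed for ems30_53 and lfevh_53 look like huge floating point numbers. In your case, the second argument would presumably be something like rf.full$forest$xlevels$ems30_53.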
While the console output doesn't make sense to a human observer, the randomForest model object/algorithm itself is correct (i.e., it treats those variables as categorical) and is making correct predictions.
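If you want to check that such values really are whole numbers despite the scientific notation, here is a small sketch using made-up split points (these are illustrative values, not taken from rf.full):

```r
# Hypothetical values resembling the "split point" column of getTree():
# a 53-level categorical code, a continuous threshold, a small categorical code
split_points <- c(2^53 - 1, 275.3206, 13)

# Categorical split codes have no fractional part
is_whole <- split_points == floor(split_points)

# Print every digit instead of the rounded 9.007199e+15
full_digits <- sprintf("%.0f", split_points[1])

is_whole
full_digits
```

Doubles represent integers exactly up to 2^53, so no information is lost by storing these codes as 64-bit floating point values.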
If you want to investigate the structure of decision tree and decision tree ensemble models, you might consider exporting them to the PMML data format. For example, you can use the r2pmml package for this:
library("r2pmml")
r2pmml(rf.full, "MyRandomForest.pmml")
Then open MyRandomForest.pmml in a text editor, and you will have a nice overview of the internals of your model (branches, split conditions, leaf values, etc.).
Upvotes: 2