Reputation: 1
I am trying to generate a PMML file from a random forest model built in R. I am using the randomForest package 4.6-12 and the latest version of the pmml package for R, but every time I try to generate the PMML I get an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem. Any thoughts? Thanks in advance,
Alvaro
Upvotes: 0
Views: 466
Reputation: 26
My tests show that this problem appears when the number of levels of a categorical variable exceeds the maximum that the randomForest function allows. The split value stored in the forest sublist is then no longer a positive integer, which the split definition for categorical variables requires. Reducing the number of levels fixed the problem.
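A quick way to check for this is to count the levels of every factor column. The following is only a sketch: the small `train` data frame is a hypothetical stand-in for the real one, and the limit of 53 is an assumption that should be verified against the documentation of the installed randomForest version.

```r
# Hypothetical stand-in for the `train` data frame from the question.
train <- data.frame(
  TARGET = factor(c("a", "b", "a", "b")),
  x1 = c(1.5, 2.0, 3.1, 4.7),
  f1 = factor(c("u", "v", "w", "u"))
)

# Assumed limit on levels per categorical predictor; verify for your version.
max_levels <- 53

# Count the levels of each factor column.
is_fac <- vapply(train, is.factor, logical(1))
n_levels <- vapply(train[is_fac], nlevels, integer(1))

# Names of the factor columns that exceed the assumed limit.
too_many <- names(n_levels)[n_levels > max_levels]

print(n_levels)
print(too_many)
```

Any column listed in `too_many` would be a candidate for merging or dropping levels before fitting the model.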
Upvotes: 0
Reputation: 1
I may have found the origin of this problem. My dataset has approximately 500000 events and 30 variables. Ten of these variables are factors, and some of them have sparsely populated levels, in some cases with as few as one event.
I built several random forest models, each time adding one extra variable to the model. Adding the numerical variables caused no problem when generating the PMML, and neither did the categorical variables whose levels are all well populated. When I tried to include categorical variables with sparsely populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose the origin of the problem is that, when a tree is built on a variable with sparsely populated levels, in some situations there is no split because there is only one case. The randomForest package knows how to handle these cases, but the pmml package does not.
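One possible workaround, which I have not verified against the pmml package, is to merge the sparsely populated levels into a single level before fitting. This is a minimal sketch: the factor `f`, the threshold `min_count` and the label "Other" are all hypothetical choices.

```r
# Hypothetical factor with two sparsely populated levels ("c" and "d").
f <- factor(c("a", "a", "a", "b", "b", "b", "c", "d"))

# Assumed threshold: levels with fewer cases than this count as rare.
min_count <- 2

# Count the cases per level and pick out the rare levels.
counts <- table(f)
rare <- names(counts)[counts < min_count]

# Rename every rare level to "Other"; R merges the duplicated level names.
f_lumped <- f
levels(f_lumped)[levels(f_lumped) %in% rare] <- "Other"

print(table(f_lumped))
```

The same steps could be applied to each factor column of the training data, after which the model is refitted and converted again.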
Upvotes: 0
Reputation: 26
The pmml code assumes that the data type of each variable is numeric, simple logical or factor. It won't work if your data are of some other type, DateTime for example. It would help if your problem were reproducible; ideally you would provide the dataset you used. If that is not possible, at least provide a sample or a description of it, for example a summary. You should also consider emailing the package maintainers directly.
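To see whether any column falls outside those three types, something along these lines may help. It is a sketch only: the `train` data frame below is a hypothetical example with a date-time column added on purpose, and `supported` is an illustrative helper, not part of the pmml package.

```r
# Hypothetical data frame; `when` is a date-time column added for illustration.
train <- data.frame(
  TARGET = factor(c("a", "b")),
  x1 = c(1.5, 2.0),
  flag = c(TRUE, FALSE),
  when = as.POSIXct(c("2016-01-01", "2016-01-02"), tz = "UTC")
)

# Illustrative helper: TRUE for numeric, logical or factor columns.
supported <- function(col) {
  is.numeric(col) || is.logical(col) || is.factor(col)
}

# Flag the columns whose type is not one of the three listed above.
ok <- vapply(train, supported, logical(1))
unsupported <- names(ok)[!ok]

# Show the first class of each column, then the flagged column names.
print(vapply(train, function(col) class(col)[1], character(1)))
print(unsupported)
```

Columns listed in `unsupported` would need to be converted or removed before building the model.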
Upvotes: 0
Reputation: 4926
It looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (e.g. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside the "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try the r2pmml package, as it is much better at handling the randomForest model type.
Upvotes: 1