How do I tell r2pmml what dataType my variables are?

I want to export an R model in PMML format and use it elsewhere. The other software requires some variables to be integers, but all numeric variables are exported as double, even when they are explicitly integer in my dataset.

I tried to work around this by changing the types manually (or with a regex) and deleting every decimal. The software accepts the new format, but the predictions are not what I expect (because I just deleted the decimals), so I want to solve this directly inside R.

How can I force my variables to be a certain dataType (particularly "integer")?

This is a code example that exports a .pmml:

# Required packages -------------------------------------------------------

library(tidyverse)
library(r2pmml)
library(randomForest)
library(nnet)

# Dataset creation --------------------------------------------------------

set.seed(1)
data = data.frame(
  var1 = round(runif(10) * 100),
  var2 = round(runif(10) * 100),
  y = round(runif(10) * 100)
)

data =
  data %>%
  mutate(var1 = as.integer(var1),
         var2 = as.integer(var2))

# Structure check ---------------------------------------------------------

str(data)

# Neural Network and Random Forest models ---------------------------------

# size is the number of hidden units; linout = TRUE gives a linear output unit (regression)
nn =
  nnet(
    y ~ .,
    data = data,
    size = 2,
    linout = TRUE
  )

rf =
  randomForest(y ~ .,
               data = data)

# pmml export -------------------------------------------------------------

r2pmml(rf,
       file = "rf.pmml",
       dataset = data,
       verbose = TRUE)

r2pmml(nn,
       file = "nn.pmml",
       dataset = data,
       verbose = TRUE)

I expect my PMML to have variables var1 and var2 as integer, but they end up as double in this section of the output:

    <DataDictionary>
        <DataField name="y" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>

and I got decimal numbers in

        <NeuralLayer activationFunction="logistic">
            <Neuron id="hidden/1" bias="-0.4112317232771385">
                <Con from="input/1" weight="-6.591508925328581"/>
                <Con from="input/2" weight="-31.805468580606753"/>
            </Neuron>

but I'm not sure whether those should be integer or double.

Upvotes: 0

Views: 204

Answers (1)

user1808924

Reputation: 4926

Since the R2PMML package and its underlying JPMML-R library are open source, you can always look at the source code (of the version that you're using) to see how things are implemented. In the case of the nnet model type, you could look at the org.jpmml.rexp.NNetConverter class.

Essentially, there are two possibilities. First, the R model object (the nnet object saved to an RDS file) may not contain any feature type information at all. Second, this information might be there, but the converter is not using it yet; instead it falls back to the default data type of the nnet algorithm (all numeric computation is done using the double data type, so it seems like a good choice for storing in the PMML document).

Where exactly is it recorded in your R model object(s) that features var1 and var2 are integers (instead of doubles)? If you think you've found the answer, consider opening a feature request with the JPMML-R project.
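One way to start looking is to compare what the data frame knows about the columns with what the fitted model object keeps. The following is only a rough sketch, assuming the formula interface of nnet is used as in the question; the names column_types and model_classes are made up for illustration, and the "dataClasses" attribute of the stored terms object is just one place where type information could live, not necessarily the place the converter reads from.

```r
library(nnet)

set.seed(1)
data <- data.frame(
  var1 = as.integer(round(runif(10) * 100)),
  var2 = as.integer(round(runif(10) * 100)),
  y    = round(runif(10) * 100)
)

# Storage type of each column as recorded in the data frame itself.
column_types <- sapply(data, typeof)

nn <- nnet(y ~ ., data = data, size = 2, linout = TRUE, trace = FALSE)

# The formula interface stores a terms object on the model; its "dataClasses"
# attribute holds one class label per variable, as assigned by model.frame().
model_classes <- attr(nn$terms, "dataClasses")

print(column_types)
print(model_classes)
```

If model_classes reports the same label for var1, var2 and y while column_types still distinguishes integer from double, then the integer information was dropped when the model frame was built, which would point to the first possibility above.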

Upvotes: 1
