Pipeline does not get converted to PMML properly using JPMML and Pyspark

Question

I am using PySpark and the JPMML library to generate PMML documents from my pipeline models, but I don't think they are being generated properly. To test this, I created two pipeline models using the same dataset and the same classifier, as below.

pipeline = Pipeline(stages=[assembler, slicer, pca, binarizer, assembler2, formula, classifier])
pipeline2 = Pipeline(stages=[assembler, slicer, binarizer, assembler2, formula, classifier])

But when I generate the PMML files using the following code snippet, it outputs two identical files, which suggests there is no difference between the models. I am confused. The generated PMML files should be different if it's converting properly, right?

pipelineModel1 = pipeline.fit(df)
pmmlBytes = toPMMLBytes(spark, df, pipelineModel1)
with open('test.pmml', 'wb') as output:
    output.write(pmmlBytes)

pipelineModel2 = pipeline2.fit(df)
pmmlBytes2 = toPMMLBytes(spark, df, pipelineModel2)
with open('test1.pmml', 'wb') as output:
    output.write(pmmlBytes2)

user1808924 · Accepted Answer

The generated PMML files should be different if it's converting properly, right?

Not necessarily. It all depends on your classifier: the PCA-generated columns may simply be left out of the PMML document because they do not contribute to separating the classes. To test this hypothesis, try different classifiers, such as DecisionTreeClassifier vs. LogisticRegression.
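One way to check which fields actually made it into each document is to list the field names they declare and diff them. The following is a minimal sketch using only the Python standard library; the helper names `pmml_field_names` and `diff_pmml_fields` are made up for illustration and are not part of JPMML.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Strip the "{namespace}" prefix that ElementTree puts on tags.
    return tag.rsplit('}', 1)[-1]


def pmml_field_names(pmml_bytes):
    # Collect the names of declared, derived and model input fields.
    root = ET.fromstring(pmml_bytes)
    names = {"DataField": set(), "DerivedField": set(), "MiningField": set()}
    for element in root.iter():
        kind = local_name(element.tag)
        if kind in names and element.get("name") is not None:
            names[kind].add(element.get("name"))
    return names


def diff_pmml_fields(pmml_a, pmml_b):
    # Field names that appear in only one of the two documents, per kind.
    fields_a = pmml_field_names(pmml_a)
    fields_b = pmml_field_names(pmml_b)
    return {kind: fields_a[kind] ^ fields_b[kind] for kind in fields_a}
```

Calling `diff_pmml_fields(pmmlBytes, pmmlBytes2)` on the two byte strings from the question returns one set per field kind. If every set is empty, both documents reference the same fields, which would be consistent with the PCA columns having been dropped.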

Also, the only way to verify whether a PMML document is correct is to execute it and compare its results against the original Apache Spark ML results.
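The comparison step could look roughly like the sketch below. It assumes you have already collected per-row numeric outputs (for example class probabilities) from both sides as plain Python lists. Scoring the PMML document needs a separate PMML evaluator such as JPMML-Evaluator, which is not shown here, and `compare_predictions` is a hypothetical helper name.

```python
import math


def compare_predictions(spark_rows, pmml_rows, rel_tol=1e-9, abs_tol=1e-9):
    # Return the indices of rows whose values differ beyond the tolerance.
    if len(spark_rows) != len(pmml_rows):
        raise ValueError("both result sets must contain the same number of rows")
    mismatches = []
    for index, (spark_row, pmml_row) in enumerate(zip(spark_rows, pmml_rows)):
        same_length = len(spark_row) == len(pmml_row)
        if not same_length or not all(
            math.isclose(s, p, rel_tol=rel_tol, abs_tol=abs_tol)
            for s, p in zip(spark_row, pmml_row)
        ):
            mismatches.append(index)
    return mismatches
```

An empty list means the two sets of results agree row by row within the tolerance. Both lists must be in the same row order, so score the same input rows on both sides.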
