Reputation: 415
I am using PySpark and the JPMML library to generate PMML models from my pipeline models, but I don't think the conversion is working properly. To test this, I created two different pipeline models using the same dataset and the same classifier, as below.
pipeline = Pipeline(stages=[assembler, slicer, pca, binarizer, assembler2, formula, classifier])
pipeline2 = Pipeline(stages=[assembler, slicer, binarizer, assembler2, formula, classifier])
But when I generate the PMML files using the following code snippet, it outputs two identical files, which would mean there is no difference between the models. I am confused. The generated PMML files should be different if it's converting properly, right?
pipelineModel1 = pipeline.fit(df)
pmmlBytes = toPMMLBytes(spark, df, pipelineModel1)
with open('test.pmml', 'wb') as output:
    output.write(pmmlBytes)
pipelineModel2 = pipeline2.fit(df)
pmmlBytes2 = toPMMLBytes(spark, df, pipelineModel2)
with open('test1.pmml', 'wb') as output:
    output.write(pmmlBytes2)
Upvotes: 1
Views: 456
Reputation: 4926
The generated PMML files should be different if it's converting properly, right?
Not necessarily. It all depends on your classification function: the PCA-generated columns may simply be left out of the PMML document because they do not "contribute" to separating the classes. To test this hypothesis, try different classification functions, such as DecisionTreeClassifier vs. LogisticRegression.
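One way to check whether the PCA columns made it into either file is to compare the field names declared in the two documents. The following is a minimal sketch using only the Python standard library. DataField, DerivedField and MiningField are standard PMML elements, but the helper names (field_names, compare_fields) are made up for illustration, and the idea that PCA output would show up as DerivedField entries is an assumption about how the converter encodes it.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Strip the "{namespace}" prefix that ElementTree puts in front of tag names.
    return tag.rsplit('}', 1)[-1]


def field_names(pmml_source, element):
    # Collect the "name" attribute of every element with the given local name,
    # regardless of which PMML schema version (namespace) the document uses.
    root = ET.fromstring(pmml_source)
    return {el.get("name") for el in root.iter() if local_name(el.tag) == element}


def compare_fields(pmml_a, pmml_b):
    # Report, per element type, which field names appear in only one document.
    report = {}
    for element in ("DataField", "DerivedField", "MiningField"):
        a = field_names(pmml_a, element)
        b = field_names(pmml_b, element)
        report[element] = {"only_in_a": a - b, "only_in_b": b - a}
    return report
```

To use it on the files from the question, read test.pmml and test1.pmml in binary mode and pass the two byte strings to compare_fields. If every set in the report is empty, the two documents declare the same fields, which would be consistent with the PCA columns having been dropped.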
Also, the only way to verify whether a PMML document is correct is to execute it and compare its results against the original Apache Spark ML results.
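The comparison step itself can be sketched as below. This assumes you have already collected two lists of numeric predictions for the same rows in the same order: one from the fitted Spark pipeline (for example via pipelineModel1.transform(df)) and one from running the PMML file in a PMML scoring engine of your choice. How you obtain those lists is not shown here, and find_mismatches is a hypothetical helper name, not part of any library.

```python
import math


def find_mismatches(spark_predictions, pmml_predictions, rel_tol=1e-9, abs_tol=1e-9):
    # Both lists must describe the same rows in the same order.
    if len(spark_predictions) != len(pmml_predictions):
        raise ValueError("prediction lists must have the same length")
    mismatches = []
    for index, (expected, actual) in enumerate(zip(spark_predictions, pmml_predictions)):
        # Tolerate tiny floating-point differences between the two engines.
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((index, expected, actual))
    return mismatches
```

An empty result means the PMML document reproduced the Spark predictions on the rows tested. Any entries returned point at the row indices to inspect.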
Upvotes: 1