prk

Reputation: 329

Can I set stage names in Spark ML Pipelines?

I'm starting to create more complex ML pipelines, using the same type of pipeline stage multiple times. Is there a way to set the names of the stages so that someone else can easily interrogate a saved pipeline and find out what is going on? For example:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

vecAssembler1 = VectorAssembler(inputCols=["P1", "P2"], outputCol="features1")
vecAssembler2 = VectorAssembler(inputCols=["P3", "P4"], outputCol="features2")
lr_1 = LogisticRegression(labelCol="L1")
lr_2 = LogisticRegression(labelCol="L2")
pipeline = Pipeline(stages=[vecAssembler1, vecAssembler2, lr_1, lr_2])
print(pipeline.stages)

This returns something like:

[VectorAssembler_4205a9d090177e9c54ba, VectorAssembler_42b8aa29277b380a8513, LogisticRegression_42d78f81ae072747f88d, LogisticRegression_4d4dae2729edc37dc1f3]

but what I would like to do is something like this:

pipeline = Pipeline(stages=[vecAssembler1, vecAssembler2, lr_1, lr_2], names=["VectorAssembler for predicting L1", "VectorAssembler for predicting L2", "LogisticRegression for L1", "LogisticRegression for L2"])

so that a saved pipeline model can be loaded by a third party and they will get nice descriptions:

print(pipeline.stages)
# [VectorAssembler for predicting L1, VectorAssembler for predicting L2, LogisticRegression for L1, LogisticRegression for L2]

Upvotes: 0

Views: 1337

Answers (1)

prudenko

Reputation: 1701

You can use the _resetUid method to rename each transformer or estimator:

vecAssembler1 = VectorAssembler(inputCols=["P1", "P2"], outputCol="features1")
# Replace the auto-generated uid with a human-readable name
vecAssembler1._resetUid("VectorAssembler for predicting L1")

By default the uid comes from Java's random UID generator, which is why you see names like VectorAssembler_4205a9d090177e9c54ba.
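
For completeness, here is a minimal sketch (reusing the stage definitions from the question) that renames every stage before assembling the pipeline; the new uids then show up when the stages are printed from Python:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Same stages as in the question
vecAssembler1 = VectorAssembler(inputCols=["P1", "P2"], outputCol="features1")
vecAssembler2 = VectorAssembler(inputCols=["P3", "P4"], outputCol="features2")
lr_1 = LogisticRegression(labelCol="L1")
lr_2 = LogisticRegression(labelCol="L2")

# Rename each stage; _resetUid simply overwrites the stage's uid
for stage, name in [(vecAssembler1, "VectorAssembler for predicting L1"),
                    (vecAssembler2, "VectorAssembler for predicting L2"),
                    (lr_1, "LogisticRegression for L1"),
                    (lr_2, "LogisticRegression for L2")]:
    stage._resetUid(name)

pipeline = Pipeline(stages=[vecAssembler1, vecAssembler2, lr_1, lr_2])
print(pipeline.stages)
# [VectorAssembler for predicting L1, VectorAssembler for predicting L2,
#  LogisticRegression for L1, LogisticRegression for L2]

Keep in mind that the leading underscore marks _resetUid as a private helper, so its behaviour may change between Spark versions.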

Upvotes: 1
