Reputation: 153
Hello, I'm trying to load a saved pipeline with PipelineModel in PySpark.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Keep only the columns needed for the recommendation features
selectedDf = reviews \
    .select("reviewerID", "asin", "overall")

# Make pipeline to build recommendation
reviewerIndexer = StringIndexer(
    inputCol="reviewerID",
    outputCol="intReviewer"
)
productIndexer = StringIndexer(
    inputCol="asin",
    outputCol="intProduct"
)
pipeline = Pipeline(stages=[reviewerIndexer, productIndexer])
pipelineModel = pipeline.fit(selectedDf)
transformedFeatures = pipelineModel.transform(selectedDf)

# model_name is defined elsewhere in the original script
pipeline_model_name = './' + model_name + 'pipeline'
pipelineModel.save(pipeline_model_name)
This code successfully saves the model to the filesystem, but the problem is that I can't load the pipeline back to use it on other data. When I try to load the model with the following code, I get this error:
pipelineModel = PipelineModel.load(pipeline_model_name)
Traceback (most recent call last):
File "/app/spark/load_recommendation_model.py", line 12, in <module>
sa.load_model(pipeline_model_name, recommendation_model_name, user_id)
File "/app/spark/sparkapp.py", line 142, in load_model
pipelineModel = PipelineModel.load(pipeline_model_name)
File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 311, in load
File "/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 240, in load
File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 497, in loadMetadata
File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1379, in first
ValueError: RDD is empty
What is the problem? How can I solve this?
Upvotes: 6
Views: 2520
Reputation: 31
I had the same issue years later. What Hossein said helps explain the root cause, but in case you are not sure how to address it, here is what worked for me.
Make sure to store the model at an external file path that all nodes can reach, such as an S3 path.
from pyspark.ml import PipelineModel
model_path = "s3://my-bucket/my_project/model"
pipelineModel.save(model_path)
Then you can move on and load the model later using the same path.
from pyspark.ml import PipelineModel
model_path = "s3://my-bucket/my_project/model"
pipelineModel = PipelineModel.load(model_path)
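Once loaded, the pipeline model can be applied directly to new data. Here is a minimal sketch, assuming a DataFrame called newReviews (a hypothetical name) with the same reviewerID and asin columns as the training data:
from pyspark.ml import PipelineModel

loadedModel = PipelineModel.load(model_path)
# newReviews is a hypothetical DataFrame with the same input columns used to fit the pipeline
indexed = loadedModel.transform(newReviews)
indexed.show(5)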
Saving the model locally and then pushing the folder to S3 doesn't preserve all the data required to load the model later, so I couldn't save locally and then load using the S3 path.
PS: The reverse is also true. Downloading the folder onto e.g. an Airflow worker and loading it locally doesn't work, but loading directly via the S3 path does.
Upvotes: 0
Reputation: 76
I had the same issue. The problem was that I was running Spark on a cluster of nodes, but I wasn't using a shared file system to save my models. As a result, saving the trained model wrote the model's data onto the Spark workers that held that data in memory. When I later wanted to load the model, I used the same path I had used when saving. In this situation, the Spark master looks for the model at the specified path on its own local filesystem, but the data there is incomplete, so it reports that the RDD (the data) is empty. If you look at the directory of the saved model, you will see only the _SUCCESS files, while the part-0000 files that are also needed to load the model are missing.
Using shared file systems like HDFS will fix the problem.
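As an illustration, here is a minimal sketch of saving to and loading from a shared location instead of a node-local path. The HDFS host and model path are hypothetical placeholders:
from pyspark.ml import PipelineModel

# Hypothetical shared path; every worker and the driver can reach it
shared_model_path = "hdfs://namenode:8020/models/recommendation_pipeline"

# Save the fitted pipeline to the shared location
pipelineModel.write().overwrite().save(shared_model_path)

# Load it later, e.g. from another job, using the same shared path
loadedModel = PipelineModel.load(shared_model_path)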
Upvotes: 5