Reputation: 2785
We have a machine learning model that looks roughly like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import StringIndexer

sc = SparkContext(appName="MLModel")
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(data_res_promo)
# data_res_promo comes from a pandas dataframe
indexer = StringIndexer(inputCol="Fecha_Code", outputCol="Fecha_Index")
train_indexer = indexer.fit(df)
train_indexer.save('ALSIndexer')  # saves the fitted indexer model
On my machine, when I run it locally, it generates a folder ALSIndexer/ with the parquet files and all the information on the model.
When I run it on our Spark cluster on Azure, it does not generate the folder on the master node (nor on the workers). However, if we try to write it again, it fails with:
cannot overwrite folder
So it exists somewhere, but we can't find it.
Would you have any pointers?
Upvotes: 0
Views: 526
Reputation: 28392
Spark saves files to the distributed filesystem by default (probably HDFS in your case). The files are therefore not visible on the nodes' local disks, but since they do exist, you get the "cannot overwrite folder" error message.
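As a minimal sketch (the hdfs:/// path below is illustrative, not taken from your setup), you can make the destination explicit by passing a full URI to save; a bare relative path such as 'ALSIndexer' resolves against the cluster's default filesystem, not the local disk:
# Illustrative path: adjust to your own HDFS home directory.
# A bare relative path like 'ALSIndexer' resolves against the default
# filesystem (HDFS on the cluster), which is why the folder seems invisible.
train_indexer.save('hdfs:///user/myuser/ALSIndexer')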
You can access the files through HDFS and copy them to the master node. From the command line this can be done with either of these commands:
1. hadoop fs -get <HDFS file path> <Local system directory path>
2. hadoop fs -copyToLocal <HDFS file path> <Local system directory path>
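For example, for the ALSIndexer folder from the question (assuming it landed in your HDFS home directory; the local path is illustrative):
hadoop fs -get ALSIndexer /home/myuser/ALSIndexer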
It can also be done programmatically by importing org.apache.hadoop.fs.FileSystem
and using the methods available there.
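Since your code is in PySpark, here is a minimal sketch of that programmatic route through the JVM gateway (sc._jvm is an internal PySpark handle, and both paths below are illustrative):
# Access the Hadoop FileSystem API through PySpark's JVM gateway.
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
# Copy the saved folder from HDFS to the driver's local disk.
fs.copyToLocalFile(hadoop.Path('ALSIndexer'), hadoop.Path('/tmp/ALSIndexer'))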
Upvotes: 1