Nastasia

Reputation: 667

I can't manage to save (serialize) a scikit-learn model as a zip file with MLeap in Python.

I tried this:

# Generate data
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df["y"] = (df['a'] > 0.5).astype(int)
df.head()

from mleap.sklearn.ensemble.forest import RandomForestClassifier

forestModel = RandomForestClassifier()
forestModel.mlinit(input_features='a',
                   feature_names='a',
                   prediction_column='e_binary')

forestModel.fit(df[['a']], df[['y']])

forestModel.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleaptestmodelforestpysparkzip", "randomforest.zip")

I got this error:

No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleaptestmodelforestpysparkzip/randomforest.zip.node'

I also tried this:

forestModel.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleaptestmodelforestpysparkzip/randomforest.zip")

This time I got an error saying that the "model_name" argument is missing.

Could you help me please?


Here is everything I tried and the result of each attempt:

Pipeline to zip:

1.

pipeline.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest")

=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/model.json'

2.

pipeline.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)

=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'

3.

pipeline.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True) and creation of "/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest"

=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'

4.

pipeline.serialize_to_bundle("/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)

=> FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'

5.

pipeline.serialize_to_bundle("/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)

=> OSError: [Errno 95] Operation not supported (but it does write something)

6.

pipeline.serialize_to_bundle("jar:dbfs:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip", model_name="forest", init=True)

=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:dbfs:/dbfs/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'

7.

pipeline.serialize_to_bundle("jar:dbfs:/FileStore/tables/lifttruck_mleap/pipeline_zip2/1/model.zip", model_name="forest", init=True)

=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:dbfs:/FileStore/tables/mleap/pipeline_zip/1/model.zip/forest'

8.

pipeline.serialize_to_bundle("dbfs:/FileStore/tables/lifttruck_mleap/pipeline_zip2/1/model.zip", model_name="forest", init=True)

=> FileNotFoundError: [Errno 2] No such file or directory: 'dbfs:/FileStore/tables/mleap/pipeline_zip2/1/model.zip/forest'


Model to zip:

1.

forest.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1/model.zip", model_name="forest")

=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1/model.zip/forest.node'

2.

forest.serialize_to_bundle("jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1", model_name="model.zip")

=> FileNotFoundError: [Errno 2] No such file or directory: 'jar:file:/dbfs/FileStore/tables/mleap/random_forest_zip/1/model.zip.node'

3.

forest.serialize_to_bundle("/dbfs/FileStore/tables/mleap/random_forest_zip/1", model_name="model.zip")

=> Does not save a zip; it saves a bundle directory instead.

Upvotes: 1

Views: 365

Answers (1)

Nastasia

Reputation: 667

I found the problem and a workaround.

Random writes are no longer supported by the Databricks local file APIs on DBFS, as explained here: https://docs.databricks.com/data/databricks-file-system.html?_ga=2.197884399.1151871582.1592826411-509486897.1589442523#local-file-apis
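
For illustration, here is a minimal way to reproduce the limitation (a sketch assuming a Databricks cluster with the /dbfs FUSE mount; the path is a placeholder and exact behavior can vary by runtime version). Creating a zip archive typically involves seeking back into the file, which is exactly the kind of random write DBFS rejects:

# Hypothetical demonstration: sequential writes to a /dbfs path work,
# but seeking back and overwriting raises OSError: [Errno 95].
with open("/dbfs/FileStore/tables/mleap/random_write_test.bin", "wb") as f:
    f.write(b"hello")   # sequential write: fine
    f.seek(0)           # jump back to the start of the file
    f.write(b"H")       # random write: fails on the DBFS FUSE mount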

A workaround is to write the zip file to the local filesystem first and then copy it into DBFS:

  1. Serialize your model in a Pipeline with "init=True", saving it to a local directory
  2. Copy it to your data lake with "dbutils.fs.cp(source, destination)" (see the sketch below)
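
For reference, a minimal sketch of the two steps, assuming a Databricks notebook (where dbutils is available) and an already-initialized MLeap scikit-learn pipeline; the directory names are placeholders, and the exact bundle layout under the local directory may depend on your MLeap version:

import os

# 1. Serialize to the driver's local disk, where random writes are allowed.
local_dir = "/tmp/mleap_models"   # local filesystem (placeholder)
os.makedirs(local_dir, exist_ok=True)
pipeline.serialize_to_bundle(local_dir, model_name="forest", init=True)

# 2. Copy the result into DBFS. The "file:" prefix points dbutils at the
#    local filesystem; recurse=True copies the whole bundle tree.
dbutils.fs.cp("file:" + local_dir,
              "dbfs:/FileStore/tables/mleap/forest",  # placeholder destination
              recurse=True)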

Upvotes: 0
