Huga

Reputation: 571

Saving an MLlib decision tree model to HDFS

I have an MLlib decision tree model trained on a set of data. I want to be able to save the trained model and load it whenever necessary, e.g. train once on a million-row data set and save the model for future use. I found that I can save and load a linear model using FileInputStream, FileOutputStream, ObjectInputStream and ObjectOutputStream, because its constructors were made public, as shown below.

You can save the model to disk as follows:

import java.io.FileOutputStream
import java.io.ObjectOutputStream

// serialize the trained model to a local file
val fos = new FileOutputStream("e:/model.obj")
val oos = new ObjectOutputStream(fos)
oos.writeObject(model)
oos.close()

and load it back in:

import java.io.FileInputStream
import java.io.ObjectInputStream

// deserialize the model from the local file
val fis = new FileInputStream("e:/model.obj")
val ois = new ObjectInputStream(fis)
val newModel = ois.readObject().asInstanceOf[org.apache.spark.mllib.classification.LogisticRegressionModel]
ois.close()

The above works syntactically for DecisionTree as well, but I cannot call newModel.predict() on the result, apparently because the DecisionTree model constructors were not made public.

Does anyone know how I can save and load models like DecisionTree, RandomForest, SVM, etc.?

Upvotes: 0

Views: 1051

Answers (1)

Reactormonk

Reputation: 21700

You can call the .save method on the model to store it as a Parquet file, and load it back via .load on the companion object. Saving as Parquet should also be faster than plain Java serialization, which is often slow.

See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.util.Saveable
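For a decision tree that looks roughly like the sketch below; the HDFS path is just a placeholder, and sc and model are assumed to be your existing SparkContext and trained DecisionTreeModel:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// persist the model as Parquet under the given path (path is illustrative)
model.save(sc, "hdfs://namenode:8020/models/myDecisionTree")

// later, reload it through the companion object and use it as usual
val sameModel = DecisionTreeModel.load(sc, "hdfs://namenode:8020/models/myDecisionTree")
// sameModel.predict(someFeatureVector)

RandomForestModel and SVMModel expose the same save/load pattern through their companion objects.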

Upvotes: 2
