YQ.Wang

Reputation: 1177

How to save PCA object in spark scala?

I'm doing PCA on my data and I read the guide from: https://spark.apache.org/docs/latest/mllib-dimensionality-reduction

The relevant code is as follows:

import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 1)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 1, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0)),
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0))))

// Compute the top 5 principal components.
val pca = new PCA(5).fit(data.map(_.features))

// Project vectors to the linear space spanned by the top 5 principal
// components, keeping the label
val projected = data.map(p => p.copy(features = pca.transform(p.features)))

This code performs PCA on the data. However, I can't find example code or documentation explaining how to save and load the fitted PCA object for future use. Could someone give me an example based on the above code?

Upvotes: 0

Views: 553

Answers (2)

YQ.Wang

Reputation: 1177

Here is example code based on @EmiCareOfCell44's answer, using PCA and PCAModel from org.apache.spark.ml.feature:

import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false)

// save the model
val savePath = "xxxx"
pca.save(savePath)

// load the saved model
val pca_loaded = PCAModel.load(savePath)
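
The loaded model keeps its input/output column settings, so it can be applied to new DataFrames just like the original. A quick check, reusing the df defined above:

// The restored PCAModel transforms data exactly like the model it was saved from.
val result_loaded = pca_loaded.transform(df).select("pcaFeatures")
result_loaded.show(false)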

Upvotes: 0

Emiliano Martinez

Reputation: 4133

It seems that the MLlib version of PCA does not support saving the model to disk. You can save the pc matrix of the resulting PCAModel instead. However, it is better to use the Spark ML version: it returns a Spark Estimator that can be serialized and included in a Spark ML Pipeline.
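
If you do stick with the MLlib API, here is a minimal sketch of persisting just the pc matrix, assuming the pca value fitted in the question's code; the path /tmp/pca_pc.bin is only a placeholder:

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}

// `pca` is the fitted org.apache.spark.mllib.feature.PCAModel from the question.
// Its principal-components matrix is exposed as `pca.pc` (a DenseMatrix),
// which is Serializable, so plain Java serialization is enough to persist it.
val out = new ObjectOutputStream(new FileOutputStream("/tmp/pca_pc.bin")) // placeholder path
out.writeObject(pca.pc)
out.close()

// Later: read the matrix back and project vectors onto the components manually.
val in = new ObjectInputStream(new FileInputStream("/tmp/pca_pc.bin"))
val pc = in.readObject().asInstanceOf[DenseMatrix]
in.close()

// Projecting a dense vector v amounts to multiplying pc.transpose by v,
// which mirrors what PCAModel.transform does for dense input.
val projectedVec = pc.transpose.multiply(Vectors.dense(1, 0, 0, 0, 1).toDense)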

Upvotes: 1
