Lefty

Reputation: 405

Caching preprocessed data for ML in Spark/PySpark

I would like an ML pipeline like this:

  1. raw_data = spark.read....()
  2. data = time_consuming_data_transformation(raw_data, preprocessing_params)
  3. model = fit_model(data)
  4. evaluate(model, data)

Can I somehow cache/persist the data after step 2, so that the transformation doesn't have to run again when I re-run my Spark app? Ideally, I would like the cache to be automatically invalidated when the original data or transformation code (computing graph, preprocessing_params) change.

Upvotes: 0

Views: 409

Answers (1)

user9678477

Reputation: 21

Can I somehow cache/persist the data after step 2, so that the transformation doesn't have to run again when I re-run my Spark app?

You can of course:

data = time_consuming_data_transformation(raw_data, preprocessing_params).cache()
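
If the transformed data doesn't fit in memory, you can also pick an explicit storage level; here is a sketch that spills partitions to local disk instead of recomputing them:

from pyspark import StorageLevel

# Keep partitions in memory, spill to local disk when memory runs out:
data = time_consuming_data_transformation(raw_data, preprocessing_params) \
    .persist(StorageLevel.MEMORY_AND_DISK)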

but either way the cached data only lives for the lifetime of the Spark application, and if your data is non-static, it is always better to write the data to persistent storage:

time_consuming_data_transformation(raw_data, preprocessing_params).write.save(...)
data = spark.read.load(...)

It is more expensive than cache, but it prevents hard-to-detect inconsistencies when the data changes.
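
For example, a minimal sketch assuming Parquet and a made-up output path:

time_consuming_data_transformation(raw_data, preprocessing_params) \
    .write.mode("overwrite").parquet("/data/preprocessed")

data = spark.read.parquet("/data/preprocessed")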

Ideally, I would like the cache to be automatically invalidated when the original data

No. Unless it is a streaming program (and learning on streams is not trivial), Spark doesn't monitor the source for changes.

or transformation code (computing graph, preprocessing_params) change.

It is not clear to me how these things change, but it is probably not something that Spark will solve for you. You might need some event-driven or reactive components.
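
If invalidation only needs to track preprocessing_params, one manual sketch (the helper and path names are made up, and nothing here is built into Spark) is to derive the output path from a hash of the parameters:

import hashlib
import json

from pyspark.sql.utils import AnalysisException

def cached_transformation(spark, raw_data, preprocessing_params,
                          base_path="/data/preprocessed"):
    # Hash the parameters so a changed parameter set maps to a new path.
    key = hashlib.sha1(
        json.dumps(preprocessing_params, sort_keys=True).encode()
    ).hexdigest()
    path = "{}/{}".format(base_path, key)
    try:
        # Reuse the materialized result from an earlier run.
        return spark.read.parquet(path)
    except AnalysisException:
        # First run with these parameters: compute, save, reload.
        time_consuming_data_transformation(raw_data, preprocessing_params) \
            .write.parquet(path)
        return spark.read.parquet(path)

A change in the transformation code itself would still have to be folded into the key by hand, e.g. as a version string.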

Upvotes: 2
