Reputation: 405
I would like an ML pipeline like this:
raw_data = spark.read....()
data = time_consuming_data_transformation(raw_data, preprocessing_params)
model = fit_model(data)
evaluate(model, data)
Can I cache/persist data somehow after step 2, so that when I run my Spark app again, the data won't have to be transformed again? Ideally, I would like the cache to be automatically invalidated when the original data or the transformation code (computing graph, preprocessing_params) changes.
Upvotes: 0
Views: 409
Reputation: 21
"Can I cache/persist data somehow after step 2, so that when I run my Spark app again, the data won't have to be transformed again?"
You can of course:
data = time_consuming_data_transformation(raw_data, preprocessing_params).cache()
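Within a single run, this saves recomputing the transformation between the fit and evaluate steps. A minimal sketch of the full flow; calling count() to force materialization is my own addition (any action would do):

data = time_consuming_data_transformation(raw_data, preprocessing_params).cache()
data.count()  # an action materializes the cache; fit and evaluate then reuse it
model = fit_model(data)
evaluate(model, data)

Note, however, that a cached DataFrame only lives for the lifetime of the Spark application, so it will not survive re-running your app.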
And if your data is non-static, it is always better to write the data to persistent storage:
time_consuming_data_transformation(raw_data, preprocessing_params).write.save(...)
data = spark.read.load(...)
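A minimal sketch of that pattern, assuming Parquet and a hypothetical checkpoint path (any format supported by the DataFrameWriter works):

checkpoint_path = "/data/checkpoints/preprocessed.parquet"  # hypothetical path
(time_consuming_data_transformation(raw_data, preprocessing_params)
    .write
    .mode("overwrite")
    .parquet(checkpoint_path))
data = spark.read.parquet(checkpoint_path)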
It is more expensive than cache, but it prevents hard-to-detect inconsistencies when the data changes.
"Ideally, I would like the cache to be automatically invalidated when the original data [...]"
No. Unless it is a streaming program (and learning on streams is not trivial), Spark does not monitor the source for changes.
"[...] or the transformation code (computing graph, preprocessing_params) changes."
It is not clear to me how these things change, but it is probably not something that Spark will solve for you. You might need some event-driven or reactive components; a manual sketch follows below.
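One manual approach is to key the checkpoint path on a fingerprint of the inputs: a change in preprocessing_params (or in a code-version string you bump by hand when the transformation changes) then yields a new path and forces recomputation. A minimal sketch, assuming preprocessing_params is JSON-serializable; CODE_VERSION, base_path and the Parquet paths are assumptions for illustration, and invalidating on changes in the raw source data would need a similar fingerprint (e.g. over the input file listing), which is omitted here:

import hashlib
import json
from pyspark.sql.utils import AnalysisException

CODE_VERSION = "v1"  # bump manually whenever time_consuming_data_transformation changes

def checkpoint_path(base_path, preprocessing_params):
    # derive a deterministic path from the params and the code version
    key = json.dumps(preprocessing_params, sort_keys=True) + CODE_VERSION
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return "{}/preprocessed_{}".format(base_path, digest)

path = checkpoint_path("/data/checkpoints", preprocessing_params)
try:
    data = spark.read.parquet(path)   # reuse an existing checkpoint
except AnalysisException:             # no checkpoint for this fingerprint yet
    time_consuming_data_transformation(raw_data, preprocessing_params) \
        .write.parquet(path)
    data = spark.read.parquet(path)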
Upvotes: 2