Reputation: 1963
I am building a data processing pipeline. The data is quite large: a data frame representing sensor data sampled at a high frequency. During the pipeline I produce an intermediate result, a transformation of the data that is needed by several subsequent transformations. Using Dask, I found that this intermediate transformation gets re-computed in each subsequent transformation.
Is there a way to persist the intermediate result on disk? I am aware of .persist(), but that keeps the result in memory, which is not an option given the size of my data.
Upvotes: 2
Views: 830
Reputation: 16551
There are several solutions, in varying degrees of complexity, including:
Roll your own caching logic: compute the result the first time, save it to disk, and load it from there on later runs, for example:

import os

def compute_and_persist(file_path):
    # Reuse the result if it has already been saved to disk
    if os.path.isfile(file_path):
        return load_file(file_path)
    # Otherwise compute it once and persist it for later runs
    # (load_file, some_computation and save_object are placeholders
    # for your own I/O and computation)
    obj = some_computation()
    save_object(object=obj, path=file_path)
    return obj
Such a function could be turned into a decorator (a minimal sketch is included at the end of this answer), but as you can see it requires some coding. This is a poor man's version of what graphchain does under the hood.
Use the third-party package graphchain: see this answer for a minimal example; a rough sketch also follows below.
Use the third-party package prefect: it has support for creating persisted copies of task results, see the docs; a rough sketch also follows below.
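A minimal sketch of the decorator version of the caching helper above. The cache_to_disk name is hypothetical and pickle is just one possible serialization format, chosen here for illustration:

import functools
import os
import pickle

def cache_to_disk(file_path):
    """Cache a function's return value on disk as a pickle file (illustrative helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Load the previously saved result if it exists
            if os.path.isfile(file_path):
                with open(file_path, "rb") as f:
                    return pickle.load(f)
            # Otherwise compute it once and persist it for later runs
            result = func(*args, **kwargs)
            with open(file_path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@cache_to_disk("intermediate_result.pkl")
def intermediate_transformation():
    ...  # the expensive computation goes here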
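For reference, a rough graphchain sketch. It assumes graphchain.get can be used as a drop-in replacement for dask.get that caches each task's output on disk; check the graphchain README for the exact API and for how to configure the cache location:

import graphchain

def expensive_transformation(x):
    # Stand-in for the costly intermediate step
    return x * 2

def downstream(a, b):
    return a + b

# A plain dask graph; graphchain caches each task's output on disk, so
# expensive_transformation is only re-run when its inputs or code change.
dsk = {
    "raw": 21,
    "intermediate": (expensive_transformation, "raw"),
    "result": (downstream, "intermediate", "intermediate"),
}

# Assumed to mirror dask.get(dsk, keys), with on-disk caching added
result = graphchain.get(dsk, "result")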
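Finally, a rough Prefect sketch, assuming the Prefect 1.x-style API in which a task persists its return value through a Result such as LocalResult; newer Prefect releases expose result persistence differently, so treat the names here as assumptions and consult the docs:

from prefect import Flow, task
from prefect.engine.results import LocalResult

# checkpoint=True asks Prefect to persist the task's return value via the
# configured result handler (here: files under ./cache); checkpointing also
# needs to be enabled in the Prefect configuration for local runs.
@task(checkpoint=True, result=LocalResult(dir="./cache"))
def intermediate_transformation(data):
    return data  # stand-in for the expensive transformation

@task
def downstream(intermediate):
    return intermediate  # further transformations would go here

with Flow("sensor-pipeline") as flow:
    inter = intermediate_transformation([1, 2, 3])
    downstream(inter)

flow.run()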
Upvotes: 3