Michael

Reputation: 1963

Dask: Keep intermediate results on disk instead of memory

I am building a data processing pipeline. The data is quite large: a data frame representing sensor data sampled at a high frequency. Partway through the pipeline, I have an intermediate result, a transformation of the data, that is needed by several subsequent transformations. Using Dask, I found that this intermediate transformation is re-computed for each subsequent transformation.

Is there a way to persist the intermediate result on disk? I am aware of .persist(), but this keeps the result in memory, which is not an option given the size of my data.

Upvotes: 2

Views: 830

Answers (1)

SultanOrazbayev

Reputation: 16551

There are several solutions, of varying complexity, including:

  1. most basic: include explicit persisting to disk. The rough idea here would be to create an additional function that checks whether a persisted result exists: if it does, the function loads and returns it; otherwise the function computes the result, persists it to a file, and then returns it. The advantage of this approach is that no additional dependencies are introduced. Very rough pseudocode:
import pickle
from pathlib import Path

def compute_and_persist(file_path):
    path = Path(file_path)
    if path.is_file():
        # a persisted result already exists, so load and return it
        with open(path, "rb") as f:
            return pickle.load(f)
    # no persisted result yet: compute, save to disk, then return
    obj = some_computation()  # placeholder for the expensive step
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

Such a function could be turned into a decorator (see the sketch after this list), but as you can see it requires some coding. This is a poor man's version of what graphchain does under the hood.

  2. use a third-party package, graphchain: see this answer for a minimal example; a rough sketch is also shown after this list.

  3. use a third-party package, prefect: it has support for creating persisted copies of the task results, see the docs; a sketch is shown after this list.
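As mentioned in option 1, the caching function can be turned into a decorator. A minimal sketch, assuming pickle as the serialization format and a hypothetical helper name cache_to_disk (for large data frames a columnar format such as Parquet would likely be a better fit than pickle):

import functools
import pickle
from pathlib import Path

def cache_to_disk(file_path):
    # hypothetical decorator: reuse a pickled result from file_path if it
    # exists, otherwise run the function and pickle its return value
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = Path(file_path)
            if path.is_file():
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@cache_to_disk("intermediate.pkl")
def expensive_transform():
    ...  # the costly intermediate computation goes here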
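For option 2, a rough sketch along the lines of the example in graphchain's README (the function and values here are made up for illustration). graphchain hashes each task's code and inputs and caches the result to a local cache directory, so re-running the graph skips tasks whose cached result already exists:

import graphchain

def expensive_transform(x):
    # stand-in for the costly intermediate step
    return x * 2

# a plain Dask task graph: "result" uses "intermediate" twice
dsk = {
    "raw": 42,
    "intermediate": (expensive_transform, "raw"),
    "result": (sum, ["intermediate", "intermediate"]),
}

# graphchain.get is a cached drop-in for dask.get: task results are stored
# on disk and reloaded instead of recomputed on subsequent runs
result = graphchain.get(dsk, "result")
print(result)  # 168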
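For option 3, a minimal sketch assuming the Prefect 2.x API, where the persist_result flag asks Prefect to serialize a task's return value to its configured result storage on disk:

from prefect import flow, task

@task(persist_result=True)
def intermediate_transform(x):
    # Prefect writes this task's return value to result storage,
    # so later runs can reuse it instead of recomputing
    return x * 2

@task
def downstream(y):
    return y + 1

@flow
def pipeline():
    mid = intermediate_transform(21)
    return downstream(mid)

if __name__ == "__main__":
    print(pipeline())  # 43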

Upvotes: 3
