Reputation: 10820
I have python code for data analysis that iterates through hundreds of datasets, does some computation and produces a result as a pandas DataFrame, and then concatenates all the results together. I am currently working with a set of data where these results are too large to fit into memory, so I'm trying to switch from pandas to Dask.
The problem is that I have looked through the Dask documentation and done some Googling, and I can't figure out how to build a Dask DataFrame iteratively, as described above, in a way that takes advantage of Dask's ability to keep only portions of the DataFrame in memory. Everything I find assumes that you either already have all the data stored in some format on disk, or that you have all the data in memory and want to save it to disk.
What's the best way to approach this? My current code using pandas looks something like this:
import pandas as pd

def process_data(data) -> pd.DataFrame:
    # Do stuff to produce a DataFrame from `data`
    return df

dfs = []
for data in datasets:
    result = process_data(data)
    dfs.append(result)

final_result = pd.concat(dfs)
final_result.to_csv("result.csv")
Upvotes: 0
Views: 198
Reputation: 667
Expanding on @MichelDelgado's comment, the correct approach should be something like this:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

delayed_dfs = []
for data in datasets:
    # delayed() wraps the call so it only runs when Dask computes the result
    result = delayed(process_data)(data)
    delayed_dfs.append(result)

ddf = dd.from_delayed(delayed_dfs)
ddf.to_csv('export-*.csv')
Note that this creates multiple CSV files, one per input partition.
You can find documentation here: https://docs.dask.org/en/stable/delayed-collections.html.
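If you'd rather end up with a single output file, as in the original pandas code, recent Dask releases also accept a single_file=True argument to to_csv, which writes the partitions sequentially into one file. A minimal sketch, reusing the ddf from above:

ddf.to_csv('result.csv', single_file=True)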
Also, be careful to actually read the data inside the process_data function: the data argument in the code above should only be an identifier, like a file path or equivalent, so that the heavy loading happens lazily inside each delayed task rather than up front.
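For example, here is a minimal sketch where each dataset is identified by a file path; the paths and the pd.read_csv call are just placeholders for whatever loading step applies in your case:

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_data(path: str) -> pd.DataFrame:
    # The data is only read here, inside the delayed task, so nothing large
    # is loaded until Dask actually executes this partition.
    df = pd.read_csv(path)
    # Do stuff with df
    return df

dataset_paths = ["data/part-001.csv", "data/part-002.csv"]  # hypothetical paths
ddf = dd.from_delayed([delayed(process_data)(p) for p in dataset_paths])
ddf.to_csv("export-*.csv")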
Upvotes: 1