Reputation: 10820
I have python code for data analysis that iterates through hundreds of datasets, does some computation and produces a result as a pandas DataFrame, and then concatenates all the results together. I am currently working with a set of data where these results are too large to fit into memory, so I'm trying to switch from pandas to Dask.
The problem is that I have looked through the Dask documentation and done some Googling, and I can't figure out how to build a Dask DataFrame iteratively, as described above, in a way that takes advantage of Dask's ability to keep only portions of the DataFrame in memory. Everything I find assumes that you either already have all the data stored in some format on disk, or that you have all the data in memory and want to save it to disk.
What's the best way to approach this? My current code using pandas looks something like this:
import pandas as pd

def process_data(data) -> pd.DataFrame:
    # Do stuff to produce a DataFrame from `data`
    return df

dfs = []
for data in datasets:
    result = process_data(data)
    dfs.append(result)

final_result = pd.concat(dfs)
final_result.to_csv("result.csv")
Upvotes: 0
Views: 198
Reputation: 667
Expanding on @MichelDelgado's comment, the correct approach should be something like this:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

delayed_dfs = []
for data in datasets:
    # delayed() wraps the call so it only runs when Dask computes the result
    result = delayed(process_data)(data)
    delayed_dfs.append(result)

ddf = dd.from_delayed(delayed_dfs)
ddf.to_csv('export-*.csv')
Note that this creates multiple CSV files, one per input partition.
You can find documentation here: https://docs.dask.org/en/stable/delayed-collections.html.
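If you'd rather end up with a single output file, as in the original pandas code, recent Dask releases also accept a single_file=True argument to to_csv, which writes the partitions sequentially into one file. A minimal sketch, reusing the ddf from above:

ddf.to_csv('result.csv', single_file=True)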
Also, be careful to actually read the data inside the process_data function: the data argument in the code above should only be an identifier, like a file path or equivalent, so that the heavy loading happens lazily inside each delayed task rather than up front.
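For example, here is a minimal sketch where each dataset is identified by a file path; the paths and the pd.read_csv call are just placeholders for whatever loading step applies in your case:

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_data(path: str) -> pd.DataFrame:
    # The data is only read here, inside the delayed task, so nothing large
    # is loaded until Dask actually executes this partition.
    df = pd.read_csv(path)
    # Do stuff with df
    return df

dataset_paths = ["data/part-001.csv", "data/part-002.csv"]  # hypothetical paths
ddf = dd.from_delayed([delayed(process_data)(p) for p in dataset_paths])
ddf.to_csv("export-*.csv")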
Upvotes: 1