Reputation: 309
I'm new to Pandas and Dask. Dask dataframes wrap pandas dataframes and share most of the same function calls.
I'm using Dask to sort (set_index) a largish csv file of ~1,000,000 rows and ~100 columns.
Once it's sorted, I use itertuples() to grab each dataframe row and compare it with a row from a database that also has ~1,000,000 rows and ~100 columns.
But it's running slowly (it takes around 8 hours). Is there a faster way to do this?
I used Dask because it can sort very large csv files and has a flexible csv parsing engine. It'll also let me run more advanced operations on the dataset and parse more data formats in the future.
I could presort the csv, but I want to see if Dask can be fast enough for my use case; it would make things a lot more hands-off in the long run.
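For reference, this is roughly what the current workflow looks like (a minimal sketch; the file name, index column, and the compare_with_database helper are placeholders):

```python
import dask.dataframe as dd

# Read the ~1,000,000 row csv and sort it by a key column.
df = dd.read_csv("input.csv")      # flexible csv parsing
df = df.set_index("key_column")    # Dask shuffles/sorts the data to build the index

# Walk the sorted frame row by row; each row is pulled back to the client one at a time.
for row in df.itertuples():
    compare_with_database(row)     # placeholder for the per-row database comparison
```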
Upvotes: 0
Views: 595
Reputation: 28684
By using itertuples, you are bringing each row back to the client, one by one. Please read up on map_partitions or map to see how you can apply a function to rows or blocks of the dataframe without pulling data to the client. Note that each worker should write to a different file, since they operate in parallel.
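A minimal sketch of what that could look like, reusing the placeholder names from the question (compare_with_database stands in for the real per-row comparison):

```python
import dask.dataframe as dd
import pandas as pd

def compare_partition(pdf):
    # pdf is an ordinary pandas DataFrame holding one partition, so the
    # row-by-row comparison runs on the worker instead of on the client.
    results = []
    for row in pdf.itertuples():
        results.append({"key": row.Index,
                        "matched": compare_with_database(row)})  # placeholder helper
    return pd.DataFrame(results)

df = dd.read_csv("input.csv").set_index("key_column")
out = df.map_partitions(compare_partition,
                        meta={"key": "object", "matched": "bool"})

# The "*" in the path is replaced with the partition number, so each
# partition is written to its own file and parallel workers never collide.
out.to_csv("results-*.csv")
```

Here the comparison happens inside each partition, and the writing is left to Dask's own to_csv, which already produces one output file per partition.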
Upvotes: 1