Reputation: 309
I'm new to Pandas and Dask. Dask dataframes wrap pandas dataframes and share most of the same function calls.
I'm using Dask to sort (set_index) a largish csv file of ~1,000,000 rows and ~100 columns.
Once it's sorted, I use itertuples() to grab each dataframe row and compare it with a row from a database that also has ~1,000,000 rows and ~100 columns.
But it's running slowly (it takes around 8 hours). Is there a faster way to do this?
I used Dask because it can sort very large csv files and has a flexible csv parsing engine. It'll also let me run more advanced operations on the dataset and parse more data formats in the future.
I could presort the csv, but I want to see if Dask can be fast enough for my use case; it would make things a lot more hands-off in the long run.
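For reference, this is roughly what the current workflow looks like (a minimal sketch; the file name, index column, and the compare_with_database helper are placeholders):

```python
import dask.dataframe as dd

# Read the ~1,000,000 row csv and sort it by a key column.
df = dd.read_csv("input.csv")      # flexible csv parsing
df = df.set_index("key_column")    # Dask shuffles/sorts the data to build the index

# Walk the sorted frame row by row; each row is pulled back to the client one at a time.
for row in df.itertuples():
    compare_with_database(row)     # placeholder for the per-row database comparison
```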
Upvotes: 0
Views: 595
Reputation: 28684
By using itertuples, you are bringing each row back to the client, one by one. Please read up on map_partitions or map to see how you can apply a function to rows or blocks of the dataframe without pulling data to the client. Note that each worker should write to a different file, since they operate in parallel.
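A minimal sketch of what that could look like, reusing the placeholder names from the question (compare_with_database stands in for the real per-row comparison):

```python
import dask.dataframe as dd
import pandas as pd

def compare_partition(pdf):
    # pdf is an ordinary pandas DataFrame holding one partition, so the
    # row-by-row comparison runs on the worker instead of on the client.
    results = []
    for row in pdf.itertuples():
        results.append({"key": row.Index,
                        "matched": compare_with_database(row)})  # placeholder helper
    return pd.DataFrame(results)

df = dd.read_csv("input.csv").set_index("key_column")
out = df.map_partitions(compare_partition,
                        meta={"key": "object", "matched": "bool"})

# The "*" in the path is replaced with the partition number, so each
# partition is written to its own file and parallel workers never collide.
out.to_csv("results-*.csv")
```

Here the comparison happens inside each partition, and the writing is left to Dask's own to_csv, which already produces one output file per partition.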
Upvotes: 1