morganics

Reputation: 1249

Dask - Reading partitions in order using itertuples

I'm using Dask to read in a table with around 14 million rows using read_sql_table. When I iterate over the dataframe using itertuples, the index (which is ordered in the table) is not read out sequentially for one or two partitions. How can I enforce this? The row_id is generated by row_number() on the view and is used as the index when generating the dataframe. I know Pandas has a sorted=True arg; is there anything similar?
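For reference, the setup looks roughly like this (the view name and connection URI are placeholders):

import dask.dataframe as dd

# "my_view" is a hypothetical view exposing row_number() AS row_id
ddf = dd.read_sql_table(
    "my_view",
    "postgresql://user:pass@host/db",  # placeholder connection URI
    index_col="row_id",
)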

This is what happens at the moment while reading the data in (the number of rows read should match the current index):

INFO - Read 11870000 Rows (index: 11870000)
INFO - Read 11880000 Rows (index: 11880000)
INFO - Read 11890000 Rows (index: 11890000)
INFO - Read 11900000 Rows (index: 11900000)
INFO - Read 11910000 Rows (index: 12159912)  <-- index jumps ahead here
INFO - Read 11920000 Rows (index: 12169912)
INFO - Read 11930000 Rows (index: 12179912)
INFO - Read 11940000 Rows (index: 12189912)

All is good until the 11,900,000th row; at that point the read switches to the wrong partition.
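Checking whether Dask even knows the partition boundaries may be relevant here; both attributes below are standard on a Dask dataframe:

# True only if Dask knows the index boundaries of every partition
print(ddf.known_divisions)
# The boundary values themselves (a tuple of Nones when unknown)
print(ddf.divisions)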

Upvotes: 1

Views: 1592

Answers (1)

morganics

Reputation: 1249

This might be an answer to the issue (which is perhaps rare), but the software that reads the stream requires a monotonically increasing index. I can only assume that the multiple calls to the DB are being resolved at different speeds, so another option might be to use the single-threaded scheduler on the compute call after read_sql_table.
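For the scheduler route, a minimal sketch (the scheduler keyword is standard Dask):

# Run all partition reads on one thread, one task at a time
df = ddf.compute(scheduler="single-threaded")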

First of all, I get the first index in each partition;

def _order_partitions(self, ddf):
    # Map each partition number to the first index value in that partition
    ordering = {}
    for partition in range(ddf.npartitions):
        ordering[partition] = int(ddf.get_partition(partition).head(1).index[0])

    # Partition numbers, sorted by their first index value
    return sorted(ordering, key=ordering.get)
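Wiring it up is one line (a sketch; it assumes the dataframe is in scope as ddf):

self._ordered_partitions = self._order_partitions(ddf)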

With the result stored in self._ordered_partitions, I then recreate Dask's itertuples call (which is quite simple);

def _generator(self):
    # Visit the partitions in index order rather than Dask's stored order
    for i in range(self._ddf.npartitions):
        ordered_partition = self._ordered_partitions[i]
        df = self._ddf.get_partition(ordered_partition).compute()
        for row in df.itertuples():
            yield row

The only change is the addition of the ordered_partition lookup. I've not fully tested it yet, so I'll mark this as the answer once I'm happy with it.
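As a quick sanity check while testing, the stream can be asserted to be monotonically increasing (row.Index is the row_id here, since it is the dataframe index):

prev = -1
for row in self._generator():
    # the consumer requires a strictly increasing index
    assert row.Index > prev, f"index went backwards at {row.Index}"
    prev = row.Index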

Upvotes: 1
