dan

Reputation: 303

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of Dask dataframes. At the moment I retrieve each partition and then take the size of its index, which is very, very slow. Is there a better way?

Here's a simplified snippet of my code:

   import dask.dataframe as dd
   from dask.distributed import wait as dask_wait  # dask_client is an existing distributed.Client

   temp_dd = dd.read_parquet(read_str, gather_statistics=False)
   temp_dd = dask_client.scatter(temp_dd, broadcast=True)
   dask_wait([temp_dd])
   temp_dd = dask_client.gather(temp_dd)

   while row_batch <= max_row:
       row_batch_dd = temp_dd.get_partition(row_batch)
       row_batch_dd = row_batch_dd.dropna()
       row_batch_dd_len = row_batch_dd.index.size  # <-- this is the current way I'm determining the length
       row_batch = row_batch + 1

I note that, although I'm reading from Parquet, I can't simply use the Parquet metadata (which is very fast to read), because after reading I do some partition-by-partition processing and then drop the NaNs. It's the post-processed length of each partition that I'm after.

Upvotes: 4

Views: 3350

Answers (1)

MRocklin

Reputation: 57301

import dask.dataframe as dd

df = dd.read_parquet(fn, gather_statistics=False)
df = df.dropna()
df.map_partitions(len).compute()  # one row count per partition, in a single computation
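
Here `map_partitions(len)` applies `len` to every pandas partition, so all the per-partition counts come back from one task graph rather than one round trip per partition. A minimal usage sketch, assuming a hypothetical Parquet path standing in for the real dataset:

   import dask.dataframe as dd

   fn = "data.parquet"  # hypothetical path, used only for illustration
   df = dd.read_parquet(fn, gather_statistics=False)

   # lengths is a pandas Series with one entry per partition, in partition order
   lengths = df.dropna().map_partitions(len).compute()

   print(lengths.tolist())  # e.g. [1042, 998, ...] -- rows remaining per partition after dropna
   print(lengths.sum())     # total rows remaining across all partitions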

Upvotes: 3
