Reputation: 303
I'd like to get the length of each partition in a number of Dask dataframes. At the moment I fetch each partition and take the size of its index, which is very, very slow. Is there a better way?
Here's a simplified snippet of my code:
import dask.dataframe as dd
from dask.distributed import wait as dask_wait
# dask_client is an existing dask.distributed.Client

temp_dd = dd.read_parquet(read_str, gather_statistics=False)
temp_dd = dask_client.scatter(temp_dd, broadcast=True)
dask_wait([temp_dd])
temp_dd = dask_client.gather(temp_dd)

row_batch = 0
max_row = temp_dd.npartitions - 1  # index of the last partition
while row_batch <= max_row:
    row_batch_dd = temp_dd.get_partition(row_batch)
    row_batch_dd = row_batch_dd.dropna()
    row_batch_dd_len = row_batch_dd.index.size  # <-- this is how I currently determine the length
    row_batch = row_batch + 1
Note that, although I'm reading from Parquet, I can't simply use the Parquet metadata (which would be very fast), because after reading I do some partition-by-partition processing and then drop the NaNs. It's the post-processed length of each partition that I want.
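For reference, this is roughly what the fast metadata path looks like (a minimal sketch using pyarrow, with a hypothetical single-file path); it only reports the raw row counts stored in the file footer, not the post-dropna lengths, which is why it doesn't help here:

import pyarrow.parquet as pq

# "data.parquet" is a placeholder path, not the actual read_str from my code
meta = pq.ParquetFile("data.parquet").metadata
raw_counts = [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]
print(raw_counts)  # rows per row group, before any dropna()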
Upvotes: 4
Views: 3350
Reputation: 57301
import dask.dataframe as dd

df = dd.read_parquet(fn, gather_statistics=False)
df = df.dropna()
# one len() task per partition; compute() returns the per-partition row counts
df.map_partitions(len).compute()
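For example (a sketch assuming fn as above, with dropna() standing in for the real partition-by-partition processing from the question), the computed result is a pandas Series with one row count per partition, so no get_partition loop is needed:

import dask.dataframe as dd

df = dd.read_parquet(fn, gather_statistics=False)
df = df.dropna()
lengths = df.map_partitions(len).compute()  # pandas Series: one row count per partition

for partition_index, n_rows in lengths.items():
    print(f"partition {partition_index}: {n_rows} rows after dropna()")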
Upvotes: 3