Reputation: 14445
I don't have a reproducible example, but can someone explain why a .head() call is so much slower after setting the index?
import dask.dataframe as dd
df = dd.read_parquet("Filepath")
df.head() # takes 10 seconds
df = df.set_index('id')
df.head() # takes 10 minutes +
Upvotes: 4
Views: 2111
Reputation: 28684
As stated in the docs, set_index sorts your data according to the new index, so that the divisions along that index split the data into its logical partitions. The sorting is what takes the extra time, but once it is done, operations that work on that index become much faster. head() on the raw file, by contrast, just fetches from the first data chunk on disk without regard for any ordering.
You can set the index without this ordering, either with the index= keyword to read_parquet (maybe the data was inherently ordered already?) or with .map_partitions(lambda df: df.set_index(..)). That raises the obvious question, though: why would you bother, and what are you trying to achieve? If the data were already sorted, you could also have used set_index(.., sorted=True), and maybe even the divisions keyword if you happen to have that information. This would not need the sort and would be correspondingly faster.
Upvotes: 5