Maria

Reputation: 169

Dask loading time on data sets that fit in memory

I understand that the main use of dask is for data that doesn't fit in memory, but I'm still curious.
Why the difference in time when creating a Pandas dataframe vs a Dask dataframe? (I read about the overhead, but should it be this significant?)

[Screenshots comparing the creation time of the Pandas dataframe and the Dask dataframe]
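A minimal sketch of the comparison being timed (the file name is hypothetical, since the original screenshots aren't reproduced here):

```python
import time

import pandas as pd
import dask.dataframe as dd

# Time a plain pandas load.
t0 = time.perf_counter()
pdf = pd.read_csv("data.csv")  # hypothetical file
print(f"pandas read_csv: {time.perf_counter() - t0:.2f}s")

# Time the equivalent dask load. dd.read_csv is lazy, so .compute()
# is needed to actually read and materialize the data in memory.
t0 = time.perf_counter()
ddf = dd.read_csv("data.csv").compute()
print(f"dask read_csv + compute: {time.perf_counter() - t0:.2f}s")
```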

Upvotes: 2

Views: 321

Answers (1)

mdurant

Reputation: 28683

You should not expect loading of the data frame to be any faster. At some point, the system needs to:

- stream bytes from the disk (a fixed cost)
- parse the text (this part is parallelizable)
- pass data between workers (this may involve expensive serialization and communication)
- concatenate the pieces (this uses a lot of memory, and so may be expensive if you also have a lot of workers around)

How long it takes depends heavily on the scheduler you are using, because that affects how many copies of the data are needed and how much communication takes place. You may wish to try the distributed scheduler with different mixtures of threads and processes. There is always some overhead for marshalling the tasks.
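As an illustration, a sketch of trying the different schedulers on the same load (the file pattern is hypothetical):

```python
import dask.dataframe as dd
from dask.distributed import Client

ddf = dd.read_csv("data-*.csv")  # hypothetical files; this step is lazy

# Single-machine schedulers: threads share one memory space,
# while processes must serialize data to pass it between workers.
df_threads = ddf.compute(scheduler="threads")
df_processes = ddf.compute(scheduler="processes")

# The distributed scheduler lets you choose the mixture explicitly,
# e.g. two worker processes with two threads each.
client = Client(n_workers=2, threads_per_worker=2)
df_distributed = ddf.compute()  # uses the Client registered above
client.close()
```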

The dask model is to move computation to the data, not the other way around. If you operate on the dask dataframe (filter, group, aggregate) and only call .compute() on a relatively small output, then the computation takes place in the same workers where the data is loaded, eliminating serialization and communication costs.
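For example, a sketch of this pattern, assuming hypothetical columns "group" and "value":

```python
import dask.dataframe as dd

ddf = dd.read_csv("data-*.csv")  # hypothetical files

# Filtering and aggregating happen lazily, in the workers that load
# the data; only the small aggregated result is sent back to the client.
result = ddf[ddf.value > 0].groupby("group").value.mean().compute()
print(result)
```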

Generally speaking, though, if the data fits comfortably in memory, then pandas probably does a pretty good job of being fast.

Upvotes: 1
