triphook

Reputation: 3087

Dask DataFrames vs numpy.memmap performance

I've developed a model which uses several large, 3-dimensional datasets on the order of (1e7, 10, 1e5), and makes millions of read (and thousands of write) calls on slices of those datasets. So far the best tool I've found for making these calls is numpy.memmap, which allows for minimal data to be held in RAM and allows for clean indexing and very fast recall of data directly on the hard drive.
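For reference, a minimal sketch of the memmap pattern described above; the file name, dtype, and (much smaller) shape are illustrative assumptions, not the actual dataset:

```python
import numpy as np

shape = (1_000, 10, 100)

# Create and initialise a small backing file once (illustrative only)
mm = np.memmap("model_data.dat", dtype=np.float32, mode="w+", shape=shape)
mm[:] = 0.0
mm.flush()

# Reopen read-only and pull slices; the OS pages in only the touched data,
# so very little is held in RAM
data = np.memmap("model_data.dat", dtype=np.float32, mode="r", shape=shape)
chunk = data[123, :, 40:60]
```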

The downside of numpy.memmap seems to be that performance is pretty uneven - the time to read a slice of the array can vary by two orders of magnitude between calls. Furthermore, I'm using Dask to parallelize many of the model functions in the script.

How is the performance of Dask DataFrames for making millions of calls to a large dataset? Would swapping out memmaps for DataFrames substantially increase processing times?

Upvotes: 1

Views: 409

Answers (1)

JulianWgs

Reputation: 1049

You would need to use Dask Array, not Dask DataFrame. The performance is generally the same as NumPy's, because NumPy does the actual computation.
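As a hedged sketch of that approach, you can wrap the existing memmap in a Dask Array with `dask.array.from_array`; the file name, shape, and chunk sizes below are assumptions for illustration:

```python
import numpy as np
import dask.array as da

shape = (1_000, 10, 100)
mm = np.memmap("model_data.dat", dtype=np.float32, mode="r", shape=shape)

# Each chunk becomes one task in the Dask graph; NumPy still performs the
# actual slicing and arithmetic on each chunk
arr = da.from_array(mm, chunks=(100, 10, 100))

# Materialise only the slice you need
result = arr[123, :, 40:60].compute()
```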

Optimizations can speed up the calculation depending on the use case.

The overhead of the scheduler decreases performance. This only matters if you split the data into many partitions, and it can usually be neglected.
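A small illustration of that trade-off, with assumed shapes and chunk sizes: fewer, larger chunks mean fewer tasks per operation (less scheduler overhead) but coarser parallelism.

```python
import numpy as np
import dask.array as da

x = np.random.random((10_000, 10, 100)).astype(np.float32)

many = da.from_array(x, chunks=(100, 10, 100))    # 100 blocks -> 100 tasks per op
few = da.from_array(x, chunks=(5_000, 10, 100))   # 2 blocks -> 2 tasks per op

print(many.numblocks, few.numblocks)  # (100, 1, 1) (2, 1, 1)
```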

Upvotes: 1
