Reputation: 411
Is it possible for dask to load a single row into memory at a time? I have a huge 200GB dataset and I would like dask to retrieve one row at a time given an index. Then I would like to get the numpy array from the row. When I try to call:
df_row = df.loc[index]
df_row = df_row.values.compute()
Dask tries to load the entire df into memory instead of just the single row. If I do not call compute and only call values, then df_row remains a dask.array object. This seems like it must have an obvious solution, as it is such a common and simple use case. What am I doing wrong?
Upvotes: 2
Views: 2717
Reputation: 28683
Dask will avoid loading all rows only when it can know the start and end values of the index in each partition (called "divisions") without loading the data, and when those divisions form a monotonic progression.
For instance, the Parquet format usually stores each column's max/min values in its metadata, so if the data was reasonably sorted, .loc[]
will indeed only load the one partition containing the requested row.
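As a rough sketch of that case, assuming the dataset had been written to Parquet with a sorted index (the path and the index label below are placeholders, and depending on your dask version you may need to ask read_parquet to compute divisions from the file statistics, e.g. calculate_divisions=True in recent releases):

    import dask.dataframe as dd

    # Hypothetical Parquet dataset written with a sorted index column.
    df = dd.read_parquet("data.parquet")   # path is an assumption

    # When per-partition index min/max are available from the metadata,
    # the divisions are known without reading any row data:
    print(df.known_divisions)              # True if divisions were inferred
    print(df.divisions)                    # tuple of partition boundary values

    # .loc can then route the lookup to the single partition whose range
    # contains the requested label, loading only that partition.
    row = df.loc[42].compute()             # 42 is an illustrative index label
    arr = row.values                       # numpy array for that one row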
However, with data formats such as CSV, it is not possible to know whether a given partition contains an index value matching the request without parsing and considering all of the data.
You may be interested in repartitioning, explicitly setting an index on your data, or, if you know them independently, providing the values of the divisions before you try your .loc
operation.
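A minimal sketch of those options, assuming the data currently lives in CSV files and has an integer column "id" to look rows up by (the file pattern, column name, and division values are all illustrative):

    import dask.dataframe as dd

    # With CSV the divisions are unknown, so .loc would have to scan every partition.
    df = dd.read_csv("data-*.csv")         # file pattern is an assumption

    # Setting the index shuffles the data once; afterwards the divisions
    # are known and .loc becomes a single-partition read.
    df = df.set_index("id")                # "id" is a hypothetical column name

    # If the data is already sorted by "id" and you know the partition
    # boundaries independently, supply them and skip the shuffle:
    # df = df.set_index("id", sorted=True,
    #                   divisions=[0, 1_000_000, 2_000_000, 3_000_000])

    row = df.loc[1_234_567].compute()      # now reads only the relevant partition
    arr = row.values                       # numpy array for that row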
Upvotes: 1