Reputation: 9869
Is this a valid way of loading subsets of a dask dataframe to memory:
    i = 0
    while i < len_df:
        j = i + batch_size
        if j > len_df:
            j = len_df
        # note: .loc slices by label and includes both endpoints
        subset = df.loc[i:j, 'source_country_codes'].compute()
        i = j  # advance to the next batch
I read somewhere that this may not be correct because of how dask assigns index values when it divides the larger dataframe into smaller pandas dataframes. Also, I don't think dask dataframes have an iloc attribute.
I am using dask version 0.15.2.
In terms of use cases, this would be a way of feeding batches of data to a deep learning library (say, Keras).
Upvotes: 4
Views: 1342
Reputation: 57271
If your dataset has well-known divisions then this might work, but instead I recommend just computing one partition at a time:
    for part in df.to_delayed():
        subset = part.compute()
You can roughly control the size by repartitioning beforehand:
    for part in df.repartition(npartitions=100).to_delayed():
        subset = part.compute()
This isn't exactly the same, because it doesn't guarantee a fixed number of rows in each partition, but that guarantee might be quite expensive, depending on how the data is obtained.
Upvotes: 3