rodrigo-silveira

Reputation: 13078

Iterating a Dask DataFrame

I'm trying to create a Keras Tokenizer out of a single column from hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually causes memory issues:

import dask.dataframe as dd

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

# Process the column and get the underlying NumPy array.
# Processing shrinks the data considerably, but .compute() still
# materializes the entire result into memory.
my_ids = df.MyCol.apply(process_my_col).compute().values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)

How can I do this in parts? Something along the lines of:

import pandas as pd

reader = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in reader:
    # Process one chunk (a pandas DataFrame) at a time

Upvotes: 1

Views: 2855

Answers (2)

MRocklin

Reputation: 57271

I also recommend map_partitions when it suits your problem. However, if you really just want sequential access and an API similar to read_csv(chunksize=...), then you might be looking for the partitions attribute:

for part in df.partitions:
    # Each part is a lazy single-partition Dask DataFrame; compute() yields a pandas DataFrame.
    process(model, part.compute())
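
This pairs well with the fact that Keras' fit_on_texts accumulates word counts across calls, so the tokenizer can be fit one partition at a time. A minimal sketch, reusing MyCol and process_my_col from the question (the tensorflow.keras import path is an assumption about your Keras setup):

import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

tokenizer = Tokenizer()
for part in df.partitions:
    # Load a single partition (one pandas DataFrame) into memory.
    chunk = part.compute()
    # fit_on_texts updates the tokenizer's counts incrementally,
    # so fitting partition by partition covers the whole column.
    tokenizer.fit_on_texts(chunk.MyCol.apply(process_my_col))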

Upvotes: 0

Amin Gheibi

Reputation: 748

A Dask DataFrame is, under the hood, a collection of pandas DataFrames called partitions. When you pull out the underlying NumPy array you destroy that partitioning structure and end up with one big in-memory array. I recommend using the map_partitions function of Dask DataFrames to apply regular pandas functions to each partition separately.
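
A minimal sketch of that approach, again reusing MyCol and process_my_col from the question; the meta argument is an assumption about the output dtype and just tells Dask what to expect without computing anything:

import dask.dataframe as dd

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

# Apply the plain-pandas transformation to each partition lazily.
my_ids = df.MyCol.map_partitions(lambda s: s.apply(process_my_col),
                                 meta=('MyCol', 'object'))

# my_ids is still a lazy Dask Series; nothing is materialized until
# .compute() (or a per-partition loop, as in the other answer) is called.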

Upvotes: 2
