Kishan Mehta

Reputation: 2678

DASK - Read Huge CSV and write to 255 different CSV files

I am using Dask to read a CSV file of around 2 GB. I want to write each row of it to one of 255 separate CSV files, chosen by a hash function, as below.

My naive solution:

from dask import dataframe as dd

if __name__ == '__main__':
    df = dd.read_csv('train.csv', header=None, dtype='str')
    df = df.fillna('')  # fillna() needs a fill value; use an empty string
    for _, line in df.iterrows():
        number = hash(line[2]) % 256  # pick one of 256 buckets (0-255)
        with open("{}.csv".format(number), 'a+') as f:
            f.write(', '.join(line) + '\n')  # terminate each row with a newline

This approach takes around 15 minutes. Is there any way to do it faster?

Upvotes: 0

Views: 351

Answers (1)

mdurant

Reputation: 28683

Since your procedure is dominated by IO, it is very unlikely that Dask would do anything here but add overhead, unless your hash function is really slow, which I assume is not the case.

@zwer's solution would look something like:

files = [open("{}.csv".format(number), 'a+') for number in range(256)]  # one handle per bucket
for _, line in df.iterrows():
    number = hash(line[2]) % 256
    files[number].write(', '.join(line) + '\n')
for f in files:
    f.close()
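
For what it's worth, the same keep-all-handles-open idea can be written with contextlib.ExitStack so that every file gets closed even if something raises mid-loop. This is just a sketch of that variant, not part of the suggestion above:

from contextlib import ExitStack

# Sketch: same bucketing loop, but ExitStack guarantees every handle is
# closed even if an exception is raised while iterating.
with ExitStack() as stack:
    files = [stack.enter_context(open("{}.csv".format(n), 'a+'))
             for n in range(256)]
    for _, line in df.iterrows():
        files[hash(line[2]) % 256].write(', '.join(line) + '\n')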

However, your data appears to fit in memory, so you may find much better performance with something like:

# assuming df is an in-memory pandas DataFrame, since the data fits in memory
for number, group in df.groupby(df.iloc[:, 2].map(lambda value: hash(value) % 256)):
    group.to_csv("{}.csv".format(number), header=False, index=False)

because you write to each file continuously rather than jumping between them. Depending on your IO device and buffering, the difference can be none or huge.
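
If you do stay with the many-open-handles approach, one buffering knob worth experimenting with is the buffering argument of the built-in open(); a larger buffer means fewer, bigger writes per file. A rough sketch (the 1 MiB size is only an illustrative guess):

# Open each bucket file with a 1 MiB write buffer so rows hit the disk in
# larger chunks instead of many tiny appends.
files = [open("{}.csv".format(n), 'a+', buffering=1 << 20) for n in range(256)]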

Upvotes: 2
