Kishan Mehta

Reputation: 2678

DASK - Read Huge CSV and write to 255 different CSV files

I am using Dask to read a CSV file of around 2 GB. I want to write each row of it to one of 255 separate CSV files, chosen by a hash function, as below.

My naive solution:

from dask import dataframe as dd

if __name__ == '__main__':
    df = dd.read_csv('train.csv', header=None, dtype='str')
    df = df.fillna('')  # fillna() needs a fill value; use an empty string
    for _, line in df.iterrows():
        number = hash(line[2]) % 256  # pick one of 256 buckets (0-255)
        with open("{}.csv".format(number), 'a+') as f:
            f.write(', '.join(line) + '\n')  # terminate each row with a newline

This approach takes around 15 minutes. Is there any way to do it faster?

Upvotes: 0

Views: 351

Answers (1)

mdurant

Reputation: 28683

Since your procedure is dominated by IO, it is very unlikely that Dask would do anything here but add overhead, unless your hash function is really slow, which I assume is not the case.

@zwer's solution would look something like:

files = [open("{}.csv".format(number), 'a+') for number in range(256)]  # one handle per bucket
for _, line in df.iterrows():
    number = hash(line[2]) % 256
    files[number].write(', '.join(line) + '\n')
for f in files:
    f.close()
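
For what it's worth, the same keep-all-handles-open idea can be written with contextlib.ExitStack so that every file gets closed even if something raises mid-loop. This is just a sketch of that variant, not part of the suggestion above:

from contextlib import ExitStack

# Sketch: same bucketing loop, but ExitStack guarantees every handle is
# closed even if an exception is raised while iterating.
with ExitStack() as stack:
    files = [stack.enter_context(open("{}.csv".format(n), 'a+'))
             for n in range(256)]
    for _, line in df.iterrows():
        files[hash(line[2]) % 256].write(', '.join(line) + '\n')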

However, your data appears to fit in memory, so you may find much better performance with something like:

# assuming df is an in-memory pandas DataFrame, since the data fits in memory
for number, group in df.groupby(df.iloc[:, 2].map(lambda value: hash(value) % 256)):
    group.to_csv("{}.csv".format(number), header=False, index=False)

because you write to each file continuously rather than jumping between them. Depending on your IO device and buffering, the difference can be none or huge.
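
If you do stay with the many-open-handles approach, one buffering knob worth experimenting with is the buffering argument of the built-in open(); a larger buffer means fewer, bigger writes per file. A rough sketch (the 1 MiB size is only an illustrative guess):

# Open each bucket file with a 1 MiB write buffer so rows hit the disk in
# larger chunks instead of many tiny appends.
files = [open("{}.csv".format(n), 'a+', buffering=1 << 20) for n in range(256)]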

Upvotes: 2
