Reputation: 2678
I am using Dask to read a CSV file of around 2 GB. I want to write each row to one of 256 separate CSV files, chosen by a hash function as below.
from dask import dataframe as dd

if __name__ == '__main__':
    df = dd.read_csv('train.csv', header=None, dtype='str')
    df = df.fillna('')  # fillna() needs a value; fill missing cells with empty strings
    for _, line in df.iterrows():
        number = hash(line[2]) % 256  # pick one of the 256 output files
        with open("{}.csv".format(number), 'a+') as f:
            f.write(', '.join(line) + '\n')  # newline so each row ends up on its own line
This takes around 15 minutes. Is there any way to do it faster?
Upvotes: 0
Views: 351
Reputation: 28683
Since your procedure is dominated by IO, it is very unlikely that Dask would do anything but add overhead in this case, unless your hash function is really really slow. I assume that is not the case.
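If the file really does fit in memory, plain pandas is enough for everything below. A minimal sketch of the read step, reusing the filename, header and dtype settings from your question:
import pandas as pd
# load the whole ~2 GB file at once; dtype=str keeps every column as text
df = pd.read_csv('train.csv', header=None, dtype=str)
df = df.fillna('')  # fill missing cells with empty strings, as in the question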
@zwer's solution would look something like this:
# keep all 256 output files open so each row is appended without reopening a file
files = [open("{}.csv".format(number), 'a+') for number in range(256)]
for _, line in df.iterrows():
    number = hash(line[2]) % 256
    files[number].write(', '.join(line) + '\n')  # newline so rows don't run together
for f in files:
    f.close()
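If you want the files closed even when an exception interrupts the loop, contextlib.ExitStack can manage all 256 handles at once. A small sketch of the same idea:
from contextlib import ExitStack

with ExitStack() as stack:
    # open all 256 output files and register them for automatic closing
    files = [stack.enter_context(open("{}.csv".format(n), 'a+')) for n in range(256)]
    for _, line in df.iterrows():
        files[hash(line[2]) % 256].write(', '.join(line) + '\n')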
However, your data appears to fit in memory, so you may find much better performance from a single groupby that writes each output file in one go:
# group rows by hash bucket and write each bucket's file in one pass
for number, group in df.groupby(df.iloc[:, 2].map(hash) % 256):
    group.to_csv("{}.csv".format(number), header=False, index=False)  # no header/index, matching the appended rows above
because you write each file in one continuous stream rather than jumping between files. Depending on your IO device and buffering, the difference can range from negligible to huge.
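If the per-row append approach stays too slow, a larger write buffer can reduce the number of small writes that actually hit the disk. A minimal sketch, assuming a 1 MiB buffer is a reasonable size for your device:
# buffering is the standard optional argument of Python's built-in open()
f = open("0.csv", 'a+', buffering=1024 * 1024)  # buffer roughly 1 MiB before flushing to disk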
Upvotes: 2