Reputation: 4807
I have hundreds of CSV files, each storing the same number of columns. Instead of reading them one at a time, I want to read them with multiprocessing.
As a representative example I have created 4 files: Book1.csv, Book2.csv, Book3.csv, Book4.csv, each storing the numbers 1 through 5 in column A, starting at row 1.
I am trying the following:
import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    p = multiprocessing.Pool()
    for f in fname:
        p.apply_async(process, [f])
    p.close()
    p.join()
I got the idea for the above code from the link.
But the above code is not producing the desired result, which I expected to be:
1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5
Edit: I want to load each file in a separate process and combine the file contents. Since I have hundreds of files to load and combine, I was hoping to make the process faster by loading 4 files at a time (my PC has 4 cores).
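(For reference: apply_async returns AsyncResult objects, and the loop above discards them, so the loaded DataFrames are never collected. A minimal sketch of collecting them, assuming the same process function and fname list as above:)

    # Keep the AsyncResult handles instead of discarding them.
    results = [p.apply_async(process, [f]) for f in fname]
    p.close()
    p.join()
    frames = [r.get() for r in results]              # one DataFrame per file
    print(pd.concat(frames, ignore_index=True))      # combined contents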
Upvotes: 3
Views: 3400
Reputation: 182
Try this:
import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    with multiprocessing.Pool(5) as p:  # create a pool of 5 workers (note: Pool, not pool)
        result = p.map(process, fname)  # list of DataFrames, one per file
    print(len(result))
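map returns the DataFrames in the same order as fname, so the pieces can be concatenated afterwards. A minimal sketch of the combine step, scaled to hundreds of files; the glob pattern here is an assumption, adjust it to the real file names:

    import glob
    import os

    files = glob.glob(os.path.join(loc, 'Book*.csv'))  # assumed pattern for the full file set
    with multiprocessing.Pool(4) as p:                 # roughly one worker per CPU core
        frames = p.map(process, files)                 # DataFrames in input order
    combined = pd.concat(frames, ignore_index=True)
    print(combined)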
Upvotes: 3