Reputation: 3938
I have 10,000 csv files which I have to open in Pandas, manipulate/transform using some of Pandas' functions, and save as a new csv. Could I use parallel processing (on Windows) to make the work faster? I tried the following, but no luck:
import pandas as pd
import multiprocessing

def proc_file(file):
    df = pd.read_csv(file)
    df = df.reample('1S', how='sum')
    df.to_csv('C:\\newfile.csv')

if __name__ == '__main__':
    files = ['C:\\file1.csv', ... 'C:\\file2.csv']

    for i in files:
        p = multiprocessing.Process(target=proc_file(i))
        p.start()
I don't think I have a good understanding of multiprocessing in Python.
Upvotes: 0
Views: 970
Reputation: 21981
Make sure to close the pool later too:
import multiprocessing

# Number of worker processes: leave one CPU free
max_workers = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(max_workers)

list_files = pool.map(func, list_of_csvs)

pool.close()
pool.join()
list_files will be a list of whatever func() returns for each input, e.g. you could return the name of the altered csv from func().
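Putting that together with the question's per-file transform, here is a minimal sketch (not this answer's exact code). It assumes a pandas version where resample('1S').sum() replaces the older how='sum' argument, assumes the first column of each csv is a timestamp to index on, and invents an output naming scheme, since writing every result to the same C:\newfile.csv (as in the question) would make the workers overwrite each other:

import glob
import os
import multiprocessing

import pandas as pd

def func(path):
    # Read one csv, resample to 1-second sums, and write the result
    # next to the original with a "_resampled" suffix (made-up naming)
    df = pd.read_csv(path, index_col=0, parse_dates=True)
    df = df.resample('1S').sum()
    out_path = os.path.splitext(path)[0] + '_resampled.csv'
    df.to_csv(out_path)
    return out_path  # collected into list_files by pool.map

if __name__ == '__main__':  # required for multiprocessing on Windows
    list_of_csvs = glob.glob('C:\\data\\*.csv')  # hypothetical input folder
    max_workers = multiprocessing.cpu_count() - 1
    pool = multiprocessing.Pool(max_workers)
    list_files = pool.map(func, list_of_csvs)
    pool.close()
    pool.join()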
Upvotes: 1
Reputation: 76307
Maybe something like this:
p = multiprocessing.Pool()
p.map(proc_file, files)
With this many files, you really need a process pool, so that the cost of launching a process is offset by the work it does. multiprocessing.Pool does exactly that: instead of one process per file (which is what you were doing), the tasks are spread over a fixed pool of worker processes.
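As a sketch of that idea (proc_file here is just a placeholder for the real per-file transform from the question), note that on Windows the pool must be created under the __main__ guard, and that a chunksize helps when there are thousands of short jobs:

import multiprocessing

import pandas as pd

def proc_file(path):
    # placeholder for the real per-file transform
    df = pd.read_csv(path)
    df.to_csv(path.replace('.csv', '_out.csv'))

if __name__ == '__main__':  # required on Windows
    files = ['C:\\file1.csv', 'C:\\file2.csv']  # the real list has 10,000 entries
    with multiprocessing.Pool() as p:
        # chunksize hands each worker batches of files, which cuts the
        # per-task overhead when there are thousands of quick jobs
        p.map(proc_file, files, chunksize=50)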
Upvotes: 1