Reputation: 3938
I have 10,000 csv files which I have to open in Pandas, manipulate/transform using some of Pandas' functions, and save as a new csv. Could I use parallel processing (on Windows) to make the work faster? I tried the following, but no luck:
import pandas as pd
import multiprocessing

def proc_file(file):
    df = pd.read_csv(file)
    df = df.reample('1S', how='sum')
    df.to_csv('C:\\newfile.csv')

if __name__ == '__main__':
    files = ['C:\\file1.csv', ... 'C:\\file2.csv']

    for i in files:
        p = multiprocessing.Process(target=proc_file(i))
        p.start()
I don't think I have a good understanding of multiprocessing in Python.
Upvotes: 0
Views: 970
Reputation: 21981
Make sure to close the pool later too:
import multiprocessing

# Number of worker processes: leave one CPU free
max_workers = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(max_workers)

list_files = pool.map(func, list_of_csvs)

pool.close()
pool.join()
list_files will be a list of whatever func() returns for each input, e.g. you could return the name of the altered csv from func().
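Putting that together with the question's per-file transform, here is a minimal sketch (not this answer's exact code). It assumes a pandas version where resample('1S').sum() replaces the older how='sum' argument, assumes the first column of each csv is a timestamp to index on, and invents an output naming scheme, since writing every result to the same C:\newfile.csv (as in the question) would make the workers overwrite each other:

import glob
import os
import multiprocessing

import pandas as pd

def func(path):
    # Read one csv, resample to 1-second sums, and write the result
    # next to the original with a "_resampled" suffix (made-up naming)
    df = pd.read_csv(path, index_col=0, parse_dates=True)
    df = df.resample('1S').sum()
    out_path = os.path.splitext(path)[0] + '_resampled.csv'
    df.to_csv(out_path)
    return out_path  # collected into list_files by pool.map

if __name__ == '__main__':  # required for multiprocessing on Windows
    list_of_csvs = glob.glob('C:\\data\\*.csv')  # hypothetical input folder
    max_workers = multiprocessing.cpu_count() - 1
    pool = multiprocessing.Pool(max_workers)
    list_files = pool.map(func, list_of_csvs)
    pool.close()
    pool.join()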
Upvotes: 1
Reputation: 76307
Maybe something like this:
p = multiprocessing.Pool()
p.map(proc_file, files)
With this many files, you really need a process pool, so that the cost of launching a process is offset by the work it does. multiprocessing.Pool does exactly that: instead of one process per file (which is what you were doing), the tasks are spread over a fixed pool of worker processes.
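As a sketch of that idea (proc_file here is just a placeholder for the real per-file transform from the question), note that on Windows the pool must be created under the __main__ guard, and that a chunksize helps when there are thousands of short jobs:

import multiprocessing

import pandas as pd

def proc_file(path):
    # placeholder for the real per-file transform
    df = pd.read_csv(path)
    df.to_csv(path.replace('.csv', '_out.csv'))

if __name__ == '__main__':  # required on Windows
    files = ['C:\\file1.csv', 'C:\\file2.csv']  # the real list has 10,000 entries
    with multiprocessing.Pool() as p:
        # chunksize hands each worker batches of files, which cuts the
        # per-task overhead when there are thousands of quick jobs
        p.map(proc_file, files, chunksize=50)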
Upvotes: 1