Plug4

Reputation: 3938

Parallel processing a large number of tasks

I have 10,000 csv files that I have to open in Pandas, manipulate/transform using some of Pandas's functions, and save as new csv output. Could I use parallel processing (on Windows) to make the work faster? I tried the following, but with no luck:

import pandas as pd
import multiprocessing

def proc_file(file):
    df = pd.read_csv(file)
    df = df.resample('1S').sum()
    df.to_csv('C:\\newfile.csv')

if __name__ == '__main__':
    files = ['C:\\file1.csv', ... 'C:\\file2.csv']

    for i in files:
        p = multiprocessing.Process(target=proc_file(i))
    p.start()

I don't think I have a good understanding of multiprocessing in Python.

Upvotes: 0

Views: 970

Answers (2)

user308827

Reputation: 21981

Make sure to close the pool later too:

import multiprocessing

# Maximum number of cpus to use at a time
max_threads = multiprocessing.cpu_count() - 1

pool = multiprocessing.Pool(max_threads)
list_files = pool.map(func, list_of_csvs)
pool.close()
pool.join()

list_files will be a list of whatever func() returns for each input, e.g. you could return the name of the altered csv from func().
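
For instance, a minimal sketch of what func() might look like here (the datetime index handling, the '1S' resample rule, and the output naming are assumptions based on the question):

import os
import pandas as pd

def func(csv_path):
    # Read the file with a datetime index so resample() works.
    df = pd.read_csv(csv_path, index_col=0, parse_dates=True)
    # Resample to 1-second bins and sum, as in the question.
    df = df.resample('1S').sum()
    # Write next to the input instead of overwriting one fixed path (assumed naming scheme).
    out_path = os.path.splitext(csv_path)[0] + '_resampled.csv'
    df.to_csv(out_path)
    return out_path  # pool.map(func, list_of_csvs) collects these paths into list_files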

Upvotes: 1

Ami Tavory

Reputation: 76307

Maybe something like this:

p = multiprocessing.Pool()
p.map(proc_file, files)

For this size, you really need a process pool, so that the cost of launching a process is offset by the work it does. multiprocessing.Pool does exactly that: it turns one-process-per-task (which is what you were doing) into a fixed set of worker processes that share the tasks between them.
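
A fuller sketch along those lines, including the if __name__ == '__main__' guard that Windows needs for multiprocessing (the glob pattern, index handling, and per-file output naming are assumptions, not part of the original question):

import glob
import multiprocessing

import pandas as pd

def proc_file(file):
    # Read with a datetime index so resample() works, then aggregate to 1-second bins.
    df = pd.read_csv(file, index_col=0, parse_dates=True)
    df = df.resample('1S').sum()
    # Write a per-file output instead of overwriting a single fixed path.
    df.to_csv(file.replace('.csv', '_out.csv'))

if __name__ == '__main__':
    # The guard is required on Windows so worker processes can re-import this module safely.
    files = glob.glob('C:\\data\\*.csv')  # assumed location of the 10,000 csv files
    with multiprocessing.Pool() as p:
        p.map(proc_file, files)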

Upvotes: 1
