Reputation: 879
I have a few hundred thousand CSV files, and I would like to apply the same function to each of them. Something like the following dummy function:
import os
import pandas as pd

def process_single_file(fname):
    df = pd.read_csv(fname)
    # Pandas and non-pandas processing
    df.to_csv(f"./output/{os.path.basename(fname)}")
Since looping over all the files sequentially would take too long, my question is: what is the most efficient way to schedule and parallelize this work? None of the jobs depend on each other. I started off trying Python's multiprocessing:
import glob
import multiprocessing

files = sorted(glob.glob("./input/*.csv"))

processes = []
for fname in files:
    p = multiprocessing.Process(target=process_single_file, args=(fname,))
    processes.append(p)
    p.start()

for process in processes:
    process.join()
My computer, however, doesn't handle this well: starting one process per file quickly overloads all CPUs, leading to slow-downs and crashes. Is there a more efficient way to limit the CPU load and schedule the tasks, for example with Dask, a Bash script, or a different Python approach? Thanks in advance.
Upvotes: 0
Views: 128
Reputation: 944
It really depends on where your bottleneck is: are you spending more time reading/writing files, or on CPU-bound processing?
This RealPython tutorial helped me a lot when learning about all this; I can only recommend it ;)
As explained in the tutorial, if your workload is I/O-bound, multithreading is enough (and possibly better than multiprocessing):
import concurrent.futures

def process_all_files(files):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(process_single_file, files)
And if it is CPU-bound, multiprocessing will let you use all your available cores:
import multiprocessing

def process_all_files(files):
    with multiprocessing.Pool() as pool:
        pool.map(process_single_file, files)
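With a few hundred thousand files it may also be worth passing a chunksize (or using imap_unordered) so that each worker picks up batches of file names rather than one at a time; a minimal sketch, assuming process_single_file from the question (the chunksize of 100 is only an illustrative value):

import multiprocessing

def process_all_files(files):
    with multiprocessing.Pool() as pool:
        # chunksize batches tasks per worker and reduces inter-process overhead;
        # imap_unordered yields as soon as any worker finishes a batch
        for _ in pool.imap_unordered(process_single_file, files, chunksize=100):
            pass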
Upvotes: 2
Reputation: 290
Absolutely, Pool is the way to go. Something along the lines of the following:

from multiprocessing import Pool

def f(x): return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        results = pool.map(f, range(10))
Check the following post: Using multiprocessing.Process with a maximum number of simultaneous processes
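For reference, the same idea of capping the number of simultaneous processes can also be sketched with concurrent.futures (reusing process_single_file from the question; max_workers=4 is just an example value):

import concurrent.futures
import glob

if __name__ == '__main__':
    files = sorted(glob.glob("./input/*.csv"))
    # At most 4 worker processes run at once; the remaining files queue up.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        executor.map(process_single_file, files)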
Upvotes: 0
Reputation: 40
You can try Ray; it is quite an efficient framework for parallelizing tasks.
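A minimal sketch of how this could look with Ray, assuming the processing body and directory layout from the question (ray.init(), the remote decorator, and ray.get() are the standard Ray task API):

import glob
import os

import pandas as pd
import ray

ray.init()  # by default uses all locally available CPU cores

@ray.remote
def process_remote(fname):
    # same work as the question's process_single_file
    df = pd.read_csv(fname)
    # Pandas and non-pandas processing
    df.to_csv(f"./output/{os.path.basename(fname)}")

files = sorted(glob.glob("./input/*.csv"))
futures = [process_remote.remote(fname) for fname in files]
ray.get(futures)  # block until every file has been processed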
Upvotes: 1