Reputation: 21961
I have a dataframe on which I perform some operation and then write the results out. To do this, I have to iterate through each row:
    for count, row in final_df.iterrows():
        x = row['param_a']
        y = row['param_b']
        # Perform operation
        # Write to output file
I decided to parallelize this using the Python multiprocessing module:
    import multiprocessing

    def write_site_files(row, pkg_num):
        x = row['param_a']
        y = row['param_b']
        # Perform operation
        # Write to output file

    num_proc = 4                      # number of worker processes
    pkg_num = 0
    total_runs = final_df.shape[0]    # total number of rows in final_df
    threads = []

    while pkg_num < total_runs or len(threads):
        if len(threads) < num_proc and pkg_num < total_runs:
            print(pkg_num, total_runs)
            t = multiprocessing.Process(target=write_site_files,
                                        args=[final_df.iloc[pkg_num], pkg_num])
            pkg_num = pkg_num + 1
            t.start()
            threads.append(t)
        else:
            # Reap finished processes so new ones can be started
            for thread in threads:
                if not thread.is_alive():
                    threads.remove(thread)
However, the latter (parallelized) method is much slower than the simple iteration-based approach. Is there anything I am missing?
Thanks!
Upvotes: 2
Views: 3459
Reputation: 128918
This will be way less efficient than doing it in a single process unless the actual operation takes a lot of time, like seconds per row.
Normally parallelization is the last tool in the box. After profiling, after local vectorization, after local optimization, then you parallelize.
You are spending time just doing the slicing, then spinning up new processes (which is generally a constant overhead), then pickling a single row (not clear how big it is from your example).
At the very least, you should chunk the rows, e.g. df.iloc[i*chunksize:(i+1)*chunksize].
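If the per-row work really is heavy enough to justify separate processes, a minimal sketch of the chunked approach could look like the following. The names process_chunk, num_proc, and chunksize, and the toy DataFrame, are illustrative assumptions, not part of the original code:

    # Sketch: send whole DataFrame chunks to a worker pool instead of one row per process.
    import multiprocessing

    import numpy as np
    import pandas as pd


    def process_chunk(chunk):
        # Each worker receives an entire DataFrame chunk (one pickle per chunk,
        # not one per row) and iterates over it locally.
        for _, row in chunk.iterrows():
            x = row['param_a']
            y = row['param_b']
            # Perform operation / write to output file here


    if __name__ == '__main__':
        # Placeholder data standing in for final_df
        final_df = pd.DataFrame({'param_a': np.arange(1000),
                                 'param_b': np.arange(1000) * 2})

        num_proc = 4
        chunksize = 100
        chunks = [final_df.iloc[i:i + chunksize]
                  for i in range(0, final_df.shape[0], chunksize)]

        pool = multiprocessing.Pool(num_proc)
        pool.map(process_chunk, chunks)
        pool.close()
        pool.join()

Each worker then pays the pickling and process-startup cost once per chunk instead of once per row.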
There hopefully will be some support for parallel apply
in 0.14, see here: https://github.com/pydata/pandas/issues/5751
Upvotes: 6