Reputation: 21961
I have a dataframe on which I perform some operation and then write the results out. To do this, I have to iterate through each row:
    for count, row in final_df.iterrows():
        x = row['param_a']
        y = row['param_b']
        # Perform operation
        # Write to output file
I decided to parallelize this using the Python multiprocessing module:
    import multiprocessing

    def write_site_files(row, pkg_num):
        x = row['param_a']
        y = row['param_b']
        # Perform operation
        # Write to output file

    num_proc = 4                      # number of worker processes
    pkg_num = 0
    total_runs = final_df.shape[0]    # total number of rows in final_df
    threads = []

    while pkg_num < total_runs or len(threads):
        if len(threads) < num_proc and pkg_num < total_runs:
            print(pkg_num, total_runs)
            t = multiprocessing.Process(target=write_site_files,
                                        args=[final_df.iloc[pkg_num], pkg_num])
            pkg_num = pkg_num + 1
            t.start()
            threads.append(t)
        else:
            # Reap finished processes so new ones can be started
            for thread in threads:
                if not thread.is_alive():
                    threads.remove(thread)
However, the latter (parallelized) method is much slower than the simple iteration-based approach. Is there anything I am missing?
Thanks!
Upvotes: 2
Views: 3459
Reputation: 128918
This will be way less efficient than doing it in a single process unless the actual operation takes a lot of time, like seconds per row.
Normally parallelization is the last tool in the box. After profiling, after local vectorization, after local optimization, then you parallelize.
You are spending time just doing the slicing, then spinning up new processes (which is generally a constant overhead), then pickling a single row (not clear how big it is from your example).
At the very least, you should chunk the rows, e.g. df.iloc[i*chunksize:(i+1)*chunksize].
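If the per-row work really is heavy enough to justify separate processes, a minimal sketch of the chunked approach could look like the following. The names process_chunk, num_proc, and chunksize, and the toy DataFrame, are illustrative assumptions, not part of the original code:

    # Sketch: send whole DataFrame chunks to a worker pool instead of one row per process.
    import multiprocessing

    import numpy as np
    import pandas as pd


    def process_chunk(chunk):
        # Each worker receives an entire DataFrame chunk (one pickle per chunk,
        # not one per row) and iterates over it locally.
        for _, row in chunk.iterrows():
            x = row['param_a']
            y = row['param_b']
            # Perform operation / write to output file here


    if __name__ == '__main__':
        # Placeholder data standing in for final_df
        final_df = pd.DataFrame({'param_a': np.arange(1000),
                                 'param_b': np.arange(1000) * 2})

        num_proc = 4
        chunksize = 100
        chunks = [final_df.iloc[i:i + chunksize]
                  for i in range(0, final_df.shape[0], chunksize)]

        pool = multiprocessing.Pool(num_proc)
        pool.map(process_chunk, chunks)
        pool.close()
        pool.join()

Each worker then pays the pickling and process-startup cost once per chunk instead of once per row.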
There hopefully will be some support for parallel apply
in 0.14, see here: https://github.com/pydata/pandas/issues/5751
Upvotes: 6