Reputation: 425
Is it possible to partition a pandas dataframe to do multiprocessing?
Specifically, my DataFrames are simply too big; even a single transformation takes several minutes on one processor.
I know I could do this in Spark, but a lot of code has already been written, so I would prefer to stick with what I have and just add parallel functionality.
Upvotes: 3
Views: 6626
Reputation: 101
By slightly modifying https://stackoverflow.com/a/29281494/5351271, I was able to get a solution that works over rows.
import pandas
from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):
    # Fan the groups out to worker processes, apply func to each,
    # and stitch the results back together in order.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)

def apply_row_foo(input_df):
    # row_foo is the existing per-row transformation being parallelized
    return input_df.apply(row_foo, axis=1)

n_chunks = 10
# df.index // n_chunks puts consecutive rows into the same group,
# so each chunk holds up to n_chunks rows
grouped = df.groupby(df.index // n_chunks)
applyParallel(grouped, apply_row_foo)
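For context, a minimal end-to-end sketch using the applyParallel and apply_row_foo defined above; the DataFrame and row_foo here are hypothetical stand-ins for your own data and per-row transformation:

import numpy as np
import pandas

def row_foo(row):
    # hypothetical per-row work; substitute your real transformation
    return row['a'] * row['b']

df = pandas.DataFrame(np.random.rand(1000, 2), columns=['a', 'b'])

if __name__ == '__main__':
    n_chunks = 10
    grouped = df.groupby(df.index // n_chunks)
    result = applyParallel(grouped, apply_row_foo)  # a Series with one value per row

On platforms that spawn rather than fork worker processes (e.g. Windows), the if __name__ == '__main__': guard around the Pool call is required.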
If the index is not merely a row number, just group by np.arange(len(df)) // n_chunks instead.
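A minimal sketch of that variant (reusing n_chunks and apply_row_foo from above, with numpy imported as np):

import numpy as np

# group by row position rather than index value, so any index type works
grouped = df.groupby(np.arange(len(df)) // n_chunks)
applyParallel(grouped, apply_row_foo)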
Decidedly not elegant, but it worked in my use case.
Upvotes: 4