Reputation: 425
Is it possible to partition a pandas dataframe to do multiprocessing?
Specifically, my DataFrames are simply too big; even a single transformation takes several minutes on one processor.
I know I could do this in Spark, but a lot of code has already been written, so I would prefer to stick with what I have and just add parallel functionality.
Upvotes: 3
Views: 6626
Reputation: 101
By slightly modifying https://stackoverflow.com/a/29281494/5351271, I was able to get a solution that works over rows.
import pandas
from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):
    # Fan the groups out to worker processes, apply func to each,
    # and stitch the results back together in order.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)

def apply_row_foo(input_df):
    # row_foo is the existing per-row transformation being parallelized
    return input_df.apply(row_foo, axis=1)

n_chunks = 10
# df.index // n_chunks puts consecutive rows into the same group,
# so each chunk holds up to n_chunks rows
grouped = df.groupby(df.index // n_chunks)
applyParallel(grouped, apply_row_foo)
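For context, a minimal end-to-end sketch using the applyParallel and apply_row_foo defined above; the DataFrame and row_foo here are hypothetical stand-ins for your own data and per-row transformation:

import numpy as np
import pandas

def row_foo(row):
    # hypothetical per-row work; substitute your real transformation
    return row['a'] * row['b']

df = pandas.DataFrame(np.random.rand(1000, 2), columns=['a', 'b'])

if __name__ == '__main__':
    n_chunks = 10
    grouped = df.groupby(df.index // n_chunks)
    result = applyParallel(grouped, apply_row_foo)  # a Series with one value per row

On platforms that spawn rather than fork worker processes (e.g. Windows), the if __name__ == '__main__': guard around the Pool call is required.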
If the index is not merely a row number, just group by np.arange(len(df)) // n_chunks instead.
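A minimal sketch of that variant (reusing n_chunks and apply_row_foo from above, with numpy imported as np):

import numpy as np

# group by row position rather than index value, so any index type works
grouped = df.groupby(np.arange(len(df)) // n_chunks)
applyParallel(grouped, apply_row_foo)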
Decidedly not elegant, but it worked in my use case.
Upvotes: 4