Lover Math

Reputation: 33

Why is appending to a list slower when using multiprocessing?

I want to build a list where each element is a large dataframe.
I tried to use the multiprocessing module to speed up appending to the list. My code is as follows:

import pandas as pd
import numpy as np
import time
import multiprocessing

from multiprocessing import Manager

def generate_df(size):
    df = pd.DataFrame()
    for x in list('abcdefghi'):
        df[x] = np.random.normal(size=size)
    return df

def do_something(df_list,size,k):
    df = generate_df(size)
    df_list[k] = df

if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()
    with Manager() as manager:
        df_list = manager.list(range(num_df))

        processes = []
        for k in range(num_df):
            p = multiprocessing.Process(target=do_something, args=(df_list,size,k,)) 
            p.start()
            processes.append(p)

        for process in processes:
            process.join()

        final_df = pd.concat(df_list)

        print(final_df.head())
        finish = time.perf_counter()
        print(f'Finished in {round(finish-start,2)} second(s)')
        print(len(final_df))

The elapsed time is 7 secs.

I also tried appending to the list without multiprocessing:

df_list = []
for _ in range(num_df):
    df_list.append(generate_df(size))

final_df = pd.concat(df_list)

But this time the elapsed time is 2 secs! Why is appending to the list with multiprocessing slower than without it?

Upvotes: 0

Views: 781

Answers (2)

Blckknght

Reputation: 104752

When you use manager.list, you're not using a normal Python list. You're using a special list proxy object that has a whole lot of other stuff going on. Every operation on that list will involve locking and interprocess communication so that every process with access to the list will see the same data in it at all times. It's slow because it's a non-trivial problem to keep everything consistent in that way.
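To get a feel for how big that per-operation cost is on its own, here is a minimal sketch (my own addition, not from the question) that times plain list assignments against assignments through a Manager list proxy; the loop size is arbitrary:

    import time
    from multiprocessing import Manager

    if __name__ == '__main__':
        n = 10_000

        # Ordinary in-process list: each assignment is just a pointer write.
        plain = [None] * n
        start = time.perf_counter()
        for i in range(n):
            plain[i] = i
        print(f'plain list:   {time.perf_counter() - start:.4f} s')

        # Manager list proxy: each assignment is a locked round trip
        # to the manager process over a pipe/socket.
        with Manager() as manager:
            proxied = manager.list([None] * n)
            start = time.perf_counter()
            for i in range(n):
                proxied[i] = i
            print(f'manager list: {time.perf_counter() - start:.4f} s')

On a typical machine the proxied loop is dramatically slower, even with no contention from other processes.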

You probably don't need all of that synchronization; it's just slowing you down. A much more natural way to do what you're attempting is to use a process pool and its map method. The pool handles creating and shutting down the worker processes, and map calls the target function once for each item in an iterable.

Try something like this, which will use a number of worker processes equal to the number of CPUs your system has:

if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()

    with multiprocessing.Pool() as pool:
        df_list = pool.map(generate_df, [size]*num_df)

    final_df = pd.concat(df_list)
    print(final_df.head())
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))

This will still have some overhead, since the interprocess communication used to pass the dataframes back to the main process is not free. It may still be slower than running everything in a single process.
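If you want a rough sense of how much data each worker has to ship back, you can check the pickled size of one dataframe. This is my own back-of-the-envelope check, reusing generate_df from the question:

    import pickle

    # Each worker's return value is pickled and sent back over a pipe,
    # so the serialized size is a decent proxy for the per-task IPC cost.
    df = generate_df(200000)
    payload = pickle.dumps(df)
    print(f'{len(payload) / 1e6:.1f} MB per dataframe returned to the parent')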

Upvotes: 4

Glauco

Reputation: 1465

Two points:

  • Starting subprocesses and retrieving data from them has a cost: the data must be transported between processes. This means that if the transport time exceeds the time it takes to compute the data, you see no benefit, as the sketch after this list illustrates. This article explains the issue in more detail.
  • In your implementation the bottleneck is the use of df_list. The Manager uses a lock, which means the processes are not free to write their results into df_list concurrently.
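As a quick illustration of the first point, here is a small sketch (my addition, not from the answer) that compares the time to compute one dataframe with the time to serialize it for transport; when the two are of the same order, the gains from multiprocessing are limited:

    import pickle
    import time

    import numpy as np
    import pandas as pd

    def generate_df(size):
        # Same generator as in the question.
        df = pd.DataFrame()
        for x in list('abcdefghi'):
            df[x] = np.random.normal(size=size)
        return df

    start = time.perf_counter()
    df = generate_df(200000)
    print(f'compute:   {time.perf_counter() - start:.3f} s')

    start = time.perf_counter()
    pickle.dumps(df)  # roughly what multiprocessing does to move the result between processes
    print(f'transport: {time.perf_counter() - start:.3f} s')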

Upvotes: 1
