William
William

Reputation: 4034

Python Pandas multiprocessing no result return

I have a df,you can have it by copy paste:

import pandas as pd
from io import StringIO

df = """
  ValOption  RB test 
0       SLA  4  3       
1       AC   5  4           
2       SLA  5  5          
3       AC   2  4       
4       SLA  5  5         
5       AC   3  4          
6       SLA  4  3         

"""
df = pd.read_csv(StringIO(df.strip()), sep='\s+')

Output:

ValOption   RB  test
0     SLA   4   3
1     AC    5   4
2     SLA   5   5
3     AC    2   4
4     SLA   5   5
5     AC    3   4
6     SLA   4   3

Then I have 2 functions to build new columns for this df:

def func1():
    df['r1']=df['test']+1
    return df['r1']

def func2():
    df['r2']=df['RB']+1
    return df['r2']

After I call these 2 functions:

func1()
func2()

Output:

ValOption   RB  test    r1  r2
0    SLA    4   3      4    5
1     AC    5   4      5    6
2     SLA   5   5      6    6
3     AC    2   4      5    3
4     SLA   5   5      6    6
5     AC    3   4      5    4
6     SLA   4   3      4    5

But when I tried to use multiprocessing I can't get the new columns:

import multiprocessing
if __name__ ==  '__main__':

    p1 = multiprocessing.Process(target=func1)
    p2 = multiprocessing.Process(target=func2)

    p1.start()
    p2.start()

    p1.join()
    p2.join()

Output:

ValOption   RB  test
0    SLA    4   3
1     AC    5   4
2    SLA    5   5
3     AC    2   4
4    SLA    5   5
5     AC    3   4
6    SLA    4   3

The multiprocessing didn't return the values in the functions .Any friend can help?

Upvotes: 1

Views: 305

Answers (4)

Pascal G. Bernard
Pascal G. Bernard

Reputation: 279

ok, then change your code by creating a class :

from multiprocessing import Process

class Test:
    def __init__(self, df):
        self.df = df
        
    def func1(self):
        df['r1'] = df['test']+1

    def func2(self):
        df['r2'] = df['RB']+1

p1 = Process(target=Test(df).func1())
p2 = Process(target=Test(df).func2())

p1.start()
p2.start()

p1.join()
p2.join()

This should work, for sure

Upvotes: 1

Cimbali
Cimbali

Reputation: 11405

If you use a multiprocessing.Pool and rewrite your function to be more generic, you can use to map an input to an output:

>>> def func(series):
...   return series + 1
... 
>>> with multiprocessing.Pool(2) as p:
...   dat = p.map(func, [df['test'].rename('r1'), df['RB'].rename('r2')])
... 

Then outside of parallel processing, modify the dataframe with the obtained results, using e.g. df.join():

>>> df.join(dat)
  ValOption  RB  test  r1  r2
0       SLA   4     3   4   5
1        AC   5     4   5   6
2       SLA   5     5   6   6
3        AC   2     4   5   3
4       SLA   5     5   6   6
5        AC   3     4   5   4
6       SLA   4     3   4   5

Otherwise your best bet to get results would be a task queue, see the example at the bottom of the multiprocessing page. Again, you want to do the computations in the tasks, but not modify any shared data structure. After they execute you can join them together again.

More complex solutions would likely have to resort to multiprocessing.Manager subclasses, as pandas series and dataframes are complex objects not suited for the multiprocessing.sharedctypes.* options.

Upvotes: 1

Pascal G. Bernard
Pascal G. Bernard

Reputation: 279

I am guessing you're using a notebook and you're trying the cell containing if __name__ == '__main__': ?

if so, just run the function outside it - like that :

import multiprocessing

p1 = multiprocessing.Process(target=func1)
p2 = multiprocessing.Process(target=func2)

p1.start()
p2.start()

p1.join()
p2.join()

Or keep it but in this case, execute it as a python file.

Upvotes: 0

2e0byo
2e0byo

Reputation: 5954

Pandas dataframes are definitely not thread safe. Think about what would happen if you were halfway through func1 when func2 finished! (And pandas is definitely not atomic).

Fortunately multiprocessing has just copied the variable and worked on the copy (actually it has serialised the variable and sent it to the child process). So if you want to do work in multiprocessing, you adopt this workflow:

  • break task into steps
  • worker functions take steps and compute results
  • compile results and apply them back to the object

Have a look at some tutorials for multiprocessing.Pool to see how this is done.

Upvotes: 0

Related Questions