I have a rather complex computation that I want to optimize. I would like to leverage pandas/numpy vectorization, but I am not sure whether this task can be vectorized. Currently I am using a pandas DataFrame and a for loop to iterate over it.
Task explanation:
The lookup table weights gives the mapping. For example, the old "1" is split between the new "11" and "12", with the split proportions given as Weight: 60% of "1" is passed to "11" and 40% to "12". Another example: the old "2" was simply renamed to the new "20", so the split ratio there is 100%. Similarly, the old "3" is split into the new "31", "32" and "33".
The actual values for "1", "2", "3" are given in the table df. In the resulting table new_df, the old values need to be replaced with the new ones, with each value multiplied by the respective weights/ratios. The tables below illustrate this.
weights (it is a lookup table)
import pandas as pd

data = {'old': ['1', '1', '2', '3', '3', '3'], 'new': ['11', '12', '20', '31', '32', '33'], 'Weight': [0.6, 0.4, 1, 0.2, 0.5, 0.3]}
weights = pd.DataFrame(data)
weights
old new Weight
0 1 11 0.6
1 1 12 0.4
2 2 20 1
3 3 31 0.2
4 3 32 0.5
5 3 33 0.3
df (the actual data table/matrix)
data = {'A1': ['1', '1', '1', '2', '2', '2', '3', '3'], 'A2': ['1', '2', '3', '1', '2', '3', '1', '2'], 'value': [50, 40, 30, 10, 20, 70, 80, 90]}
df = pd.DataFrame(data)
df
A1 A2 value
0 1 1 50
1 1 2 40
2 1 3 30
3 2 1 10
4 2 2 20
5 2 3 70
6 3 1 80
7 3 2 90
new_df (the resulting dataframe)
A1_new A2_new value_new | calculation (only for explanation)
0 11 11 18 | 50 * 0.6 * 0.6
1 12 12 8 | 50 * 0.4 * 0.4
2 11 12 12 | 50 * 0.6 * 0.4
3 12 11 12 | 50 * 0.6 * 0.4
4 11 20 24 | 40 * 1 * 0.6
5 12 20 16 | 40 * 1 * 0.4
6 11 31 3.6 | 30 * 0.6 * 0.2
7 11 32 9 | 30 * 0.6 * 0.5
8 11 33 5.4 | 30 * 0.6 * 0.3
9 12 31 2.4 | 30 * 0.4 * 0.2
10 12 32 6 | 30 * 0.4 * 0.5
11 12 33 3.6 | 30 * 0.4 * 0.3
12 31 11 9.6 | 80 * 0.2 * 0.6
13 32 11 24 | 80 * 0.5 * 0.6
14 33 11 14.4 | 80 * 0.3 * 0.6
15 31 12 6.4 | 80 * 0.2 * 0.4
16 32 12 16 | 80 * 0.5 * 0.4
17 33 12 9.6 | 80 * 0.3 * 0.4
18 31 20 16 | 80 * 0.2 * 1
19 32 20 40 | 80 * 0.5 * 1
20 33 20 24 | 80 * 0.3 * 1
Below is the code I am using right now, which does the job. However, the example above is only a small sample of the data; the real data has several thousand rows, so I need to optimize this somehow.
l = []
for i in range(len(df)):
    # check if both A1 and A2 have changes in df.
    if weights['old'].isin([df.loc[i, 'A1']]).any() and weights['old'].isin([df.loc[i, 'A2']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A1']])):
            for k in range(len(weights[weights['old'] == df.loc[i, 'A2']])):
                n = [None] * 3
                n[0] = weights[weights['old'] == df.loc[i, 'A1']].values[j][1]
                n[1] = weights[weights['old'] == df.loc[i, 'A2']].values[k][1]
                n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A1']].values[j][2] * weights[weights['old'] == df.loc[i, 'A2']].values[k][2]
                l.append(n)
    # check if only A1 has any changes.
    elif weights['old'].isin([df.loc[i, 'A1']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A1']])):
            n = [None] * 3
            n[0] = weights[weights['old'] == df.loc[i, 'A1']].values[j][1]
            n[1] = df.loc[i, 'A2']
            n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A1']].values[j][2]
            l.append(n)
    # check if only A2 has any changes.
    elif weights['old'].isin([df.loc[i, 'A2']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A2']])):
            n = [None] * 3
            n[0] = df.loc[i, 'A1']
            n[1] = weights[weights['old'] == df.loc[i, 'A2']].values[j][1]
            n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A2']].values[j][2]
            l.append(n)

cols = ['A1_new', 'A2_new', 'value_new']
new_df = pd.DataFrame(l, columns=cols)
As far as I understand, you would like to split up each combination of (A1, A2) into all of its possible new combinations (A1_new, A2_new), each with its respective new value based on the weights.
From your desired output I found a small inconsistency. For the first row:
A1 A2 value
0 1 1 50
is it really true that you want only these two resulting rows:
A1_new A2_new value_new
0 11 11 30
1 12 12 20
and not four rows (i.e. all possible pairs)?
A1_new A2_new value_new
0 11 11 18.0
1 11 12 12.0
2 12 11 12.0
3 12 12 8.0
If that is really the case, please add a comment with the exact rule for when this happens and I will edit my answer.
As said above, the following solution uses all available pairs of A1 and A2 replacements. The general idea is to "replace" the old values in both columns via an inner join on each A column, and then compute the new value from the joined weights.
df = (
    df
    .merge(weights, left_on="A1", right_on="old")   # expand A1 into its new codes
    .drop(columns=["A1", "old"])
    .rename(columns={"new": "A1_new"})
    .merge(weights, left_on="A2", right_on="old")   # expand A2 into its new codes
    .drop(columns=["A2", "old"])
    .rename(columns={"new": "A2_new"})
)
# Weight_x / Weight_y are the suffixed weight columns from the two merges
df["value_new"] = df["value"] * df["Weight_x"] * df["Weight_y"]
df = df.drop(columns=["value", "Weight_x", "Weight_y"])
A1_new A2_new value_new
0 11 11 18.0
1 11 12 12.0
2 12 11 12.0
3 12 12 8.0
4 20 11 6.0
5 20 12 4.0
6 31 11 9.6
7 31 12 6.4
8 32 11 24.0
9 32 12 16.0
10 33 11 14.4
11 33 12 9.6
12 11 20 24.0
13 12 20 16.0
14 20 20 20.0
15 31 20 18.0
16 32 20 45.0
17 33 20 27.0
18 11 31 3.6
19 11 32 9.0
20 11 33 5.4
21 12 31 2.4
22 12 32 6.0
23 12 33 3.6
24 20 31 14.0
25 20 32 35.0
26 20 33 21.0
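One caveat: the inner joins above drop any row whose A1 or A2 code has no entry in weights, whereas your loop keeps such codes unchanged. If that behaviour matters, a possible variant (a sketch only, with a hypothetical unmapped code '4' added to the sample data for illustration) is to left-join and treat a missing mapping as the identity with weight 1:

```python
import pandas as pd

weights = pd.DataFrame({'old': ['1', '1', '2', '3', '3', '3'],
                        'new': ['11', '12', '20', '31', '32', '33'],
                        'Weight': [0.6, 0.4, 1, 0.2, 0.5, 0.3]})

# sample data plus one row ('4', '4') that has no mapping in weights
df = pd.DataFrame({'A1': ['1', '1', '1', '2', '2', '2', '3', '3', '4'],
                   'A2': ['1', '2', '3', '1', '2', '3', '1', '2', '4'],
                   'value': [50, 40, 30, 10, 20, 70, 80, 90, 5]})

new_df = (
    df
    .merge(weights, how="left", left_on="A1", right_on="old")
    .drop(columns=["old"])
    .rename(columns={"new": "A1_new", "Weight": "W1"})
    .merge(weights, how="left", left_on="A2", right_on="old")
    .drop(columns=["old"])
    .rename(columns={"new": "A2_new", "Weight": "W2"})
)

# unmapped codes keep their old label and get an implicit weight of 1
new_df["A1_new"] = new_df["A1_new"].fillna(new_df["A1"])
new_df["A2_new"] = new_df["A2_new"].fillna(new_df["A2"])
new_df["value_new"] = new_df["value"] * new_df["W1"].fillna(1) * new_df["W2"].fillna(1)
new_df = new_df[["A1_new", "A2_new", "value_new"]]
```

Since each old code's weights sum to 1, the total of value_new equals the total of the original value column, which is a handy sanity check.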