I have a rather complex computation that I want to optimize. I would like to leverage pandas/numpy vectorization, but I am not sure whether this task can be vectorized. Currently I am using a pandas DataFrame and a for loop to iterate over it.
Task explanation:
The lookup table weights gives the mapping. For example, the old "1" is split between the new "11" and "12", with the split proportions given as Weight: 60% of "1" is passed to "11" and 40% to "12". Another example: the old "2" was simply renamed to the new "20", so the split ratio there is 100%. Similarly, the old "3" is split into the new "31", "32" and "33".
The actual values for "1", "2", "3" are given in the table df. In the resulting table new_df, the old values need to be replaced with the new ones, with each value multiplied by the respective weights/ratios. The tables below illustrate this.
weights (it is a lookup table)
import pandas as pd

data = {'old': ['1', '1', '2', '3', '3', '3'], 'new': ['11', '12', '20', '31', '32', '33'], 'Weight': [0.6, 0.4, 1, 0.2, 0.5, 0.3]}
weights = pd.DataFrame(data)
weights
old new Weight
0 1 11 0.6
1 1 12 0.4
2 2 20 1
3 3 31 0.2
4 3 32 0.5
5 3 33 0.3
df (the actual data table/matrix)
data = {'A1': ['1', '1', '1', '2', '2', '2', '3', '3'], 'A2': ['1', '2', '3', '1', '2', '3', '1', '2'], 'value': [50, 40, 30, 10, 20, 70, 80, 90]}
df = pd.DataFrame(data)
df
A1 A2 value
0 1 1 50
1 1 2 40
2 1 3 30
3 2 1 10
4 2 2 20
5 2 3 70
6 3 1 80
7 3 2 90
new_df (the resulting dataframe)
A1_new A2_new value_new | calculation (only for explanation)
0 11 11 18 | 50 * 0.6 * 0.6
1 12 12 8 | 50 * 0.4 * 0.4
2 11 12 12 | 50 * 0.6 * 0.4
3 12 11 12 | 50 * 0.6 * 0.4
4 11 20 24 | 40 * 1 * 0.6
5 12 20 16 | 40 * 1 * 0.4
6 11 31 3.6 | 30 * 0.6 * 0.2
7 11 32 9 | 30 * 0.6 * 0.5
8 11 33 5.4 | 30 * 0.6 * 0.3
9 12 31 2.4 | 30 * 0.4 * 0.2
10 12 32 6 | 30 * 0.4 * 0.5
11 12 33 3.6 | 30 * 0.4 * 0.3
12 31 11 9.6 | 80 * 0.2 * 0.6
13 32 11 24 | 80 * 0.5 * 0.6
14 33 11 14.4 | 80 * 0.3 * 0.6
15 31 12 6.4 | 80 * 0.2 * 0.4
16 32 12 16 | 80 * 0.5 * 0.4
17 33 12 9.6 | 80 * 0.3 * 0.4
18 31 20 16 | 80 * 0.2 * 1
19 32 20 40 | 80 * 0.5 * 1
20 33 20 24 | 80 * 0.3 * 1
Below is the code I am using right now, which does the job. However, the example above is only a small sample of the data; the real data has several thousand rows, so I need to optimize this somehow.
l = []
for i in range(len(df)):
    # check if both A1 and A2 have changes in df.
    if weights['old'].isin([df.loc[i, 'A1']]).any() and weights['old'].isin([df.loc[i, 'A2']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A1']])):
            for k in range(len(weights[weights['old'] == df.loc[i, 'A2']])):
                n = [None] * 3
                n[0] = weights[weights['old'] == df.loc[i, 'A1']].values[j][1]
                n[1] = weights[weights['old'] == df.loc[i, 'A2']].values[k][1]
                n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A1']].values[j][2] * weights[weights['old'] == df.loc[i, 'A2']].values[k][2]
                l.append(n)
    # check if only A1 has any changes.
    elif weights['old'].isin([df.loc[i, 'A1']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A1']])):
            n = [None] * 3
            n[0] = weights[weights['old'] == df.loc[i, 'A1']].values[j][1]
            n[1] = df.loc[i, 'A2']
            n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A1']].values[j][2]
            l.append(n)
    # check if only A2 has any changes.
    elif weights['old'].isin([df.loc[i, 'A2']]).any():
        for j in range(len(weights[weights['old'] == df.loc[i, 'A2']])):
            n = [None] * 3
            n[0] = df.loc[i, 'A1']
            n[1] = weights[weights['old'] == df.loc[i, 'A2']].values[j][1]
            n[2] = df.loc[i, 'value'] * weights[weights['old'] == df.loc[i, 'A2']].values[j][2]
            l.append(n)

cols = ['A1_new', 'A2_new', 'value_new']
new_df = pd.DataFrame(l, columns=cols)
As far as I understand, you would like to split up each combination of (A1, A2) into all of its possible new combinations (A1_new, A2_new), each with its respective new value based on the weights.
From your desired output I found a small inconsistency. For the first row:
A1 A2 value
0 1 1 50
is it really true that you want only these two resulting rows:
A1_new A2_new value_new
0 11 11 30
1 12 12 20
and not four rows (i.e. all possible pairs)?
A1_new A2_new value_new
0 11 11 18.0
1 11 12 12.0
2 12 11 12.0
3 12 12 8.0
If that is really the case, please add a comment with the exact rule for when this happens and I will edit my answer.
As said above, the following solution uses all available pairs of A1 and A2 replacements. The general idea is to "replace" the old values in both columns via an inner join on each A column, and then compute the new value from the joined weights.
df = (
    df
    .merge(weights, left_on="A1", right_on="old")   # expand A1 into its new codes
    .drop(columns=["A1", "old"])
    .rename(columns={"new": "A1_new"})
    .merge(weights, left_on="A2", right_on="old")   # expand A2 into its new codes
    .drop(columns=["A2", "old"])
    .rename(columns={"new": "A2_new"})
)
# Weight_x / Weight_y are the suffixed weight columns from the two merges
df["value_new"] = df["value"] * df["Weight_x"] * df["Weight_y"]
df = df.drop(columns=["value", "Weight_x", "Weight_y"])
A1_new A2_new value_new
0 11 11 18.0
1 11 12 12.0
2 12 11 12.0
3 12 12 8.0
4 20 11 6.0
5 20 12 4.0
6 31 11 9.6
7 31 12 6.4
8 32 11 24.0
9 32 12 16.0
10 33 11 14.4
11 33 12 9.6
12 11 20 24.0
13 12 20 16.0
14 20 20 20.0
15 31 20 18.0
16 32 20 45.0
17 33 20 27.0
18 11 31 3.6
19 11 32 9.0
20 11 33 5.4
21 12 31 2.4
22 12 32 6.0
23 12 33 3.6
24 20 31 14.0
25 20 32 35.0
26 20 33 21.0
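One caveat: the inner joins above drop any row whose A1 or A2 code has no entry in weights, whereas your loop keeps such codes unchanged. If that behaviour matters, a possible variant (a sketch only, with a hypothetical unmapped code '4' added to the sample data for illustration) is to left-join and treat a missing mapping as the identity with weight 1:

```python
import pandas as pd

weights = pd.DataFrame({'old': ['1', '1', '2', '3', '3', '3'],
                        'new': ['11', '12', '20', '31', '32', '33'],
                        'Weight': [0.6, 0.4, 1, 0.2, 0.5, 0.3]})

# sample data plus one row ('4', '4') that has no mapping in weights
df = pd.DataFrame({'A1': ['1', '1', '1', '2', '2', '2', '3', '3', '4'],
                   'A2': ['1', '2', '3', '1', '2', '3', '1', '2', '4'],
                   'value': [50, 40, 30, 10, 20, 70, 80, 90, 5]})

new_df = (
    df
    .merge(weights, how="left", left_on="A1", right_on="old")
    .drop(columns=["old"])
    .rename(columns={"new": "A1_new", "Weight": "W1"})
    .merge(weights, how="left", left_on="A2", right_on="old")
    .drop(columns=["old"])
    .rename(columns={"new": "A2_new", "Weight": "W2"})
)

# unmapped codes keep their old label and get an implicit weight of 1
new_df["A1_new"] = new_df["A1_new"].fillna(new_df["A1"])
new_df["A2_new"] = new_df["A2_new"].fillna(new_df["A2"])
new_df["value_new"] = new_df["value"] * new_df["W1"].fillna(1) * new_df["W2"].fillna(1)
new_df = new_df[["A1_new", "A2_new", "value_new"]]
```

Since each old code's weights sum to 1, the total of value_new equals the total of the original value column, which is a handy sanity check.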