Reputation: 11
I have two DataFrames with many thousands of rows. The columns all have the dtype string. A snippet looks like this:
DF1:
ID SUCCESSOR
0 0001234 3620031
1 0001235 6640002
2 0002456 8620003
3 0013456 8640004
4 1711999 1283456 <- see DF2
... ... ...
409813 9162467 <NA>
409814 9212466 <NA>
409815 9312466 6975A0C
409816 9452463 <NA>
409817 9591227 <NA>
DF2:
ID
2 1111682
3 1123704
14 1567828
15 1711999 <- that value should be replaced with '1283456'
16 1711834
... ...
845775 970879B
845776 975879B
845777 9275A0A
845778 9285A05
845779 9295A05
Don't be surprised that the second DataFrame is missing a few indexes, because I filtered them before, as they were not relevant. Also, the NaNs are not relevant because my algorithm bypasses them.
I now want to replace the IDs in the second DataFrame with the successors in the first DataFrame that have the same ID.
The output should be:
ID
2 1111682
3 1123704
14 1567828
15 1283456 <- now replaced
16 1711834
... ...
845775 970879B
845776 975879B
845777 9275A0A
845778 9285A05
845779 9295A05
In order not to blow up the example, I have replaced only one value. In reality there are several replacements
Two approaches:
In my first approach I iterated through DF1 and used the replace() function, however this approach takes years, so its useless.
In my second approach, I first convert DF1 to a dictionary and then apply the map() function. I do this as described here by JohnE: Remap values in pandas column with a dict and in a small example it works wonderfully:
df = pd.DataFrame({'col1': {1: 1, 2: 2, 3: 4, 4: 1}, 'col2': {1: 2, 2: np.nan}})
di = {1: "A", 2: "B"}
col1 col2
1 1 2.0
2 2 NaN
3 4 NaN
4 1 NaN
df['col1'].map(di).fillna(df['col1'])
1 A
2 B
3 4
4 A
My function to map DF1 and DF2 looks like this:
def mapping(df1, df2):
di =dict(zip(df1.ID, df1.SUCCESSOR)) # create the dict
changes = 1
while(changes > 0):
changes = 0
df_old = df2
print(df2) #check how df2 looks before mapping.
df2['ID'] = df2['ID'].map(di).fillna(df2['ID'])
print(df2) # check how df2 looks after mapping. Unfortunately no changes :( so the error must be in the mapping function one line above here.
if df_old.equals(df2) == False:
changes = 1
return df2
So obviously the error must be in this line:
df2['ID'] = df2['ID'].map(dic).fillna(df2['ID']).
However, I just can't figure out why this doesn't work. What is not working here and why?
If anyone can help me, I am indebted to them forever!!!
Best regards, Alfonso
EDIT: EDIT: I found the error and I am an idiot. My solution worked, but the line: "df_old = df2" prevented the loop from continuing. MANY THANKS anyway and sorry if I have robbed time!
Upvotes: 1
Views: 220
Reputation: 9619
Here is a oneliner that creates the replacement dict by filtering the dataframes:
df2['ID'] = df2['ID'].replace(dict(zip(df2[df2['ID'].isin(df1['ID'])].sort_values(by=['ID']).reset_index()['ID'], df1.loc[df1['ID'].isin(df2['ID'])].sort_values(by=['ID']).reset_index()['SUCCESSOR'])))
Upvotes: 1