Alfonso
Alfonso

Reputation: 11

Replace values of a DataFrame with values of a dictionary

I have two DataFrames with many thousands of rows. The columns all have the dtype string. A snippet looks like this:

DF1:

                 ID   SUCCESSOR 
    0       0001234     3620031
    1       0001235     6640002
    2       0002456     8620003
    3       0013456     8640004
    4       1711999     1283456 <- see DF2
    ...         ...         ... 
    409813  9162467        <NA>
    409814  9212466        <NA>
    409815  9312466     6975A0C
    409816  9452463        <NA>
    409817  9591227        <NA>

DF2:

                 ID
    2       1111682
    3       1123704
    14      1567828
    15      1711999 <- that value should be replaced with '1283456'
    16      1711834
    ...         ...
    845775  970879B
    845776  975879B
    845777  9275A0A
    845778  9285A05
    845779  9295A05

Don't be surprised that the second DataFrame is missing a few indexes, because I filtered them before, as they were not relevant. Also, the NaNs are not relevant because my algorithm bypasses them.

I now want to replace the IDs in the second DataFrame with the successors in the first DataFrame that have the same ID.

The output should be:

                 ID
    2       1111682
    3       1123704
    14      1567828
    15      1283456 <- now replaced
    16      1711834
    ...         ...
    845775  970879B
    845776  975879B
    845777  9275A0A
    845778  9285A05
    845779  9295A05

In order not to blow up the example, I have replaced only one value. In reality there are several replacements

Two approaches:

  1. In my first approach I iterated through DF1 and used the replace() function, however this approach takes years, so its useless.

  2. In my second approach, I first convert DF1 to a dictionary and then apply the map() function. I do this as described here by JohnE: Remap values in pandas column with a dict and in a small example it works wonderfully:

     df = pd.DataFrame({'col1': {1: 1, 2: 2, 3: 4, 4: 1}, 'col2': {1: 2, 2: np.nan}})
     di = {1: "A", 2: "B"}
    
     col1  col2
     1     1   2.0
     2     2   NaN
     3     4   NaN
     4     1   NaN
    
     df['col1'].map(di).fillna(df['col1'])
    
     1    A
     2    B
     3    4
     4    A
    

My function to map DF1 and DF2 looks like this:

    def mapping(df1, df2):   

        di =dict(zip(df1.ID, df1.SUCCESSOR)) # create the dict
        changes = 1   

        while(changes > 0):        
    
            changes = 0
            df_old = df2                    
            print(df2) #check how df2 looks before mapping.
            df2['ID'] = df2['ID'].map(di).fillna(df2['ID'])                   
            print(df2) # check how df2 looks after mapping. Unfortunately no changes :( so the error must be in the mapping function one line above here.
            if df_old.equals(df2) == False:
                changes = 1    
    
        return df2

So obviously the error must be in this line:

    df2['ID'] = df2['ID'].map(dic).fillna(df2['ID']).

However, I just can't figure out why this doesn't work. What is not working here and why?

If anyone can help me, I am indebted to them forever!!!

Best regards, Alfonso

EDIT: EDIT: I found the error and I am an idiot. My solution worked, but the line: "df_old = df2" prevented the loop from continuing. MANY THANKS anyway and sorry if I have robbed time!

Upvotes: 1

Views: 220

Answers (1)

RJ Adriaansen
RJ Adriaansen

Reputation: 9619

Here is a oneliner that creates the replacement dict by filtering the dataframes:

df2['ID'] = df2['ID'].replace(dict(zip(df2[df2['ID'].isin(df1['ID'])].sort_values(by=['ID']).reset_index()['ID'], df1.loc[df1['ID'].isin(df2['ID'])].sort_values(by=['ID']).reset_index()['SUCCESSOR'])))

Upvotes: 1

Related Questions