Pandas merge giving wrong output

Question

Ok I have gone through some blogs related to this topic - but I am still getting the same problem. I have two dataframes. Both have a column X which have SHA2 values in them. It contains hex strings.

Example (Dataframe lookup)

X,Y
000000000E000394574D69637264736F66742057696E646F7773204861726477,7
0000000080000000000000090099000000040005000000000000008F2A000010,7
000000020000000000000000777700010000000000020000000040C002004600,24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64,7

Example (Dataframe source)

X,Z
000000000E000394574D69637264736F66742057696E646F7773204861726477,'blah'
0000000080000000000000090099000000040005000000000000008F2A000010,'blah blah'
000000020000000000000000777700010000000000020000000040C002004600,'dummy'

etc.

So now I am doing

lookup['X'] = lookup['X'].astype(str)
source['X'] = source['X'].astype(str)
source['newcolumn'] = source.merge(lookup, on='X', how='inner')['Y']

The source has 160,000 rows and the lookup has around 500,000 rows.

Now, when the operation finishes, I get newcolumn but the values are wrong. I have made sure that they are not being picked up from duplicate values of X, because there are no duplicate X in either table.

So, this is really making me feel dumb and gave me quite a pain in my live systems. Can anyone suggest what is the problem ?

I have now replaced the call with

def getReputation(lookupDF,value,lookupcolumn,default):
    lookupRows = lookupDF.loc[lookupDF['X']==value]
    if lookupRows.shape[0]>0:
        return lookupRows[lookupcolumn].values[0]
    else:
        return default

source['newcolumn'] = source.apply(lambda x: getReputation(lookup,x['X'],'Y',-1),axis=1)

This code works - but obviously it is BAD code and takes a horrible long time. I can multiprocess it - but the question remains. WHY is the merge failing ?

Thanks for your help Rgds

MaxU - stand with Ukraine · Accepted Answer

I'd use map() method in this case:

first set 'X' as index in the lookup DF:

In [58]: lookup.set_index('X', inplace=True)

In [59]: lookup
Out[59]:
                                                                   Y
X
000000000E000394574D69637264736F66742057696E646F7773204861726477   7
0000000080000000000000090099000000040005000000000000008F2A000010   7
000000020000000000000000777700010000000000020000000040C002004600  24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64   7

In [60]: df['Y'] = df.X.map(lookup.Y)

In [61]: df
Out[61]:
                                                                  X          Z   Y
0  000000000E000394574D69637264736F66742057696E646F7773204861726477       blah   7
1  0000000080000000000000090099000000040005000000000000008F2A000010  blah blah   7
2  000000020000000000000000777700010000000000020000000040C002004600      dummy  24

Actually your code is working properly for your sample DFs:

In [68]: df.merge(lookup, on='X', how='inner')
Out[68]:
                                                                  X          Z   Y
0  000000000E000394574D69637264736F66742057696E646F7773204861726477       blah   7
1  0000000080000000000000090099000000040005000000000000008F2A000010  blah blah   7
2  000000020000000000000000777700010000000000020000000040C002004600      dummy  24

So check whether you have the same data and dtypes in the X column in both DFs

Pandas merge giving wrong output

Answers (2)

Related Questions