Run2
Run2

Reputation: 1919

Pandas merge giving wrong output

Ok I have gone through some blogs related to this topic - but I am still getting the same problem. I have two dataframes. Both have a column X which have SHA2 values in them. It contains hex strings.

Example (Dataframe lookup)

X,Y
000000000E000394574D69637264736F66742057696E646F7773204861726477,7
0000000080000000000000090099000000040005000000000000008F2A000010,7
000000020000000000000000777700010000000000020000000040C002004600,24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64,7

Example (Dataframe source)

X,Z
000000000E000394574D69637264736F66742057696E646F7773204861726477,'blah'
0000000080000000000000090099000000040005000000000000008F2A000010,'blah blah'
000000020000000000000000777700010000000000020000000040C002004600,'dummy'

etc.

So now I am doing

lookup['X'] = lookup['X'].astype(str)
source['X'] = source['X'].astype(str)
source['newcolumn'] = source.merge(lookup, on='X', how='inner')['Y']

The source has 160,000 rows and the lookup has around 500,000 rows.

Now, when the operation finishes, I get newcolumn but the values are wrong. I have made sure that they are not being picked up from duplicate values of X, because there are no duplicate X in either table.

So, this is really making me feel dumb and gave me quite a pain in my live systems. Can anyone suggest what is the problem ?

I have now replaced the call with

def getReputation(lookupDF,value,lookupcolumn,default):
    lookupRows = lookupDF.loc[lookupDF['X']==value]
    if lookupRows.shape[0]>0:
        return lookupRows[lookupcolumn].values[0]
    else:
        return default

source['newcolumn'] = source.apply(lambda x: getReputation(lookup,x['X'],'Y',-1),axis=1)

This code works - but obviously it is BAD code and takes a horrible long time. I can multiprocess it - but the question remains. WHY is the merge failing ?

Thanks for your help Rgds

Upvotes: 1

Views: 2668

Answers (2)

Ganesh Rajan
Ganesh Rajan

Reputation: 11

Probably you might have duplicate values on column X on the lookup data frame. It is due to the indexing of fields and the below snippet will produce the right results.

output = source.merge(lookup, on='X', how='inner')

In case if you want to create a new column, either it should not have any duplicates on the right df or the indexes needs to be adjusted accordingly. If you're sure there are no duplicate values, compare the indexes from the above snippet and your snippet for better understanding and try resetting the indexes before merging.

Upvotes: 1

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

I'd use map() method in this case:

first set 'X' as index in the lookup DF:

In [58]: lookup.set_index('X', inplace=True)

In [59]: lookup
Out[59]:
                                                                   Y
X
000000000E000394574D69637264736F66742057696E646F7773204861726477   7
0000000080000000000000090099000000040005000000000000008F2A000010   7
000000020000000000000000777700010000000000020000000040C002004600  24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64   7

In [60]: df['Y'] = df.X.map(lookup.Y)

In [61]: df
Out[61]:
                                                                  X          Z   Y
0  000000000E000394574D69637264736F66742057696E646F7773204861726477       blah   7
1  0000000080000000000000090099000000040005000000000000008F2A000010  blah blah   7
2  000000020000000000000000777700010000000000020000000040C002004600      dummy  24

Actually your code is working properly for your sample DFs:

In [68]: df.merge(lookup, on='X', how='inner')
Out[68]:
                                                                  X          Z   Y
0  000000000E000394574D69637264736F66742057696E646F7773204861726477       blah   7
1  0000000080000000000000090099000000040005000000000000008F2A000010  blah blah   7
2  000000020000000000000000777700010000000000020000000040C002004600      dummy  24

So check whether you have the same data and dtypes in the X column in both DFs

Upvotes: 3

Related Questions