Reputation: 1919
Ok I have gone through some blogs related to this topic - but I am still getting the same problem. I have two dataframes. Both have a column X which have SHA2 values in them. It contains hex strings.
Example (Dataframe lookup)
X,Y
000000000E000394574D69637264736F66742057696E646F7773204861726477,7
0000000080000000000000090099000000040005000000000000008F2A000010,7
000000020000000000000000777700010000000000020000000040C002004600,24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64,7
Example (Dataframe source)
X,Z
000000000E000394574D69637264736F66742057696E646F7773204861726477,'blah'
0000000080000000000000090099000000040005000000000000008F2A000010,'blah blah'
000000020000000000000000777700010000000000020000000040C002004600,'dummy'
etc.
So now I am doing
lookup['X'] = lookup['X'].astype(str)
source['X'] = source['X'].astype(str)
source['newcolumn'] = source.merge(lookup, on='X', how='inner')['Y']
The source has 160,000 rows and the lookup has around 500,000 rows.
Now, when the operation finishes, I get newcolumn but the values are wrong. I have made sure that they are not being picked up from duplicate values of X, because there are no duplicate X in either table.
So, this is really making me feel dumb and gave me quite a pain in my live systems. Can anyone suggest what is the problem ?
I have now replaced the call with
def getReputation(lookupDF,value,lookupcolumn,default):
lookupRows = lookupDF.loc[lookupDF['X']==value]
if lookupRows.shape[0]>0:
return lookupRows[lookupcolumn].values[0]
else:
return default
source['newcolumn'] = source.apply(lambda x: getReputation(lookup,x['X'],'Y',-1),axis=1)
This code works - but obviously it is BAD code and takes a horrible long time. I can multiprocess it - but the question remains. WHY is the merge failing ?
Thanks for your help Rgds
Upvotes: 1
Views: 2668
Reputation: 11
Probably you might have duplicate values on column X on the lookup data frame. It is due to the indexing of fields and the below snippet will produce the right results.
output = source.merge(lookup, on='X', how='inner')
In case if you want to create a new column, either it should not have any duplicates on the right df or the indexes needs to be adjusted accordingly. If you're sure there are no duplicate values, compare the indexes from the above snippet and your snippet for better understanding and try resetting the indexes before merging.
Upvotes: 1
Reputation: 210832
I'd use map() method in this case:
first set 'X' as index in the lookup
DF:
In [58]: lookup.set_index('X', inplace=True)
In [59]: lookup
Out[59]:
Y
X
000000000E000394574D69637264736F66742057696E646F7773204861726477 7
0000000080000000000000090099000000040005000000000000008F2A000010 7
000000020000000000000000777700010000000000020000000040C002004600 24
0000005BC614437F6BE049237FA1DDD2083B5BA43A10175E4377A59839DC2B64 7
In [60]: df['Y'] = df.X.map(lookup.Y)
In [61]: df
Out[61]:
X Z Y
0 000000000E000394574D69637264736F66742057696E646F7773204861726477 blah 7
1 0000000080000000000000090099000000040005000000000000008F2A000010 blah blah 7
2 000000020000000000000000777700010000000000020000000040C002004600 dummy 24
Actually your code is working properly for your sample DFs:
In [68]: df.merge(lookup, on='X', how='inner')
Out[68]:
X Z Y
0 000000000E000394574D69637264736F66742057696E646F7773204861726477 blah 7
1 0000000080000000000000090099000000040005000000000000008F2A000010 blah blah 7
2 000000020000000000000000777700010000000000020000000040C002004600 dummy 24
So check whether you have the same data and dtypes in the X
column in both DFs
Upvotes: 3