Reputation: 11575
I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:
>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')
I try to merge them using this:
>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')
The result of the inner join is empty, which first prompted me that there might not be any entries in the intersection:
>>> merge_res.shape
(0, 19)
But when I try to match a single element, I see this really odd behavior.
# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'
# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result
# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!
So, the columns are defined with the 'object' dtype. Searching for them as strings don't yield any results. Searching for them as integers does return a result, and I think this is the reason why the merge doesn't work above..
Any ideas what's going on?
It's almost as thought Pandas converts df1.col1
to an integer just because it can, even though it should be treated as a string while matching.
(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)
Upvotes: 49
Views: 113470
Reputation: 1
This step:
df1.col1 = df1.col1.str.strip()
df1.col1 = df1.col1.str.strip()
and then doing this step:
pd.merge(df1.assign(x=df1.x.astype(str)),
df2.assign(x=df2.x.astype(str)),
how='left', on='x')
worked for me. I meant BOTH together.
Upvotes: 0
Reputation: 6097
This answer solved it for me:
pd.merge(df1.assign(x=df1.x.astype(str)),
df2.assign(x=df2.x.astype(str)),
how='left', on='x')
from Pandas merge issue on key of object type containing number and string values
Upvotes: 3
Reputation: 381
Probably your columns have some difference or a blank space that is resulting on this error.
First check your column type and if there is any different between items
df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')
If there is any difference on them, you can use
df1.col1 = df1.col1.str.replace("this", "for that")
or if there is any blank space
df1.col1 = df1.col1.apply(str).str.strip()
# This apply(str) is being used because without it, the program returns an error related to being enable to convert from byte.
Upvotes: 2
Reputation: 41
None of the above solutions worked for me because the merge was actually done correctly but the indexing got messed up. Removing the index solved it for me:
df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values
Upvotes: 4
Reputation: 2718
Thanks,@seeiespi the ..str.encode('utf-8') has helped me to figure out that my string needs to be stripped, as below
20 b'Belize ' ... 0,612
21 b'Benin ' ... 0,546
The solution was to use the strip
df1.col1 = df1.col1.str.strip()
df1.col1 = df1.col1.str.strip()
Upvotes: 16
Reputation: 3828
I ran into a case where the df.col = df.col.astype(str)
solution did not work. Turns out the problem was in the encoding.
My original data looked like this:
In [72]: df1['col1'][:3]
Out[73]:
col1
0 dustin pedroia
1 kevin youkilis
2 david ortiz
In [72]: df2['col2'][:3]
Out[73]:
col2
0 dustin pedroia
1 kevin youkilis
2 david ortiz
And after using .astype(str)
the merge still wasn't working so I executed the following:
df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')
and was able to find the difference:
In [95]: df1
Out[95]:
col1
0 b'dustin\xc2\xa0pedroia'
1 b'kevin\xc2\xa0youkilis'
2 b'david\xc2\xa0ortiz'
In [95]: df2
Out[95]:
col2
0 b'dustin pedroia'
1 b'kevin youkilis'
2 b'david ortiz'
At which point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0',' ')
on the decoded df1.col1 variable (i.e. before running .str.encode('utf-8')
) and the merge worked perfectly.
NOTE: Regardless of what I was replacing I always used .str.encode('utf-8')
to check whether it worked.
Alternatively
Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda I found the following difference.
import re
#places the raw string into a list
df1.col1 = df1.col1.apply(lambda x: re.findall(x, x))
df2.col2 = df2.col2.apply(lambda x: re.findall(x, x))
where my df1 data turned into this (copied and pasted from Spyder):
['dustin\xa0pedroia']
['kevin\xa0youkilis']
['david\xa0ortiz']
which just has a slightly different solution. I don't know in what case the first example wouldn't work and the second would but I wanted to provide both just in case someone runs into it :)
Upvotes: 28
Reputation: 11575
The issue was that the object
dtype is misleading. I thought it mean that all items were strings. But apparently, while reading the file pandas was converting some elements to ints, and leaving the remainders as strings.
The solution was to make sure that every field is a string:
>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)
Then the merge works as expected.
(I wish there was a way of specifying a dtype
of str
...)
Upvotes: 77