Reputation: 1135
Why does inner join work so strange in pandas?
For example:
import pandas as pd
import io
t1 = ('key,col1\n'
'1,a\n'
'2,b\n'
'3,c\n'
'4,d')
t2 = ('key,col2\n'
'1,e\n'
'2,f\n'
'3,g\n'
'4,h')
df1 = pd.read_csv(io.StringIO(t1), header=0)
df2 = pd.read_csv(io.StringIO(t2), header=0)
print(df1)
print()
print(df2)
print()
print(df2.join(df1, on='key', how='inner', lsuffix='_l'))
Outputs:
key col1
0 1 a
1 2 b
2 3 c
3 4 d
key col2
0 1 e
1 2 f
2 3 g
3 4 h
key_l col2 key col1
0 1 e 2 b
1 2 f 3 c
2 3 g 4 d
If I don't specify lsuffix
, it says
ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')
Does this function work differently from SQL's JOIN? Why does it want to create an extra 'key' column with a suffix? Why are there only 3 rows? I expected it to output something like this:
key col1 col2
0 1 a e
1 2 b f
2 3 c g
3 4 d h
Upvotes: 3
Views: 613
Reputation: 294258
First things first:
What you wanted was merge
df1.merge(df2)
join
defaults to merging on the index
. You can specify the on
parameter which only says which column from left side to match with the index of the right side.
These might help illustrate
df1.set_index('key').join(df2.set_index('key'))
df1.join(df2.set_index('key'), on='key')
Your example is matching the index of df2
which looks like [0, 1, 2, 3]
with the key
column of df1
which looks like [1, 2, 3, 4]
That's why you get NaN
in col2
when key_l
is 4
df1.join(df2, on='key', lsuffix='_l', how='outer')
Upvotes: 3