Explain how pandas DataFrame join works

Question

Why does inner join work so strange in pandas?

For example:

import pandas as pd
import io

t1 = ('key,col1
'
      '1,a
'
      '2,b
'
      '3,c
'
      '4,d')

t2 = ('key,col2
'
      '1,e
'
      '2,f
'
      '3,g
'
      '4,h')


df1 = pd.read_csv(io.StringIO(t1), header=0)
df2 = pd.read_csv(io.StringIO(t2), header=0)

print(df1)
print()
print(df2)
print()
print(df2.join(df1, on='key', how='inner', lsuffix='_l'))

Outputs:

   key col1
0    1    a
1    2    b
2    3    c
3    4    d

   key col2
0    1    e
1    2    f
2    3    g
3    4    h

   key_l col2  key col1
0      1    e    2    b
1      2    f    3    c
2      3    g    4    d

If I don't specify lsuffix, it says

ValueError: columns overlap but no suffix specified: Index(['key'], dtype='object')

Does this function work differently from SQL's JOIN? Why does it want to create an extra 'key' column with a suffix? Why are there only 3 rows? I expected it to output something like this:

   key col1 col2
0    1    a    e
1    2    b    f
2    3    c    g
3    4    d    h

piRSquared · Accepted Answer

First things first:
What you wanted was merge

df1.merge(df2)

join defaults to merging on the index. You can specify the on parameter which only says which column from left side to match with the index of the right side.

These might help illustrate

df1.set_index('key').join(df2.set_index('key'))

df1.join(df2.set_index('key'), on='key')

Your example is matching the index of df2 which looks like [0, 1, 2, 3] with the key column of df1 which looks like [1, 2, 3, 4]
That's why you get NaN in col2 when key_l is 4

df1.join(df2, on='key', lsuffix='_l', how='outer')

Explain how pandas DataFrame join works

Answers (1)

Related Questions