Dimitri Shvorob
Dimitri Shvorob

Reputation: 535

Pandas "join" oddity

If I try this (with two different vintages of pandas, one in Python 2, another in Python 3)

import pandas as pd
x = pd.DataFrame({"id": [1, 2,3], "value1": [5,5,5]})
y = pd.DataFrame({"id": [1], "value2": [10]})

z1 = x.join(y, on = "id")
z2 = x.join(y, on = "id", lsuffix = "_left", rsuffix = "_right")
z3 = x.join(y, lsuffix = "_left", rsuffix = "_right")

the first join fails with a ValueError, the second one does not break but y is not matched, and only the third join produces the expected result, i.e. y's row matched up to x.

The doc for join says

on : name, tuple/list of names, or array-like Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

Is this (i.e. what happens with z2) a bug, or does it somehow make sense?

Upvotes: 2

Views: 123

Answers (1)

unutbu
unutbu

Reputation: 881037

df.join(...) is usually used to join on the index of df with the index of another DataFrame.

df.join(..., on='id') joins the id column of df with the index of another DataFrame. Per the docs (my emphasis):

on : name, tuple/list of names, or array-like

Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation

Since x and y look like this:

In [14]: x
Out[14]: 
   id  value1
0   1       5
1   2       5
2   3       5

In [15]: y
Out[15]: 
   id  value2
0   1      10

x.join(y, on='id') attempts to join x['id'] (with values 1, 2, 3) with y.index (with value 0). Since x['id'] and y.index have no values in common, the (by default) left join produces NaNs for the values in the new y-columns produced by the join.


z1 = x.join(y, on = "id") raises

ValueError: columns overlap but no suffix specified: Index(['id'], dtype='object')

because the y-columns produced by the join include id which is already a x-column name. When column names overlap, you must specify a lsuffix, rsuffix, or both to disambiguate the column names.


z2 = x.join(y, on = "id", lsuffix = "_left", rsuffix = "_right") returns

In [12]: z2
Out[12]: 
   id_left  value1  id_right  value2
0        1       5       NaN     NaN
1        2       5       NaN     NaN
2        3       5       NaN     NaN

because the common x and y-column (i.e. the id column) has been disambiguated. The NaN values are due to x['id'] and y.index having no values in common (as explained above).


z3 = x.join(y, lsuffix = "_left", rsuffix = "_right") produces

In [20]: z3
Out[20]: 
   id_left  value1  id_right  value2
0        1       5       1.0    10.0
1        2       5       NaN     NaN
2        3       5       NaN     NaN

because now the join is being performed on x.index and y.index.

Upvotes: 1

Related Questions