Reputation: 535
If I try this (with two different vintages of pandas
, one in Python 2, another in Python 3)
import pandas as pd
x = pd.DataFrame({"id": [1, 2,3], "value1": [5,5,5]})
y = pd.DataFrame({"id": [1], "value2": [10]})
z1 = x.join(y, on = "id")
z2 = x.join(y, on = "id", lsuffix = "_left", rsuffix = "_right")
z3 = x.join(y, lsuffix = "_left", rsuffix = "_right")
the first join fails with a ValueError
, the second one does not break but y
is not matched, and only the third join produces the expected result, i.e. y
's row matched up to x
.
The doc for join
says
on : name, tuple/list of names, or array-like Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.
Is this (i.e. what happens with z2
) a bug, or does it somehow make sense?
Upvotes: 2
Views: 123
Reputation: 881037
df.join(...)
is usually used to join on the index of df
with the index of another DataFrame.
df.join(..., on='id')
joins the id
column of df
with the index of another DataFrame.
Per the docs (my emphasis):
on : name, tuple/list of names, or array-like
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
Since x
and y
look like this:
In [14]: x
Out[14]:
id value1
0 1 5
1 2 5
2 3 5
In [15]: y
Out[15]:
id value2
0 1 10
x.join(y, on='id')
attempts to join x['id']
(with values 1, 2, 3
) with y.index
(with value 0
). Since x['id']
and y.index
have no values in common, the (by default) left join produces NaNs for the values in the new y
-columns produced by the join.
z1 = x.join(y, on = "id")
raises
ValueError: columns overlap but no suffix specified: Index(['id'], dtype='object')
because the y
-columns produced by the join include id
which is already a
x
-column name. When column names overlap, you must specify a lsuffix
,
rsuffix
, or both to disambiguate the column names.
z2 = x.join(y, on = "id", lsuffix = "_left", rsuffix = "_right")
returns
In [12]: z2
Out[12]:
id_left value1 id_right value2
0 1 5 NaN NaN
1 2 5 NaN NaN
2 3 5 NaN NaN
because the common x
and y
-column (i.e. the id
column) has been disambiguated.
The NaN
values are due to x['id']
and y.index
having no values in common (as explained above).
z3 = x.join(y, lsuffix = "_left", rsuffix = "_right")
produces
In [20]: z3
Out[20]:
id_left value1 id_right value2
0 1 5 1.0 10.0
1 2 5 NaN NaN
2 3 5 NaN NaN
because now the join is being performed on x.index
and y.index
.
Upvotes: 1