Reputation: 5773
After a serious misconception of how on
works in join
(spoiler: very different from on
in merge
), this is my example code.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
print(df1.merge(df2, on="fruit", how="left"))
I get a KeyError
. How do I correctly reference variables.fruit
here?
To understand what I am after, consider the same problem without a multi index:
import pandas as pd
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=["number", "fruit"])
df2 = pd.DataFrame([["banana", "yellow"]], columns=["fruit", "color"])
# this is obviously incorrect as it uses indexes on `df1` as well as `df2`:
print(df1.join(df2, rsuffix="_"))
# this is *also* incorrect, although I initially thought it should work, but it uses the index on `df2`:
print(df1.join(df2, on="fruit", rsuffix="_"))
# this is correct:
print(df1.merge(df2, on="fruit", how="left"))
The expected and wanted result is this:
number fruit color
0 one apple NaN
1 two banana yellow
How do I get the same when fruit
is part of a multi index?
Upvotes: 0
Views: 739
Reputation: 36623
I think I understand what you are trying to accomplish now, and I don't think join
is going to get you there. Both DataFrame.join
and DataFrame.merge
make a call to pandas.core.reshape.merge.merge
, but using DataFrame.merge
gives you more control over what defaults are applied.
In your case, you can use reference the columns to join on via a list of tuples, where the elements of the tuple are the levels of the multi-indexed columns. I.e. to use the variables / fruit
column, you can pass [('variables', 'fruit')]
.
Using tuples is how you index into multi-index columns (and row indices). You need to wrap it in a list because merge operations can be performed using multiple columns, or multiple multi-indexed columns, like a JOIN statement in SQL. Passing a single string is just a convenience case that gets wrapped in a list for you.
Since you are only joining on 1 column, it is a list of a single tuple.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
df1.merge(df2, how='left', on=[('variables', 'fruit')])
# returns:
variables
number fruit color
0 one apple NaN
1 two banana yellow
Upvotes: 1