bers
bers

Reputation: 5773

How to specify hierarchical columns in Pandas merge?

After a serious misconception of how on works in join (spoiler: very different from on in merge), this is my example code.

import pandas as pd

index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)

index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)

print(df1.merge(df2, on="fruit", how="left"))

I get a KeyError. How do I correctly reference variables.fruit here?

To understand what I am after, consider the same problem without a multi index:

import pandas as pd

df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=["number", "fruit"])
df2 = pd.DataFrame([["banana", "yellow"]], columns=["fruit", "color"])

# this is obviously incorrect as it uses indexes on `df1` as well as `df2`:
print(df1.join(df2, rsuffix="_"))

# this is *also* incorrect, although I initially thought it should work, but it uses the index on `df2`:
print(df1.join(df2, on="fruit", rsuffix="_"))

# this is correct:
print(df1.merge(df2, on="fruit", how="left"))

The expected and wanted result is this:

  number   fruit   color
0    one   apple     NaN
1    two  banana  yellow

How do I get the same when fruit is part of a multi index?

Upvotes: 0

Views: 739

Answers (1)

James
James

Reputation: 36623

I think I understand what you are trying to accomplish now, and I don't think join is going to get you there. Both DataFrame.join and DataFrame.merge make a call to pandas.core.reshape.merge.merge, but using DataFrame.merge gives you more control over what defaults are applied.

In your case, you can use reference the columns to join on via a list of tuples, where the elements of the tuple are the levels of the multi-indexed columns. I.e. to use the variables / fruit column, you can pass [('variables', 'fruit')].

Using tuples is how you index into multi-index columns (and row indices). You need to wrap it in a list because merge operations can be performed using multiple columns, or multiple multi-indexed columns, like a JOIN statement in SQL. Passing a single string is just a convenience case that gets wrapped in a list for you.

Since you are only joining on 1 column, it is a list of a single tuple.

import pandas as pd

index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)

index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)

df1.merge(df2, how='left', on=[('variables', 'fruit')])
# returns:
  variables
     number   fruit   color
0       one   apple     NaN
1       two  banana  yellow

Upvotes: 1

Related Questions