haneulkim
haneulkim

Reputation: 4928

Full outer join on two multiindex dataframes (one with multi level columns) not working properly pandas

Using Python 3.8, pandas 1.1.2

I have two dateframes with multiindex

df1 (multi level column):

                       user      price
                       count     sum
name    date    hour 
  A      9/17    1       33       34
  A      9/17    2       66       55
  A      9/17    3       77       2
  A      9/17    4       88       1
 

df2:

                       seller_count
name    date    hour 
  A      9/17    1        100 
  A      9/17    15        66 

I am trying to do full outer join on two of them.

Desired output:

                       user      price
                       count     sum        seller_count
name    date    hour 
  A      9/17    1       33       34            100
  A      9/17    2       66       55            null
  A      9/17    3       77       2             null
  A      9/17    4       88       1             null
  A      9/17    15     Null     Null           66

I am trying to find out a way to do this without resetting indexes. Any help? Thanks!

solution from Pandas Dataframe Multiindex Merge does not seem to work, I am only able to get seller_count if it has same name, date,hour as df1.

df1.columns outputs:

MultiIndex([(          'user',    'count'),
            (           'price',    'sum')])

df2.columns outputs:

Index(["seller_count"])

Upvotes: 1

Views: 733

Answers (2)

jezrael
jezrael

Reputation: 863226

Setup:

print (df1.index)
MultiIndex([('A', '9/17', 1),
            ('A', '9/17', 2),
            ('A', '9/17', 3),
            ('A', '9/17', 4)],
           names=['name', 'date', 'hour'])

print (df1.columns)
MultiIndex([( 'user', 'count'),
            ('price',   'sum')],
           )

print (df2.index)
MultiIndex([('A', '9/17',  1),
            ('A', '9/17', 15)],
           names=['name', 'date', 'hour'])

print (df2.columns)
Index(['seller_count'], dtype='object')

First is necessary create MultiIndex in df2, then use merge with outer join:

df2.columns = pd.MultiIndex.from_product([[''], df2.columns])

print (df2.columns)
MultiIndex([('', 'seller_count')],
           )

df = df1.merge(df2, left_index=True, right_index=True, how="outer")
print (df)
                user price             
               count   sum seller_count
name date hour                         
A    9/17 1     33.0  34.0        100.0
          2     66.0  55.0          NaN
          3     77.0   2.0          NaN
          4     88.0   1.0          NaN
          15     NaN   NaN         66.0

df = df1.join(df2, how="outer")
print (df)
                user price             
               count   sum seller_count
name date hour                         
A    9/17 1     33.0  34.0        100.0
          2     66.0  55.0          NaN
          3     77.0   2.0          NaN
          4     88.0   1.0          NaN
          15     NaN   NaN         66.0

print (df.columns)
MultiIndex([( 'user',        'count'),
            ('price',          'sum'),
            (     '', 'seller_count')],
           )


print (df.index)
MultiIndex([('A', '9/17',  1),
            ('A', '9/17',  2),
            ('A', '9/17',  3),
            ('A', '9/17',  4),
            ('A', '9/17', 15)],
           names=['name', 'date', 'hour'])

Upvotes: 1

Valdi_Bo
Valdi_Bo

Reputation: 31011

I assume that column names in the index in df1 are of "single level". You can achieve it the following way:

  1. The source file contains:

    name,date,hour,user,price
     , , ,count,sum
    A,9/17,1,33,34
    A,9/17,2,66,55
    A,9/17,3,77,2
    A,9/17,4,88,1
    

    Note spaces as first 3 column names at the second level.

  2. Read the file executing:

    df1 = pd.read_csv('Input_1.csv', header=[0,1])
    df1 = df1.set_index([('name', ' '), ('date', ' '), ('hour', ' ')])\
        .rename_axis(index=['name', 'date', 'hour'])
    

This way "2-level" column names, after setting as the index, get single level names.

Another detail to note is that:

  • index column names in both DataFrames are of single level,
  • df1 has a MultiIndex on columns,
  • df2 has an ordinary (single level) index on columns,
  • the result should have MultiIndex on columns.

To perform the join, you have to start from adding a MultiIndex level to the column index in df2 (with a space as the top level):

df2.columns = pd.MultiIndex.from_product([[' '], df2.columns])

Then perform ordinary outer join:

result = df1.join(df2, how='outer')

The result is:

                user price             
               count   sum seller_count
name date hour                         
A    9/17 1     33.0  34.0        100.0
          2     66.0  55.0          NaN
          3     77.0   2.0          NaN
          4     88.0   1.0          NaN
          15     NaN   NaN         66.0

Upvotes: 0

Related Questions