Reputation: 4928
Using Python 3.8, pandas 1.1.2
I have two dateframes with multiindex
df1 (multi level column):
user price
count sum
name date hour
A 9/17 1 33 34
A 9/17 2 66 55
A 9/17 3 77 2
A 9/17 4 88 1
df2:
seller_count
name date hour
A 9/17 1 100
A 9/17 15 66
I am trying to do full outer join on two of them.
Desired output:
user price
count sum seller_count
name date hour
A 9/17 1 33 34 100
A 9/17 2 66 55 null
A 9/17 3 77 2 null
A 9/17 4 88 1 null
A 9/17 15 Null Null 66
I am trying to find out a way to do this without resetting indexes. Any help? Thanks!
solution from Pandas Dataframe Multiindex Merge does not seem to work, I am only able to get seller_count
if it has same name, date,hour
as df1.
df1.columns outputs:
MultiIndex([( 'user', 'count'),
( 'price', 'sum')])
df2.columns outputs:
Index(["seller_count"])
Upvotes: 1
Views: 733
Reputation: 863226
Setup:
print (df1.index)
MultiIndex([('A', '9/17', 1),
('A', '9/17', 2),
('A', '9/17', 3),
('A', '9/17', 4)],
names=['name', 'date', 'hour'])
print (df1.columns)
MultiIndex([( 'user', 'count'),
('price', 'sum')],
)
print (df2.index)
MultiIndex([('A', '9/17', 1),
('A', '9/17', 15)],
names=['name', 'date', 'hour'])
print (df2.columns)
Index(['seller_count'], dtype='object')
First is necessary create MultiIndex
in df2
, then use merge with outer join
:
df2.columns = pd.MultiIndex.from_product([[''], df2.columns])
print (df2.columns)
MultiIndex([('', 'seller_count')],
)
df = df1.merge(df2, left_index=True, right_index=True, how="outer")
print (df)
user price
count sum seller_count
name date hour
A 9/17 1 33.0 34.0 100.0
2 66.0 55.0 NaN
3 77.0 2.0 NaN
4 88.0 1.0 NaN
15 NaN NaN 66.0
df = df1.join(df2, how="outer")
print (df)
user price
count sum seller_count
name date hour
A 9/17 1 33.0 34.0 100.0
2 66.0 55.0 NaN
3 77.0 2.0 NaN
4 88.0 1.0 NaN
15 NaN NaN 66.0
print (df.columns)
MultiIndex([( 'user', 'count'),
('price', 'sum'),
( '', 'seller_count')],
)
print (df.index)
MultiIndex([('A', '9/17', 1),
('A', '9/17', 2),
('A', '9/17', 3),
('A', '9/17', 4),
('A', '9/17', 15)],
names=['name', 'date', 'hour'])
Upvotes: 1
Reputation: 31011
I assume that column names in the index in df1 are of "single level". You can achieve it the following way:
The source file contains:
name,date,hour,user,price
, , ,count,sum
A,9/17,1,33,34
A,9/17,2,66,55
A,9/17,3,77,2
A,9/17,4,88,1
Note spaces as first 3 column names at the second level.
Read the file executing:
df1 = pd.read_csv('Input_1.csv', header=[0,1])
df1 = df1.set_index([('name', ' '), ('date', ' '), ('hour', ' ')])\
.rename_axis(index=['name', 'date', 'hour'])
This way "2-level" column names, after setting as the index, get single level names.
Another detail to note is that:
To perform the join, you have to start from adding a MultiIndex level to the column index in df2 (with a space as the top level):
df2.columns = pd.MultiIndex.from_product([[' '], df2.columns])
Then perform ordinary outer join:
result = df1.join(df2, how='outer')
The result is:
user price
count sum seller_count
name date hour
A 9/17 1 33.0 34.0 100.0
2 66.0 55.0 NaN
3 77.0 2.0 NaN
4 88.0 1.0 NaN
15 NaN NaN 66.0
Upvotes: 0