Reputation: 147
I have the following data:
df1 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Play', 1: [1, 2, 3, 4, 5]}).set_index(['Room', 'User', 'Task'])
df2 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Play', 2: [1, 2, 3, 4, 5]}).set_index(['Room', 'User', 'Task'])
df3 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Clean', 1: [6, 7, 8, 9, 10]}).set_index(['Room', 'User', 'Task'])
df4 = pd.DataFrame({'Room': [1, 2, 3, 5, 8], 'User': 'Martin', 'Task': 'Clean', 2: [6, 7, 8, 9, 10]}).set_index(['Room', 'User', 'Task'])
df = pd.concat([df1, df2, df3, df4]).sort_index()
And the output result looks like:
I wonder why the multi-index has a duplicate entry for each column there is.
I expected and want the output format to be like this, where all the multi-index keys only occur once and all NaN
values are gone:
This would significantly reduce the size of my dataframe and also later on the storage size on the phy. drive.
Upvotes: 0
Views: 47
Reputation: 862521
If is possible sum values:
df = df.sum(level=[0,1,2])
#alternative
#df = df.groupby(level=[0,1,2]).sum()
print (df)
1 2
Room User Task
1 Martin Clean 6.0 6.0
Play 1.0 1.0
2 Martin Clean 7.0 7.0
Play 2.0 2.0
3 Martin Clean 8.0 8.0
Play 3.0 3.0
5 Martin Clean 9.0 9.0
Play 4.0 4.0
8 Martin Clean 10.0 10.0
Play 5.0 5.0
If possible get only first non missing value:
df = df.groupby(level=[0,1,2], sort=False).first()
print (df)
1 2
Room User Task
1 Martin Clean 6.0 6.0
Play 1.0 1.0
2 Martin Clean 7.0 7.0
Play 2.0 2.0
3 Martin Clean 8.0 8.0
Play 3.0 3.0
5 Martin Clean 9.0 9.0
Play 4.0 4.0
8 Martin Clean 10.0 10.0
Play 5.0 5.0
Upvotes: 1