Reputation: 55
I have the following dataframe with multi-level columns
In [1]: data = {('A', '10'):[1,3,0,1],
('A', '20'):[3,2,0,0],
('A', '30'):[0,0,3,0],
('B', '10'):[3,0,0,0],
('B', '20'):[0,5,0,0],
('B', '30'):[0,0,1,0],
('C', '10'):[0,0,0,2],
('C', '20'):[1,0,0,0],
('C', '30'):[0,0,0,0]
}
df = pd.DataFrame(data)
df
Out[1]:
A B C
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0
1 3 2 0 0 5 0 0 0 0
2 0 0 3 0 0 1 0 0 0
3 1 0 0 0 0 0 2 0 0
In a new column results
I want to return the combined column name containing the maximum value for each subset (i.e. second level column)
My desired output should look like the below
Out[2]:
A B C
10 20 30 10 20 30 10 20 30 results
0 1 3 0 3 0 0 0 1 0 A20&B10&C20
1 3 2 0 0 5 0 0 0 0 A10&B20
2 0 0 3 0 0 1 0 0 0 A30&B30
3 1 0 0 0 0 0 2 0 0 A10&C10
For example the first row:
For column 'A' the max value is under column '20' &
for column 'B' there is only 1 value under '10' &
for column 'C' also it is only one value under '20' &
so the result would be A20&B10&C20
Edit: replacing "+" with "&" in the results
column, apparently I was misunderstood and you guys thought I need the summation while I need to column names separated by a separator
Edit2: The solution provided by @A.B below didn't work for me for some reason. Although it is working on his side and for the sample data on google colab.
somehow the .idxmax(skipna = True)
is causing a ValueError: No axis named 1 for object type Series
I found a workaround by transposing the data before this step, and then transposing it back after.
map_res = lambda x: ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0]+x[a][1]) for a in x.keys()])))
df['results'] = df.replace(0, np.nan)\
.T\ # Transpose here
.groupby(level=0)\ # Remove (axis=1) from here
.idxmax(skipna = True)\
.T\ # Transpose back here
.apply(map_res,axis=1)
I am still interested to know why it was is not working without the transpose though?
Upvotes: 3
Views: 350
Reputation: 862501
Idea is replace 0
by NaN
, so if use DataFrame.stack
all rows with NaN
s are removed. Then get indices by DataFrameGroupBy.idxmax
, mapping second and third tuple values by map
and aggregate join
to new column per indices - first level:
df['results'] = (df.replace(0, np.nan)
.stack([0,1])
.groupby(level=[0,1])
.idxmax()
.map(lambda x: f'{x[1]}{x[2]}')
.groupby(level=0)
.agg('&'.join))
print (df)
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 A20&B10&C20
1 3 2 0 0 5 0 0 0 0 A10&B20
2 0 0 3 0 0 1 0 0 0 A30&B30
3 1 0 0 0 0 0 2 0 0 A10&C10
Upvotes: 1
Reputation: 20445
Group by level 0 and axis=1
You use idxmax to get max sub-level indexes as tuples (while skipping NaNs).
Apply function to rows (axix-1) to concat names
In function (that you apply to rows), Iterate on keys/columns and concatenate the column levels. Replace Nan (which have type 'float') with an empty string and filter them later.
You won't need df.replace(0, np.nan) if you initially have NaN and let them remain.
map_res = lambda x: ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0]+x[a][1]) for a in x.keys()])))
df['results'] = df.replace(0, np.nan)\
.groupby(level=0, axis=1)\
.idxmax(skipna = True)\
.apply(map_res,axis=1)
Here's output
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 A20,B10,C20
1 3 2 0 0 5 0 0 0 0 A10,B20
2 0 0 3 0 0 1 0 0 0 A30,B30
3 1 0 0 0 0 0 2 0 0 A10,C10
Upvotes: 1
Reputation: 195408
Try:
df["results"] = df.groupby(level=0, axis=1).max().sum(1)
print(df)
Prints:
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 7
1 3 2 0 0 5 0 0 0 0 8
2 0 0 3 0 0 1 0 0 0 4
3 1 0 0 0 0 0 2 0 0 3
Upvotes: 1