Nagib
Nagib

Reputation: 55

Pandas: New column value based on the matching multi-level column's conditions

I have the following dataframe with multi-level columns

In [1]: data = {('A', '10'):[1,3,0,1],
                ('A', '20'):[3,2,0,0],
                ('A', '30'):[0,0,3,0],
                ('B', '10'):[3,0,0,0],
                ('B', '20'):[0,5,0,0],
                ('B', '30'):[0,0,1,0],
                ('C', '10'):[0,0,0,2],
                ('C', '20'):[1,0,0,0],
                ('C', '30'):[0,0,0,0]
                }
        df = pd.DataFrame(data)
        df
Out[1]:
   A        B        C
  10 20 30 10 20 30 10 20 30
0  1  3  0  3  0  0  0  1  0
1  3  2  0  0  5  0  0  0  0
2  0  0  3  0  0  1  0  0  0
3  1  0  0  0  0  0  2  0  0

In a new column results I want to return the combined column name containing the maximum value for each subset (i.e. second level column)

My desired output should look like the below

Out[2]:
   A        B        C
  10 20 30 10 20 30 10 20 30      results
0  1  3  0  3  0  0  0  1  0  A20&B10&C20
1  3  2  0  0  5  0  0  0  0      A10&B20
2  0  0  3  0  0  1  0  0  0      A30&B30
3  1  0  0  0  0  0  2  0  0      A10&C10

For example the first row:

For column 'A' the max value is under column '20' & for column 'B' there is only 1 value under '10' & for column 'C' also it is only one value under '20' & so the result would be A20&B10&C20

Edit: replacing "+" with "&" in the results column, apparently I was misunderstood and you guys thought I need the summation while I need to column names separated by a separator

Edit2: The solution provided by @A.B below didn't work for me for some reason. Although it is working on his side and for the sample data on google colab.

somehow the .idxmax(skipna = True) is causing a ValueError: No axis named 1 for object type Series

I found a workaround by transposing the data before this step, and then transposing it back after.

map_res = lambda x:  ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0]+x[a][1]) for a in x.keys()])))

df['results'] = df.replace(0, np.nan)\
                  .T\  # Transpose here
                  .groupby(level=0)\  # Remove (axis=1) from here
                  .idxmax(skipna = True)\
                  .T\  # Transpose back here
                  .apply(map_res,axis=1)

I am still interested to know why it was is not working without the transpose though?

Upvotes: 3

Views: 350

Answers (3)

jezrael
jezrael

Reputation: 862501

Idea is replace 0 by NaN, so if use DataFrame.stack all rows with NaNs are removed. Then get indices by DataFrameGroupBy.idxmax, mapping second and third tuple values by map and aggregate join to new column per indices - first level:

df['results'] = (df.replace(0, np.nan)
                   .stack([0,1])
                   .groupby(level=[0,1])
                   .idxmax()
                   .map(lambda x: f'{x[1]}{x[2]}')
                   .groupby(level=0)
                   .agg('&'.join))
print (df)
   A        B        C            results
  10 20 30 10 20 30 10 20 30             
0  1  3  0  3  0  0  0  1  0  A20&B10&C20
1  3  2  0  0  5  0  0  0  0      A10&B20
2  0  0  3  0  0  1  0  0  0      A30&B30
3  1  0  0  0  0  0  2  0  0      A10&C10

Upvotes: 1

A.B
A.B

Reputation: 20445

  • Group by level 0 and axis=1

  • You use idxmax to get max sub-level indexes as tuples (while skipping NaNs).

  • Apply function to rows (axix-1) to concat names

  • In function (that you apply to rows), Iterate on keys/columns and concatenate the column levels. Replace Nan (which have type 'float') with an empty string and filter them later.

You won't need df.replace(0, np.nan) if you initially have NaN and let them remain.

map_res = lambda x:  ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0]+x[a][1]) for a in x.keys()])))

df['results'] = df.replace(0, np.nan)\
                  .groupby(level=0, axis=1)\
                  .idxmax(skipna = True)\
                  .apply(map_res,axis=1)

Here's output

    A       B           C               results
10  20  30  10  20  30  10  20  30  
0   1   3   0   3   0   0   0   1   0   A20,B10,C20
1   3   2   0   0   5   0   0   0   0   A10,B20
2   0   0   3   0   0   1   0   0   0   A30,B30
3   1   0   0   0   0   0   2   0   0   A10,C10

Upvotes: 1

Andrej Kesely
Andrej Kesely

Reputation: 195408

Try:

df["results"] = df.groupby(level=0, axis=1).max().sum(1)
print(df)

Prints:

   A        B        C       results
  10 20 30 10 20 30 10 20 30        
0  1  3  0  3  0  0  0  1  0       7
1  3  2  0  0  5  0  0  0  0       8
2  0  0  3  0  0  1  0  0  0       4
3  1  0  0  0  0  0  2  0  0       3

Upvotes: 1

Related Questions