john doe
john doe

Reputation: 435

How do I groupBy, sort and put row data also in new columns in Pandas Python

I have a Dataframe like this:

id1   id2    c1   c2
1      a1     5     3
1      a1     6     2
1      a1     9     4
2      b1     5     2
2      b1     23    43
1      a1     34    32
3      3a     234   435

Now I want to add eight new columns like this:

c1_max   c2_max   c1_secondMax   c2_secondMax   c1_thirdMax c2_thirdMax   c1_min   c2_min

I want to groupBy id1 and id2

I already added the c1_max c2_max c1_min c2_min columns like this:

features = ["c1","c2"]
maxData = all_data.groupby(['id1','id2'])[features].agg('max')
all_data = pd.merge(all_data, maxData.reset_index(), suffixes=["", "_max"], how='left', on=['id1', 'id2'])

minData= all_data.groupby(['id1','id2'])[features].agg('min')
all_data = pd.merge(all_data, minData.reset_index(), suffixes=["", "_max"], how='left', on=['id1', 'id2'])

But, how do I add _secondMax and _thirdMax. Basically, they are the second max and third max of those rows.

Upvotes: 0

Views: 49

Answers (2)

Scott Boston
Scott Boston

Reputation: 153460

Let's try this:

smax = lambda x: x[x.rank(ascending = False) == 2]
smax.__name__ = 'secondMax'

tmax = lambda x: x[x.rank(ascending = False) == 3]
tmax.__name__ = 'thirdMax'

df_out = df.groupby(['id1','id2']).agg(['max','min',smax, tmax])
df_out.columns = [f'{i}_{j}' for i,j in df_out.columns]
df_out = df.merge(df_out, on=['id1','id2'])
df_out = df_out.where(df_out.astype(bool))
df_out

Output:

   id1 id2   c1   c2  c1_max  c1_min c1_secondMax c1_thirdMax  c2_max  c2_min c2_secondMax c2_thirdMax
0    1  a1    5    3      34       5            9           6      32       2            4           3
1    1  a1    6    2      34       5            9           6      32       2            4           3
2    1  a1    9    4      34       5            9           6      32       2            4           3
3    1  a1   34   32      34       5            9           6      32       2            4           3
4    2  b1    5    2      23       5            5         NaN      43       2            2         NaN
5    2  b1   23   43      23       5            5         NaN      43       2            2         NaN
6    3  3a  234  435     234     234          NaN         NaN     435     435          NaN         NaN

Upvotes: 1

yatu
yatu

Reputation: 88236

In the same way you have obtained maxdata and mindata, you can get the nth minimums and maximums doing:

df.sort_values(['id1','id2','c1']).groupby(['id1','id2']).c1.nth(1)

      id1  id2
1    a1      6
2    b1     23
Name: c1, dtype: int64

So in this case you would be obtaining the second minimum for c1

Where df.sort_values(['id1','id2','c1']) will sort the values according id1, id2 and c1 respectively, which is equivalent to grouping by id1 and id2 and sorting c1:

   id1 id2   c1   c2
0    1  a1    5    3
1    1  a1    6    2
2    1  a1    9    4
5    1  a1   34   32
3    2  b1    5    2
4    2  b1   23   43
6    3  3a  234  435

Upvotes: 1

Related Questions