Lisa
Lisa

Reputation: 4426

Groupby DataFrame by its rank/percentile

Would you help me to come up a better solution for the problem as follows: For each date(in columns), I have values. I rank them and assign into three groups. My goal is to group the value by Low, Mid, Top group, and compute the group mean. I put the DataFrame and my own solution as follows. Can anyone suggest a better solution? My real data has 10000 rows and 300 columns. Any apply or lambda way to do it more directly? Thank you.

value=DataFrame({'1/1/2000':[1,4,5,6,5],
             '1/1/2001':[3,7,8,9,4],
             '1/1/2002':[7,8,9,4,5]}, index=list('ABCDE'))
rank=value.apply(lambda x: pd.qcut(x, 3, labels=['low','mid','top']))
df=pd.DataFrame()
df['Value']=value.stack()
df['Rank']=rank.stack()
df1=df.reset_index()
df1.groupby(['Date','Rank']).mean()

enter image description here

enter image description here enter image description here

Upvotes: 2

Views: 824

Answers (1)

jezrael
jezrael

Reputation: 862671

You can use only one stack and then pd.qcut only for one column Value instead all DataFrame:

df = value.stack()
          .reset_index(name='Value')
          .rename(columns={'level_0':'Type','level_1':'Date'})

df['Rank'] = pd.qcut(df.Value, 3, labels=['low','mid','top'])
print (df)

   Type      Date  Value Rank
0     A  1/1/2000      1  low
1     A  1/1/2001      3  low
2     A  1/1/2002      7  mid
3     B  1/1/2000      4  low
4     B  1/1/2001      7  mid
5     B  1/1/2002      8  top
6     C  1/1/2000      5  mid
7     C  1/1/2001      8  top
8     C  1/1/2002      9  top
9     D  1/1/2000      6  mid
10    D  1/1/2001      9  top
11    D  1/1/2002      4  low
12    E  1/1/2000      5  mid
13    E  1/1/2001      4  low
14    E  1/1/2002      5  mid
print (df.groupby(['Date','Rank'])['Value'].mean())
Date      Rank
1/1/2000  low     2.500000
          mid     5.333333
1/1/2001  low     3.500000
          mid     7.000000
          top     8.500000
1/1/2002  low     4.000000
          mid     6.000000
          top     8.500000
Name: Value, dtype: float64

print (df.groupby(['Date','Rank'])['Value'].mean().reset_index(name='Value'))
       Date Rank     Value
0  1/1/2000  low  2.500000
1  1/1/2000  mid  5.333333
2  1/1/2001  low  3.500000
3  1/1/2001  mid  7.000000
4  1/1/2001  top  8.500000
5  1/1/2002  low  4.000000
6  1/1/2002  mid  6.000000
7  1/1/2002  top  8.500000

EDIT:

You can also omit creating new column Rank and pass it to groupby:

df = value.stack()
          .reset_index(name='Rank')
          .rename(columns={'level_0':'Type','level_1':'Date'})

print (df)
   Type      Date  Rank
0     A  1/1/2000     1
1     A  1/1/2001     3
2     A  1/1/2002     7
3     B  1/1/2000     4
4     B  1/1/2001     7
5     B  1/1/2002     8
6     C  1/1/2000     5
7     C  1/1/2001     8
8     C  1/1/2002     9
9     D  1/1/2000     6
10    D  1/1/2001     9
11    D  1/1/2002     4
12    E  1/1/2000     5
13    E  1/1/2001     4
14    E  1/1/2002     5

print (df['Rank'].groupby([df.Date, 
                           pd.qcut(df.Rank, 3, labels=['low','mid','top'])])
                 .mean().reset_index(name='Value'))

       Date Rank     Value
0  1/1/2000  low  2.500000
1  1/1/2000  mid  5.333333
2  1/1/2001  low  3.500000
3  1/1/2001  mid  7.000000
4  1/1/2001  top  8.500000
5  1/1/2002  low  4.000000
6  1/1/2002  mid  6.000000
7  1/1/2002  top  8.500000

Upvotes: 1

Related Questions