user7379562
user7379562

Reputation: 349

How to group rows by frequency?

I want to know how many people putted GRADE 1, 2, 3, 4 and 5 among the groups of people who made the evaluations 1 time, between 2 and 3 times, and more than 3 times. For instance, the group of people making the evaluations 1 time contains persons with ID 2 and 4. In this group there are totally one evaluation 5, and one evaluation 1.

df =

ID_PERSON    EVALUATION_GRADE
1            2
1            2
1            3
1            5
2            5
3            2
3            5
3            1
4            1 
5            2
5            1
5            1

The result should be this one:

result =

FREQUENCY_GROUP   GRADE_1   GRADE_2   GRADE_3   GRADE_4   GRADE_5
"1 time"          1         0         0         0         1
"2-3 times"       3         2         0         0         1
"> 3 times"       0         2         1         0         1

If I do this df.groupby(['EVALUATION_GRADE']).agg({'ID_PERSON': 'count'}).reset_index(), then I get total number of persons that putted 1, 2, 3, 4 and 5. However how can I split them into frequency groups?

Upvotes: 4

Views: 90

Answers (3)

roman
roman

Reputation: 117345

Well, basic idea could be as follows - use GroupBy.transform to get frequency groups and then pandas.crosstab. to pivot data:

>>> def worker(x):
       if len(x) == 1:
           return "1 time" 
       elif len(x) <=3 :
           return "2-3 times"
       else:
           return "> 3 times"
>>> df['FREQUENCY_GROUP'] = df.groupby('ID_PERSON').transform(worker)
>>> df
    ID_PERSON  EVALUATION_GRADE FREQUENCY_GROUP
0           1                 2       > 3 times
1           1                 2       > 3 times
2           1                 3       > 3 times
3           1                 5       > 3 times
4           2                 5          1 time
5           3                 2       2-3 times
6           3                 5       2-3 times
7           3                 1       2-3 times
8           4                 1          1 time
9           5                 2       2-3 times
10          5                 1       2-3 times
11          5                 1       2-3 times
>>> pd.crosstab(df['FREQUENCY_GROUP'], 'GRADE ' + df['EVALUATION_GRADE'].astype('str'))
EVALUATION_GRADE  GRADE 1  GRADE 2  GRADE 3  GRADE 5
FREQUENCY_GROUP                                     
1 time                  1        0        0        1
2-3 times               3        2        0        1
> 3 times               0        2        1        1

Upvotes: 1

Ted Petrou
Ted Petrou

Reputation: 61947

Here is an answer that should generalize to any number of grades or ids

d = {1: '1 time', 2:'2-3 times', 3:'2-3 times', 4:'> 3 times'} 

df['FREQUENCY_GROUP'] = df.groupby('ID_PERSON')['ID_PERSON']\
                          .transform('size')\
                          .clip_upper(4)\
                          .map(d)

df1 = df.pivot_table(index='FREQUENCY_GROUP', 
                     columns='EVALUATION_GRADE', 
                     values='ID_PERSON', 
                     aggfunc='count', 
                     fill_value=0)\
       .reindex(columns=range(df.EVALUATION_GRADE.min(), 
                              df.EVALUATION_GRADE.max() + 1), 
                fill_value=0)

df1.columns = 'GRADE_'  + df1.columns.astype(str)

output

                 GRADE_1  GRADE_2  GRADE_3  GRADE_4  GRADE_5
FREQUENCY_GROUP                                             
1 time                 1        0        0        0        1
2-3 times              3        2        0        0        1
> 3 times              0        2        1        0        1

Upvotes: 1

jezrael
jezrael

Reputation: 862501

You can use first transform with size first for find frequency, then cut for create bins and last groupby with size, reshape by unstack and fill missing columns by reindex:

df['FREQ'] = df.groupby('ID_PERSON')['EVALUATION_GRADE'].transform('size')
bins = [-np.inf, 1, 3, np.inf]
labels=['1 time','2-3 times','> 3 times']
df.FREQ = pd.cut(df.FREQ, bins=bins, labels=labels)

df = df.groupby(['FREQ', 'EVALUATION_GRADE'])['EVALUATION_GRADE'] \
       .size() \
       .unstack(fill_value=0) \
       .reindex(columns=np.arange(1,6), fill_value=0) 
df.columns = 'GRADE '  + df.columns.astype(str)
print (df)
           GRADE 1  GRADE 2  GRADE 3  GRADE 4  GRADE 5
FREQ                                                  
1 time           1        0        0        0        1
2-3 times        3        2        0        0        1
> 3 times        0        2        1        0        1

Upvotes: 3

Related Questions