Miszka_R

Reputation: 149

Filtering top n values in pandas

I have the following dataset:

ID   Group Name   Information
1    A            'Info type1'
1    A            'Info type2' 
2    B            'Info type2' 
2    B            'Info type3' 
2    B            'Info type4'
3    A            'Info type2' 
3    A            'Info type5'
3    A            'Info type2' 

Ultimately, I want to count how many items have been processed by a specific group, grouped by a specific Info type.

As a first step, I defined a function to classify the specific info type:

def checkrejcted(strval):
    if strval == 'Info type5':
        return 'Rejected'
    else:
        return 'Not rejected' 

Next, I applied this function to the Information column:

dataset['CheckRejected'] = dataset['Information'].apply(checkrejcted)

Lastly, I dropped the Information column and then dropped duplicates, so the dataset looks like:

ID   Group Name   CheckRejected
1    A            'Not rejected'
2    B            'Not rejected' 
3    A            'Not rejected'
3    A            'Rejected'

I am wondering whether there is a smarter way to count how often a specific group name occurs, grouped by Not rejected / Rejected. A given item can have both the Rejected and Not rejected information at the same time. This is fine, as I assume that within the countplot such an item will be counted for both.

Upvotes: 1

Views: 314

Answers (2)

Valdi_Bo

Reputation: 31011

You wrote that you want to count rows. So probably you need:

df.groupby(['Group Name', 'Information']).size()

For your sample data, the result is the following Series:

Group Name  Information
A           Info type1     1
            Info type2     3
            Info type5     1
B           Info type2     1
            Info type3     1
            Info type4     1
dtype: int64

Its MultiIndex contains the grouping key (both levels) and the value is just the number of occurrences.

Dropping duplicates does not do the job, as you lose the information about how many times each particular combination occurred.

Or, if you want to count only Rejected / Not Rejected cases:

  • map the Information column using your function, creating a new column, say Status,
  • group by Group Name and Status.

The code to do it is:

df['Status'] = df.Information.apply(checkrejcted)
df.groupby(['Group Name', 'Status']).size()
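Putting both steps together, a self-contained sketch (using the question's sample data with the quote characters dropped, as in the output above; the `unstack` step is an extra suggestion to pivot the Status counts into columns, which is convenient for plotting):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3, 3, 3],
    'Group Name': ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Information': ['Info type1', 'Info type2', 'Info type2', 'Info type3',
                    'Info type4', 'Info type2', 'Info type5', 'Info type2'],
})

# Derive the status without a helper function
df['Status'] = df['Information'].eq('Info type5').map(
    {True: 'Rejected', False: 'Not rejected'})

# Count per (Group Name, Status) pair, then pivot Status into columns
counts = df.groupby(['Group Name', 'Status']).size().unstack(fill_value=0)
print(counts)
```

`fill_value=0` ensures a group with no rejected rows (here, B) shows 0 instead of NaN.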

Upvotes: 1

oppressionslayer

Reputation: 7222

You could use map with a dictionary, and fillna as the default for non-matching values:

maps = {"'Info type5'": "'Rejected'"}

or

maps = {
    "'Info type1'": "'Not Rejected'",
    "'Info type2'": "'Not Rejected'",
    "'Info type3'": "'Not Rejected'",
    "'Info type4'": "'Not Rejected'",
    "'Info type5'": "'Rejected'",
}

df['Information'].map(maps).fillna("'Not Rejected'")

0    'Not Rejected'
1    'Not Rejected'
2    'Not Rejected'
3    'Not Rejected'
4    'Not Rejected'
5    'Not Rejected'
6        'Rejected'
7    'Not Rejected'

df['CheckRejected'] = df['Information'].map(maps).fillna("'Not Rejected'")

   ID Group Name   Information   CheckRejected
0   1          A  'Info type1'  'Not Rejected'
1   1          A  'Info type2'  'Not Rejected'
2   2          B  'Info type2'  'Not Rejected'
3   2          B  'Info type3'  'Not Rejected'
4   2          B  'Info type4'  'Not Rejected'
5   3          A  'Info type2'  'Not Rejected'
6   3          A  'Info type5'      'Rejected'
7   3          A  'Info type2'  'Not Rejected'

df.drop(columns='Information').drop_duplicates()

   ID Group Name   CheckRejected
0   1          A  'Not Rejected'
2   2          B  'Not Rejected'
5   3          A  'Not Rejected'
6   3          A      'Rejected'
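Continuing this approach, the counting the question asks for can then be sketched on the deduplicated frame (sample data reconstructed from the question, values including their quote characters; `maps` is the one-entry dictionary above):

```python
import pandas as pd

# Sample data from the question (Information values include the quotes)
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3, 3, 3],
    'Group Name': ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Information': ["'Info type1'", "'Info type2'", "'Info type2'",
                    "'Info type3'", "'Info type4'", "'Info type2'",
                    "'Info type5'", "'Info type2'"],
})

# Map only the rejected type; everything else falls through to the default
maps = {"'Info type5'": "'Rejected'"}
df['CheckRejected'] = df['Information'].map(maps).fillna("'Not Rejected'")

# Deduplicate as in the answer, then count (Group Name, CheckRejected) pairs
dedup = df.drop(columns='Information').drop_duplicates()
counts = dedup.groupby(['Group Name', 'CheckRejected']).size()
print(counts)
```

Note that counting after drop_duplicates counts distinct (ID, Group Name, status) combinations, not raw rows, which matches how the question's final table is built.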

Upvotes: 1
