Pandas : Getting unique rows for a given column but conditional on some criteria of other columns

I'm using Python 2.7. I have data created as follows:

data = pd.DataFrame({'id':['001','001','001','002','002','003','003','003','004','005'],
                   'status':['ground','unknown','air','ground','unknown','ground','unknown','unknown','unknown','ground'],
                   'value':[10,-5,12,20,-12,2,-4,-1,0,6]})

The data looks like this:

id     status     value
001    ground       10
001    unknown      -5
001    air          12
002    ground       20
002    unknown     -12
003    ground        2
003    unknown      -4
003    unknown      -1
004    unknown       0
005    ground        6

I would like to get an output dataframe with one row per unique id, built according to the following criteria. For a given id:

 'status': If 'air' exists, pick 'air'.
           If 'air' does not exist, pick 'ground'.
           If neither 'air' nor 'ground' exists, pick 'unknown'.

 'value': Sum of values for each id
 'count': Count the number of rows for each id

Therefore, the expected output is the following.

id     status     value     count
001     air        17        3
002    ground       8        2
003    ground      -3        3
004    unknown      0        1
005    ground       6        1

I can loop over each unique id, but that is not elegant and the computation becomes expensive as the data grows. Is there a more Pythonic and efficient way to produce this output? Thank you in advance.
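
For context, here is a simplified sketch of the kind of per-id loop I would like to replace (the names are placeholders, not my exact code):

rows = []
for uid in data['id'].unique():
    sub = data[data['id'] == uid]
    if (sub['status'] == 'air').any():
        status = 'air'
    elif (sub['status'] == 'ground').any():
        status = 'ground'
    else:
        status = 'unknown'
    rows.append({'id': uid, 'status': status,
                 'value': sub['value'].sum(), 'count': len(sub)})
expected = pd.DataFrame(rows, columns=['id', 'status', 'value', 'count'])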

Upvotes: 0

Views: 1570

Answers (2)

Little Bobby Tables

Reputation: 4744

You want to use groupby on id. This is easy for value and count, but trickier for status: we need to write our own function that takes a pandas Series and returns a single value.

def group_status(x):
    # Precedence: 'air' beats 'ground', which beats 'unknown'
    if (x == 'air').any():
        y = 'air'
    elif (x == 'ground').any():
        y = 'ground'
    else:
        y = 'unknown'
    return y

data = data.groupby(by='id').agg({'value': ['sum', 'count'], 'status': [group_status]})
data.columns = data.columns.droplevel(0)  # keep only the aggregation names
data = data.rename(columns={'group_status': 'status',
                            'sum': 'value'})[['status', 'value', 'count']]

print(data)

      status  value  count
id
001      air     17      3
002   ground      8      2
003   ground     -3      3
004  unknown      0      1
005   ground      6      1

Here we have ensured the air > ground > unknown precedence is respected without needing to change the column type to categorical, as done in ayhan's very elegant answer.

The group_status() function does lay the groundwork should you wish to incorporate more advanced groupby functionality.
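
For example (an illustrative sketch only, and meant to run against the original data frame, i.e. before the groupby/agg reassignment above), a custom per-group function passed to apply() can build several derived columns in one pass; the has_negative flag here is made up for demonstration:

def summarise(group):
    # One output row per id; reuses group_status() from above
    return pd.Series({'status': group_status(group['status']),
                      'value': group['value'].sum(),
                      'count': len(group),
                      'has_negative': (group['value'] < 0).any()})

summary = data.groupby('id').apply(summarise)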

Upvotes: 1

user2285236

Reputation:

One option would be changing the type of the status column to an ordered category and sorting on it before the groupby/agg:

data['status'] = data['status'].astype('category', categories=['air', 'ground', 'unknown'], ordered=True)

data.sort_values('status').groupby('id').agg({'status': 'first', 'value': ['sum', 'count']})
Out: 
      status value      
       first   sum count
id                      
001      air    17     3
002   ground     8     2
003   ground    -3     3
004  unknown     0     1
005   ground     6     1

Here, since the values are sorted in 'air', 'ground', 'unknown' order, 'first' returns the correct value. If you don't want to change the type, you can define your own function that returns 'air'/'ground'/'unknown' and pass that function instead of 'first'.
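
A minimal sketch of that alternative (pick_status is just an illustrative name, not an existing helper):

def pick_status(statuses):
    # 'air' beats 'ground', which beats 'unknown'
    if (statuses == 'air').any():
        return 'air'
    if (statuses == 'ground').any():
        return 'ground'
    return 'unknown'

data.groupby('id').agg({'status': pick_status, 'value': ['sum', 'count']})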

Upvotes: 2
