Pandas : Getting unique rows for a given column but conditional on some criteria of other columns

I'm using Python 2.7. I have data created as follows:

data = pd.DataFrame({'id':['001','001','001','002','002','003','003','003','004','005'],
                   'status':['ground','unknown','air','ground','unknown','ground','unknown','unknown','unknown','ground'],
                   'value':[10,-5,12,20,-12,2,-4,-1,0,6]})

The data looks like this:

id     status     value
001    ground       10
001    unknown      -5
001    air          12
002    ground       20
002    unknown     -12
003    ground        2
003    unknown      -4
003    unknown      -1
004    unknown       0
005    ground        6

I would like to get an output dataframe with one row per unique id, built according to the following criteria. For a given id:

 'status': If 'air' exists, pick 'air'.
           If 'air' does not exist, pick 'ground'.
           If neither 'air' nor 'ground' exists, pick 'unknown'.

 'value': Sum of values for each id
 'count': Count the number of rows for each id

Therefore, the expected output is the following.

id     status     value     count
001     air        17        3
002    ground       8        2
003    ground      -3        3
004    unknown      0        1
005    ground       6        1

I can loop over each unique id, but that is not elegant and the computation becomes expensive as the data grows. Is there a more Pythonic and efficient way to produce this output? Thank you in advance.
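
For context, here is a simplified sketch of the kind of per-id loop I would like to replace (the names are placeholders, not my exact code):

rows = []
for uid in data['id'].unique():
    sub = data[data['id'] == uid]
    if (sub['status'] == 'air').any():
        status = 'air'
    elif (sub['status'] == 'ground').any():
        status = 'ground'
    else:
        status = 'unknown'
    rows.append({'id': uid, 'status': status,
                 'value': sub['value'].sum(), 'count': len(sub)})
expected = pd.DataFrame(rows, columns=['id', 'status', 'value', 'count'])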

Upvotes: 0

Views: 1570

Answers (2)

Little Bobby Tables

Reputation: 4744

You want to use groupby on id. This is easy for value and count, but trickier for status: we need to write our own function that takes a pandas Series and returns a single value.

def group_status(x):
    # Precedence: 'air' beats 'ground', which beats 'unknown'
    if (x == 'air').any():
        y = 'air'
    elif (x == 'ground').any():
        y = 'ground'
    else:
        y = 'unknown'
    return y

data = data.groupby(by='id').agg({'value': ['sum', 'count'], 'status': [group_status]})
data.columns = data.columns.droplevel(0)  # keep only the aggregation names
data = data.rename(columns={'group_status': 'status',
                            'sum': 'value'})[['status', 'value', 'count']]

print(data)

      status  value  count
id
001      air     17      3
002   ground      8      2
003   ground     -3      3
004  unknown      0      1
005   ground      6      1

Here we have ensured the air > ground > unknown precedence is respected without needing to change the column type to categorical, as done in ayhan's very elegant answer.

The group_status() function does lay the groundwork should you wish to incorporate more advanced groupby functionality.
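
For example (an illustrative sketch only, and meant to run against the original data frame, i.e. before the groupby/agg reassignment above), a custom per-group function passed to apply() can build several derived columns in one pass; the has_negative flag here is made up for demonstration:

def summarise(group):
    # One output row per id; reuses group_status() from above
    return pd.Series({'status': group_status(group['status']),
                      'value': group['value'].sum(),
                      'count': len(group),
                      'has_negative': (group['value'] < 0).any()})

summary = data.groupby('id').apply(summarise)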

Upvotes: 1

user2285236

Reputation:

One option would be changing the type of the status column to an ordered category and sorting on it before the groupby/agg:

data['status'] = data['status'].astype('category', categories=['air', 'ground', 'unknown'], ordered=True)

data.sort_values('status').groupby('id').agg({'status': 'first', 'value': ['sum', 'count']})
Out: 
      status value      
       first   sum count
id                      
001      air    17     3
002   ground     8     2
003   ground    -3     3
004  unknown     0     1
005   ground     6     1

Here, since the values are sorted in 'air', 'ground', 'unknown' order, 'first' returns the correct value. If you don't want to change the type, you can define your own function that returns 'air'/'ground'/'unknown' and pass that function instead of 'first'.
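
A minimal sketch of that alternative (pick_status is just an illustrative name, not an existing helper):

def pick_status(statuses):
    # 'air' beats 'ground', which beats 'unknown'
    if (statuses == 'air').any():
        return 'air'
    if (statuses == 'ground').any():
        return 'ground'
    return 'unknown'

data.groupby('id').agg({'status': pick_status, 'value': ['sum', 'count']})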

Upvotes: 2
