Reputation: 3177
I'm using python 2.7. From a given data as follows:
data = pd.DataFrame({'id':['001','001','001','002','002','003','003','003','004','005'],
'status':['ground','unknown','air','ground','unknown','ground','unknown','unknown','unknown','ground'],
'value':[10,-5,12,20,-12,2,-4,-1,0,6]})
The data looks like this:
id status value
001 ground 10
001 unknown -5
001 air 12
002 ground 20
002 unknown -12
003 ground 2
003 unknown -4
003 unknown -1
004 unknown 0
005 ground 6
I would like to get the output in dataframe that has unique id conditional to the following criteria: For a given id
'status': If 'air' does exist, pick 'air'.
If 'air' does not exist, pick 'ground'.
If both 'air' and 'ground' do not exist, pick 'unknown'.
'value': Sum of values for each id
'count': Count the number of rows for each id
Therefore, the expected output is the following.
id status value count
001 air 17 3
002 ground 8 2
003 ground -3 3
004 unknown 0 1
005 ground 6 1
I can do looping for each unique id but it is not elegant enough and computation is also expensive, especially when data becomes large. May i know the better pythonic style and more efficient way to come up with this output? Thank you in advance.
Upvotes: 0
Views: 1570
Reputation: 4744
You want to use groupby
on id. This is easy for value and count but trickier for the status. We need to write our own function which takes a pandas Series and returns a single attribute.
def group_status(x):
if (x=='air').any():
y = 'air'
elif (x=='ground').any():
y = 'ground'
else:
y = 'unknown'
return y
data = data.groupby(by='id').agg({'value': ['sum', 'count'], 'status': [group_status]})
data.columns = ['status', 'value', 'count']
print(data)
status value count
id
001 air 17 3
002 ground 8 2
003 ground -3 3
004 unknown 0 1
005 ground 6 1
Here we have ensured that the air, ground, unknown order is preserved without the need to change the column type to categorical, as mentioned in ayhan's very elegant answer.
The group_status()
function does lay the groundwork should you wish to incorporate more advanced groupby functionality.
Upvotes: 1
Reputation:
One option would be changing the type of status column to category and sorting based on that in groupby.agg:
df['status'] = df['status'].astype('category', categories=['air', 'ground', 'unknown'], ordered=True)
df.sort_values('status').groupby('id').agg({'status': 'first', 'value': ['sum', 'count']})
Out:
status value
first sum count
id
001 air 17 3
002 ground 8 2
003 ground -3 3
004 unknown 0 1
005 ground 6 1
Here, since the values are sorted in 'air'
, 'ground'
and 'unknown'
order, 'first'
returns the correct value. If you don't want to change the type, you can define your own function that returns air
/ground
/unknown
and instead of 'first'
you can pass that function.
Upvotes: 2