Reputation: 37
I have a DF like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DF by ID and Company and then sum the duration in each. In the end I need only the values with 'X Company'. This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
And got this:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
Now I need remove all entrys from 'Other Company'. Already tried using time_in_company.drop('Any Company') #Return KeyError 'Any Company'
Tried to .set_index('company'), in order to try something else, but it tells me 'Series' object has no attribute 'set_index'
Tried to use a .filter() in the groupby but I need the .agg(sum). (And it didn't work anyway..
Can someone shed some light in the issue for me? Thanks in advance.
Upvotes: 2
Views: 74
Reputation: 3710
First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 0
Reputation: 523
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"]
Upvotes: 1