Anyone
Anyone

Reputation: 37

Pandas - How to groupby and remove specifc rows

I have a DF like this:

id     company     duration
0    Other Company    5
0    Other Company    19
0    X Company        7
1    Other Company    24
1    Other Company    6
1    X Company        12
2    X Company        9
3    Other Company    30
3    X Company        16

I need to group the DF by ID and Company and then sum the duration in each. In the end I need only the values with 'X Company'. This is what I did:

import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)

And got this:

id     company     duration
0    Other Company    24
0    X Company        7
1    Other Company    30
1    X Company        12
2    X Company        9
3    Other Company    30
3    X Company        16

Now I need remove all entrys from 'Other Company'. Already tried using time_in_company.drop('Any Company') #Return KeyError 'Any Company'

Tried to .set_index('company'), in order to try something else, but it tells me 'Series' object has no attribute 'set_index'

Tried to use a .filter() in the groupby but I need the .agg(sum). (And it didn't work anyway..

Can someone shed some light in the issue for me? Thanks in advance.

Upvotes: 2

Views: 74

Answers (2)

2Obe
2Obe

Reputation: 3710

First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:

import numpy as np
import pandas as pd


ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]

df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})


df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)

You get:

ids  company      
0    Other Company    24
1    Other Company    30
3    Other Company    30
Name: duration, dtype: int64

EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:

df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')

You get:

duration
ids company                
0.0 Other Company      24.0
1.0 Other Company      30.0
3.0 Other Company      30.0

Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 0

pizza lover
pizza lover

Reputation: 523

Does this help?

time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"] 

Upvotes: 1

Related Questions