user3013706

Reputation: 451

How to use groupby to avoid a loop in Python

There are several columns in the data; three of them are named "candidate_id", "enddate", and "TitleLevel".

Within the same candidate_id, if two records have the same enddate, I want to delete the one with the lower TitleLevel.

For example, given:

candidate_id   startdate     enddate   TitleLevel
    1          2012.1.1      2013.5.1     2
    1          2011.1.1      2013.5.1     4
    1          2008.12.1     2010.1.1     3
    2          2010.10.1     2012.12.1    2

What I want is:

candidate_id   startdate     enddate   TitleLevel
    1          2011.1.1      2013.5.1     4
    1          2008.12.1     2010.1.1     3
    2          2010.10.1     2012.12.1    2

That is, I delete the row with candidate_id=1, enddate=2013.5.1, and TitleLevel=2.
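For reference, here is a minimal snippet that rebuilds the example data (the dates are kept as plain strings here, which is a simplification of my real data):

import pandas as pd

# Minimal reconstruction of the example frame; dates are plain strings,
# which is a simplification of the real JobData.
JobData = pd.DataFrame({
    'candidate_id': [1, 1, 1, 2],
    'startdate':    ['2012.1.1', '2011.1.1', '2008.12.1', '2010.10.1'],
    'enddate':      ['2013.5.1', '2013.5.1', '2010.1.1', '2012.12.1'],
    'TitleLevel':   [2, 4, 3, 2],
})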

I have come up with a loop:

# nrow is the number of rows in JobData; walk from the bottom up so that
# dropping a row does not shift the indexes still to be visited
for i in range(nrow - 2, -1, -1):
    if (JobData['enddate'][i] == JobData['enddate'][i+1]
            and JobData['candidate_id'][i] == JobData['candidate_id'][i+1]
            and pd.notnull(JobData['enddate'][i])):
        # keep whichever of the two adjacent rows has the higher TitleLevel
        if JobData['TitleLevel'][i] > JobData['TitleLevel'][i+1]:
            JobData = JobData.drop(i+1)
        else:
            JobData = JobData.drop(i)

The loop takes quite some time to delete the redundant rows. Is there a faster method?

Upvotes: 5

Views: 380

Answers (2)

Andy Hayden

Reputation: 375445

Assuming that the data is sorted by startdate (at least within each group), you can use groupby with last():

In [11]: df.groupby(['candidate_id', 'enddate'], as_index=False).last()
Out[11]: 
   candidate_id    enddate  startdate  TitleLevel
0             1   2010.1.1  2008.12.1           3
1             1   2013.5.1   2011.1.1           4
2             2  2012.12.1  2010.10.1           2
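If the ordering within each group cannot be relied on, one sketch (using the newer sort_values API, not part of the answer above) is to sort by TitleLevel first so the highest level ends up last in each group:

# Sketch: sort so the highest TitleLevel comes last within each group,
# then take the last row per (candidate_id, enddate).
result = (df.sort_values('TitleLevel')
            .groupby(['candidate_id', 'enddate'], as_index=False)
            .last())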

Upvotes: 1

alko

Reputation: 48317

If your data structure is exactly as you describe, you can use groupby/max:

>>> df
   candidate_id    enddate  TitleLevel
0             1   2013.5.1           2
1             1   2013.5.1           4
2             1   2010.1.1           3
3             2  2012.12.1           2
>>> df.groupby(['candidate_id','enddate']).max().reset_index()
   candidate_id    enddate  TitleLevel
0             1   2010.1.1           3
1             1   2013.5.1           4
2             2  2012.12.1           2

Here groupby groups rows with equal candidate_id and enddate, and max() evaluates the maximum TitleLevel within each group. The result is the same as if the rows with all other values had been dropped.

In case you have more columns,

>>> df
   candidate_id    enddate  TitleLevel other_column
0             1   2013.5.1           2          foo
1             1   2013.5.1           4          bar
2             1   2010.1.1           3       foobar
3             2  2012.12.1           2       barfoo

you can get the indexes of the rows with the maximum values (without sorting, if the row order has to be preserved):

>>> idx = df.groupby(['candidate_id','enddate'], sort=False)['TitleLevel'].agg(lambda x: x.idxmax())

and filter the needed rows with ix:

>>> df.ix[idx]
   candidate_id    enddate  TitleLevel other_column
1             1   2013.5.1           4          bar
2             1   2010.1.1           3       foobar
3             2  2012.12.1           2       barfoo
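Note that .ix has since been removed from pandas; on current versions the label-based .loc gives the same result here:

>>> df.loc[idx]   # .loc is the modern, label-based replacement for .ix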

Upvotes: 2
