user3013706

Reputation: 451

How to use groupby to avoid a loop in Python

There are several columns in the data; three of them are named "candidate_id", "enddate", and "TitleLevel".

Within the same candidate_id, if two records have the same enddate, I want to delete the one with the lower TitleLevel.

For example, given:

candidate_id   startdate     enddate   TitleLevel
    1          2012.1.1      2013.5.1     2
    1          2011.1.1      2013.5.1     4
    1          2008.12.1     2010.1.1     3
    2          2010.10.1     2012.12.1    2

What I want is:

candidate_id   startdate     enddate   TitleLevel
    1          2011.1.1      2013.5.1     4
    1          2008.12.1     2010.1.1     3
    2          2010.10.1     2012.12.1    2

That is, I delete the row with candidate_id=1, enddate=2013.5.1, and TitleLevel=2.
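For reference, here is a minimal snippet that rebuilds the example data (the dates are kept as plain strings here, which is a simplification of my real data):

import pandas as pd

# Minimal reconstruction of the example frame; dates are plain strings,
# which is a simplification of the real JobData.
JobData = pd.DataFrame({
    'candidate_id': [1, 1, 1, 2],
    'startdate':    ['2012.1.1', '2011.1.1', '2008.12.1', '2010.10.1'],
    'enddate':      ['2013.5.1', '2013.5.1', '2010.1.1', '2012.12.1'],
    'TitleLevel':   [2, 4, 3, 2],
})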

I have come up with a loop:

# nrow is the number of rows in JobData; walk from the bottom up so that
# dropping a row does not shift the indexes still to be visited
for i in range(nrow - 2, -1, -1):
    if (JobData['enddate'][i] == JobData['enddate'][i+1]
            and JobData['candidate_id'][i] == JobData['candidate_id'][i+1]
            and pd.notnull(JobData['enddate'][i])):
        # keep whichever of the two adjacent rows has the higher TitleLevel
        if JobData['TitleLevel'][i] > JobData['TitleLevel'][i+1]:
            JobData = JobData.drop(i+1)
        else:
            JobData = JobData.drop(i)

The loop takes quite some time to delete the redundant rows. Is there a faster method?

Upvotes: 5

Views: 380

Answers (2)

Andy Hayden

Reputation: 375445

Assuming that the data is sorted by startdate (at least within each group), you can use groupby with last():

In [11]: df.groupby(['candidate_id', 'enddate'], as_index=False).last()
Out[11]: 
   candidate_id    enddate  startdate  TitleLevel
0             1   2010.1.1  2008.12.1           3
1             1   2013.5.1   2011.1.1           4
2             2  2012.12.1  2010.10.1           2
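If the ordering within each group cannot be relied on, one sketch (using the newer sort_values API, not part of the answer above) is to sort by TitleLevel first so the highest level ends up last in each group:

# Sketch: sort so the highest TitleLevel comes last within each group,
# then take the last row per (candidate_id, enddate).
result = (df.sort_values('TitleLevel')
            .groupby(['candidate_id', 'enddate'], as_index=False)
            .last())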

Upvotes: 1

alko

Reputation: 48317

If your data structure is exactly as you describe, you can use groupby/max:

>>> df
   candidate_id    enddate  TitleLevel
0             1   2013.5.1           2
1             1   2013.5.1           4
2             1   2010.1.1           3
3             2  2012.12.1           2
>>> df.groupby(['candidate_id','enddate']).max().reset_index()
   candidate_id    enddate  TitleLevel
0             1   2010.1.1           3
1             1   2013.5.1           4
2             2  2012.12.1           2

Here groupby groups rows with equal candidate_id and enddate, and max() evaluates the maximum TitleLevel within each group. The result is the same as if the rows with all other values had been dropped.

In case you have more columns,

>>> df
   candidate_id    enddate  TitleLevel other_column
0             1   2013.5.1           2          foo
1             1   2013.5.1           4          bar
2             1   2010.1.1           3       foobar
3             2  2012.12.1           2       barfoo

you can get the indexes of the rows with the maximum values (without sorting, if the row order has to be preserved):

>>> idx = df.groupby(['candidate_id','enddate'], sort=False)['TitleLevel'].agg(lambda x: x.idxmax())

and filter the needed rows with ix:

>>> df.ix[idx]
   candidate_id    enddate  TitleLevel other_column
1             1   2013.5.1           4          bar
2             1   2010.1.1           3       foobar
3             2  2012.12.1           2       barfoo
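Note that .ix has since been removed from pandas; on current versions the label-based .loc gives the same result here:

>>> df.loc[idx]   # .loc is the modern, label-based replacement for .ix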

Upvotes: 2
