mysterious_guy
mysterious_guy

Reputation: 435

counting the first occurence of duplicates in a dataframe

    Agent Amount repeat_count
    A20   50 USD
    A30   70 USD
    A60   80 USD
    A30   70 USD   1
    A20   57 USD
    A20   50 USD

SO, above is a small sample of my dataframe. I need to count sequential duplicates i.e. those agents which are transferring the same amount as they did in the previous transaction. For instance , A30 is transferring amount 70 USD twice in a row and hence I need to store the count. Even A20 is sending amount 50 USD twice , but between that it is also sending amount 57 USD . So , I do not want to store its count. Thanks in advance.

Upvotes: 0

Views: 488

Answers (1)

IanS
IanS

Reputation: 16251

Let's use a larger example:

Agent Amount
A20   50
A30   70
A60   80
A30   70
A20   57
A20   50
A30   70
A30   80
A30   70

As usual with such a problem, we will use groupby to work on each agent separately, so we can solve the problem for a single agent first. Let:

df1 = df[df.Agent == 'A30']

The following will find when the same amount is repeated:

df1.Amount.shift() == df1.Amount

So you can count the number of occurrences with a cumsum:

In [11]: (df1.Amount.shift() == df1.Amount).cumsum()
Out[11]: 
1    0
3    1
6    2
7    2
8    2

Let's apply the above solution to the original dataframe:

In [12]: df.groupby('Agent').apply(
             lambda df1: (df1.Amount.shift() == df1.Amount).cumsum()
         )
Out[12]: 
Agent   
A20    0    0
       4    0
       5    0
A30    1    0
       3    1
       6    2
       7    2
       8    2
A60    2    0

In order to merge the result with the original dataframe, we need to drop the first level of the index (the agents):

repeat_count = df.groupby('Agent').apply(
                   lambda df1: (df1.Amount.shift() == df1.Amount).cumsum()
               )
pd.concat([df, repeat_count.reset_index(level=0, drop=True)], axis=1)

The function concat will merge based on index values, so the results in repeat_count are aligned with the original dataframe.

Upvotes: 1

Related Questions