Count of successive similar strings

Question

I have a data-frame:

   Id  Seqno. Event
    1     2    A 
    1     3    B 
    1     5    A 
    1     6    A 
    1     7    A 
    1     8    B 
    1     9    C 
    1    10    D

I want to filter the data-frame by how many times an event happened successively. For example, if I ask for Event A > 2, it should return all the Ids where A occurred more than two times in a row:

 Id Event count 
  1  A   3

So far I have tried

   df['new'] = df['Event'].shift()+ df['Event']

   a= df[df['new']=='AA']

   a[a['Id'].isin(a['Id'].value_counts()[a['Id'].value_counts()>2].index)]

But it doesn't seem to work.

user3483203 · Accepted Answer

This problem can be split into two parts. First, you want to group by both Id and runs of consecutive equal values in the Event series. This can be accomplished using shift + cumsum:

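# m increases by 1 every time Event differs from the previous row, so each run gets its own label
m = df.Event.ne(df.Event.shift()).cumsum()
# size of each (Id, run) group, broadcast back onto every row of that run
df['count'] = df.groupby(['Id', m])['Event'].transform('size')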

print(df)

   Id  Seqno. Event  count
0   1       2     A      1
1   1       3     B      1
2   1       5     A      3
3   1       6     A      3
4   1       7     A      3
5   1       8     B      1
6   1       9     C      1
7   1      10     D      1
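
To see why shift + cumsum labels the runs, it can help to print the intermediate pieces. This is only a small sketch that rebuilds the sample frame from the question; the names `starts_run` and `run_id` are made up for illustration and are not part of the answer's code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": [1] * 8,
    "Seqno.": [2, 3, 5, 6, 7, 8, 9, 10],
    "Event": ["A", "B", "A", "A", "A", "B", "C", "D"],
})

# True wherever the event differs from the previous row, i.e. a new run starts.
# The first row compares against NaN, so it always counts as a new run.
starts_run = df["Event"].ne(df["Event"].shift())

# Summing the booleans cumulatively gives every run its own integer label.
run_id = starts_run.cumsum()

print(starts_run.tolist())  # [True, True, True, False, False, True, True, True]
print(run_id.tolist())      # [1, 2, 3, 3, 3, 4, 5, 6]
```

Rows 2 to 4 share the label 3, which is why grouping on it yields a size of 3 for the run of A's.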

This gives us a column holding the length of the run each row belongs to, but now we want to make lookup simple. We can use drop_duplicates so that each distinct Id/Event/count combination appears only once, and then use boolean indexing:

f = df[['Id', 'Event', 'count']].drop_duplicates()
f.loc[f.Event.eq('A') & f['count'].gt(2)]

   Id Event  count
2   1     A      3
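
If you need this lookup for several events or thresholds, the two steps could be wrapped in a small helper. The sketch below is one possible way to do that, not anything pandas provides: the function name `long_runs`, its parameters, and the two-Id `sample` frame are all hypothetical. The sample is chosen so that Id 1 ends with A and Id 2 starts with A, to show that grouping on Id keeps a run from spilling across Ids.

```python
import pandas as pd

def long_runs(df, event, more_than):
    """Hypothetical helper: Ids where `event` occurred more than `more_than` times in a row."""
    # Label runs of identical consecutive events (same shift + cumsum trick as above).
    run_id = df["Event"].ne(df["Event"].shift()).cumsum()
    out = df[["Id", "Event"]].copy()
    # Grouping by Id as well as run_id stops a run from spanning two Ids.
    out["count"] = df.groupby(["Id", run_id])["Event"].transform("size")
    # One row per distinct Id/Event/count combination.
    out = out.drop_duplicates()
    return out[out["Event"].eq(event) & out["count"].gt(more_than)]

# Made-up data: Id 1 has a run of three A's, Id 2 only a run of two.
sample = pd.DataFrame({
    "Id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "Event": ["B", "A", "A", "A", "A", "A", "B", "A"],
})

result = long_runs(sample, "A", 2)
print(result)
```

Here only Id 1 should come back, with a count of 3, even though five A's sit next to each other across the Id boundary.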
