No_body

Reputation: 842

Count of successive similar strings

I have a data-frame:

   Id  Seqno. Event
    1     2    A 
    1     3    B 
    1     5    A 
    1     6    A 
    1     7    A 
    1     8    B 
    1     9    C 
    1    10    D 

I want to filter the data-frame by how many times in a row an event happened. For example, if I ask for Event A happening more than 2 times successively, it should return all the matching Ids as:

 Id Event count 
  1  A   3

So far I have tried

   df['new'] = df['Event'].shift()+ df['Event']

   a= df[df['new']=='AA']

   a[a['Id'].isin(a['Id'].value_counts()[a['Id'].value_counts()>2].index)]

But it doesn't seem to work.

Upvotes: 0

Views: 83

Answers (2)

Rocky Li

Reputation: 5958

A function that does not use pandas internals (which is arguably a better way to do this):

import numpy as np

# df is the DataFrame from the question
def eventmagic(event="A", num=2):
    subdf = df[(df["Event"] == event) & (df["Seqno."] > num)].sort_values(by="Seqno.")
    arr = subdf["Seqno."].values - np.arange(len(subdf)) # 5,6,7 to 5,5,5
    if len(arr) == 0: return 0
    i = 0
    while arr[i] == arr[0]:
        i += 1
        if i >= len(subdf):
            break
    return i

>>> eventmagic("B", 2)
1

>>> eventmagic("A", 1)
1

>>> eventmagic("A", 3)
3

>>> eventmagic("A", 10)
0

Upvotes: -1

user3483203

Reputation: 51155

This problem can be split into two parts. First, you want to group by both Id and runs of consecutive identical elements in the Event series. This can be accomplished using shift + cumsum:

m = df.Event.ne(df.Event.shift()).cumsum()
df['count'] = df.groupby(['Id', m])['Event'].transform('size')

print(df)

   Id  Seqno. Event  count
0   1       2     A      1
1   1       3     B      1
2   1       5     A      3
3   1       6     A      3
4   1       7     A      3
5   1       8     B      1
6   1       9     C      1
7   1      10     D      1
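
To see why shift + cumsum labels the runs, it can help to look at the intermediate values. The following is a minimal, self-contained sketch that rebuilds the question's data; the names `changed` and `run_id` are only illustrative and do not appear in the code above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Id": [1] * 8,
    "Seqno.": [2, 3, 5, 6, 7, 8, 9, 10],
    "Event": ["A", "B", "A", "A", "A", "B", "C", "D"],
})

# True wherever the event differs from the previous row
# (the first row compares against NaN, so it is always True)
changed = df["Event"].ne(df["Event"].shift())

# The running total of the change flags gives every consecutive run its own label
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 2, 3, 3, 3, 4, 5, 6]
```

The three consecutive A rows share the label 3, which is why grouping on that label together with Id yields a size of 3 for them.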

This gives us a count column holding the length of the sequential run that each row belongs to, but now we want to make lookup simple. We can drop_duplicates, so that each Id/Event/count combination appears only once, and then use boolean indexing:

f = df[['Id', 'Event', 'count']].drop_duplicates()
f.loc[f.Event.eq('A') & f['count'].gt(2)]

   Id Event  count
2   1     A      3
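
If you need this lookup for several events or thresholds, the two steps could be wrapped in a small helper. This is only a sketch under the assumption that the columns are named as in the question; the function name `runs_longer_than` is made up for illustration, and unlike the code above it does not add a count column to the original frame.

```python
import pandas as pd

def runs_longer_than(df, event, min_len):
    """Return one row per Id/Event/count where `event` occurred more than `min_len` times in a row."""
    # Label each run of consecutive identical events
    run_id = df["Event"].ne(df["Event"].shift()).cumsum()
    # Length of the run each row belongs to, computed per Id
    counts = df.groupby(["Id", run_id])["Event"].transform("size")
    # Collapse to unique Id/Event/count combinations, then filter
    out = df[["Id", "Event"]].assign(count=counts).drop_duplicates()
    return out[out["Event"].eq(event) & out["count"].gt(min_len)]

# Sample data copied from the question
df = pd.DataFrame({
    "Id": [1] * 8,
    "Seqno.": [2, 3, 5, 6, 7, 8, 9, 10],
    "Event": ["A", "B", "A", "A", "A", "B", "C", "D"],
})

result = runs_longer_than(df, "A", 2)
print(result)
```

Note that, because of drop_duplicates, two separate runs of the same event with the same length inside one Id show up as a single row.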

Upvotes: 3
