iskandarblue

Reputation: 7526

Counting recurring sequences in pandas

I am new to identifying patterns with Python and could use some direction. I have a large data set whose sample I've pasted below:

My objective is to find each consecutive sequence 'foo', 'bar', 'baz' and, grouping by id, count how many times the pattern occurs.

id  class_name  created_at
0   1   foo 2019-02-08 19:11:04
1   1   bar 2019-02-08 19:11:34
2   1   foo 2019-02-08 19:12:04
3   1   baz 2019-02-08 19:12:35
4   1   bar 2019-02-08 19:13:05
5   1   foo 2019-02-08 19:13:35
6   1   bar 2019-02-08 19:14:04
7   1   baz 2019-02-08 19:14:35
8   1   foo 2019-02-08 19:15:05
9   1   bar 2019-02-08 19:15:35
10  1   baz 2019-02-08 19:16:03
11  2   foo 2019-02-08 19:16:34
12  2   bar 2019-02-08 19:17:07
13  2   foo 2019-02-08 19:17:42
14  2   bar 2019-02-08 19:18:04
15  2   baz 2019-02-08 19:18:34
16  2   baz 2019-02-08 19:19:04
17  2   bar 2019-02-08 19:19:34
18  2   bar 2019-02-08 19:20:04
19  2   foo 2019-02-08 19:20:34

For example, the output from the above dataset would look something like:

id  count           start_time             end_time
 1      2  2019-02-08 19:13:35  2019-02-08 19:16:03
 2      1  2019-02-08 19:17:42  2019-02-08 19:18:34

The column types are as follows:

id                     int64
class_name            object
created_at    datetime64[ns]
dtype: object

What modules would be best suited for this task?

Here is the data:

{'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'class_name': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'baz', 4: 'bar', 5: 'foo', 6: 'bar', 7: 'baz', 8: 'foo', 9: 'bar', 10: 'baz', 11: 'foo', 12: 'bar', 13: 'foo', 14: 'bar', 15: 'baz', 16: 'baz', 17: 'bar', 18: 'bar', 19: 'foo'}, 'created_at': {0: Timestamp('2019-02-08 19:11:04'), 1: Timestamp('2019-02-08 19:11:34'), 2: Timestamp('2019-02-08 19:12:04'), 3: Timestamp('2019-02-08 19:12:35'), 4: Timestamp('2019-02-08 19:13:05'), 5: Timestamp('2019-02-08 19:13:35'), 6: Timestamp('2019-02-08 19:14:04'), 7: Timestamp('2019-02-08 19:14:35'), 8: Timestamp('2019-02-08 19:15:05'), 9: Timestamp('2019-02-08 19:15:35'), 10: Timestamp('2019-02-08 19:16:03'), 11: Timestamp('2019-02-08 19:16:34'), 12: Timestamp('2019-02-08 19:17:07'), 13: Timestamp('2019-02-08 19:17:42'), 14: Timestamp('2019-02-08 19:18:04'), 15: Timestamp('2019-02-08 19:18:34'), 16: Timestamp('2019-02-08 19:19:04'), 17: Timestamp('2019-02-08 19:19:34'), 18: Timestamp('2019-02-08 19:20:04'), 19: Timestamp('2019-02-08 19:20:34')}}

Upvotes: 1

Views: 113

Answers (3)

giser_yugang

Reputation: 6166

You can use rolling() with a sequence-comparison function.

import numpy as np
import pandas as pd

# Encode class names as integers in order of first appearance:
# foo -> 0, bar -> 1, baz -> 2.
df['class_name'] = pd.factorize(df['class_name'])[0]

def custom_func(frame):
    # 1.0 where a rolling window of 3 codes equals [0, 1, 2] (foo, bar, baz)
    frame['match'] = frame['class_name'].rolling(3).apply(
        lambda x: np.array_equal(x, [0, 1, 2]), raw=True)
    # Each match ends at the current row, so its start is two rows earlier.
    frame['start_time'] = frame['created_at'].shift(2)
    return frame[frame['match'] == 1].agg(
        {'match': 'count', 'start_time': 'min', 'created_at': 'max'})

df = df.groupby('id').apply(custom_func).rename(
    columns={'match': 'count', 'created_at': 'end_time'})
print(df)

    count          start_time            end_time
id                                               
1       2 2019-02-08 19:13:35 2019-02-08 19:16:03
2       1 2019-02-08 19:17:42 2019-02-08 19:18:34
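As a quick sanity check on the two building blocks above (a minimal sketch on toy data, not part of the original answer):

```python
import pandas as pd
import numpy as np

# factorize assigns integer codes in order of first appearance,
# so foo, bar, baz become 0, 1, 2.
codes, uniques = pd.factorize(['foo', 'bar', 'foo', 'baz'])
print(codes.tolist())    # [0, 1, 0, 2]
print(list(uniques))     # ['foo', 'bar', 'baz']

# rolling(3) then flags every position where the last three codes
# equal [0, 1, 2]; the first two positions are NaN (incomplete window).
s = pd.Series([0, 1, 2, 0, 1, 2])
match = s.rolling(3).apply(lambda x: np.array_equal(x, [0, 1, 2]), raw=True)
print(match.tolist())    # [nan, nan, 1.0, 0.0, 0.0, 1.0]
```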

Upvotes: 1

Rajat Jain

Reputation: 2032

A simple solution can be:

import re

df.groupby("id").apply(lambda x: len(re.findall("foo bar baz", ' '.join(x['class_name']))))
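One caveat: re.findall matches substrings, so a class name like 'xfoo' would produce a false hit. Adding word boundaries (`\b`, my addition, not in the answer above) guards against that. A small runnable sketch on made-up data:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'id':         [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'class_name': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz',
                   'foo', 'bar', 'baz'],
})

# Join each id's class names into one string and count
# non-overlapping occurrences of the target pattern.
counts = df.groupby('id').apply(
    lambda x: len(re.findall(r'\bfoo bar baz\b', ' '.join(x['class_name']))))
print(counts.to_dict())   # {1: 2, 2: 1}
```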

Upvotes: 0

Matt W.

Reputation: 3722

This takes a few steps but ultimately gets there.

Initialize Data:

import pandas as pd
from pandas import Timestamp
import numpy as np


dict_ ={'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'class_name': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'baz', 4: 'bar', 5: 'foo', 6: 'bar', 7: 'baz', 8: 'foo', 9: 'bar', 10: 'baz', 11: 'foo', 12: 'bar', 13: 'foo', 14: 'bar', 15: 'baz', 16: 'baz', 17: 'bar', 18: 'bar', 19: 'foo'}, 'created_at': {0: Timestamp('2019-02-08 19:11:04'), 1: Timestamp('2019-02-08 19:11:34'), 2: Timestamp('2019-02-08 19:12:04'), 3: Timestamp('2019-02-08 19:12:35'), 4: Timestamp('2019-02-08 19:13:05'), 5: Timestamp('2019-02-08 19:13:35'), 6: Timestamp('2019-02-08 19:14:04'), 7: Timestamp('2019-02-08 19:14:35'), 8: Timestamp('2019-02-08 19:15:05'), 9: Timestamp('2019-02-08 19:15:35'), 10: Timestamp('2019-02-08 19:16:03'), 11: Timestamp('2019-02-08 19:16:34'), 12: Timestamp('2019-02-08 19:17:07'), 13: Timestamp('2019-02-08 19:17:42'), 14: Timestamp('2019-02-08 19:18:04'), 15: Timestamp('2019-02-08 19:18:34'), 16: Timestamp('2019-02-08 19:19:04'), 17: Timestamp('2019-02-08 19:19:34'), 18: Timestamp('2019-02-08 19:20:04'), 19: Timestamp('2019-02-08 19:20:34')}}
df=pd.DataFrame(dict_)

I shift the timestamps back two spots so that each row carries both the beginning and the end of a three-step window. I do this within the groups so that timestamps don't leak across ids:

df['end_time'] = df.groupby('id')['created_at'].shift(-2)
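For illustration, here is what the grouped shift does on a cut-down version of the sample (a sketch): the last two rows of each id get NaT rather than borrowing timestamps from the next id.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'created_at': pd.to_datetime([
        '2019-02-08 19:11:04', '2019-02-08 19:11:34', '2019-02-08 19:12:04',
        '2019-02-08 19:16:34', '2019-02-08 19:17:07', '2019-02-08 19:17:42',
    ]),
})

# Shift within each id, so end_time never crosses a group boundary.
df['end_time'] = df.groupby('id')['created_at'].shift(-2)
print(df['end_time'].isna().tolist())   # [False, True, True, False, True, True]
```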

To find spots where we have a sequential ['foo', 'bar', 'baz'], I zip df['class_name'] together with its shift(-1) and shift(-2):

[[x,y,z] for x,y,z in zip(df['class_name'], df['class_name'].shift(-1), df['class_name'].shift(-2))]
[['foo', 'bar', 'foo'],
 ['bar', 'foo', 'baz'],
 ['foo', 'baz', 'bar'],
 ['baz', 'bar', 'foo'],
 ['bar', 'foo', 'bar'],
 ['foo', 'bar', 'baz'],
 ['bar', 'baz', 'foo'],
 ['baz', 'foo', 'bar'],
 ['foo', 'bar', 'baz'],
 ['bar', 'baz', 'foo'],
 ['baz', 'foo', 'bar'],
 ['foo', 'bar', 'foo'],
 ['bar', 'foo', 'bar'],
 ['foo', 'bar', 'baz'],
 ['bar', 'baz', 'baz'],
 ['baz', 'baz', 'bar'],
 ['baz', 'bar', 'bar'],
 ['bar', 'bar', 'foo'],
 ['bar', 'foo', nan],
 ['foo', nan, nan]]

I then convert that to a NumPy array and compare it element-wise with the sequence we're looking for:

matches = np.array([[x,y,z] for x,y,z in zip(df['class_name'], df['class_name'].shift(-1), df['class_name'].shift(-2))]) == ['foo', 'bar', 'baz']
array([[ True,  True, False],
       [False, False,  True],
       [ True, False, False],
       [False,  True, False],
       [False, False, False],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [ True,  True, False],
       [False, False, False],
       [ True,  True,  True],
       [False, False,  True],
       [False, False, False],
       [False,  True, False],
       [False,  True, False],
       [False, False, False],
       [ True, False, False]])

Then, to get the boolean subsetting vector, I check .all() along each row of the array. This gives us the starting points:

vec = [row.all() for row in matches]
[False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False]

Now we subset and inspect:

subset = df.loc[vec]
id class_name          created_at            end_time
5    1        foo 2019-02-08 19:13:35 2019-02-08 19:14:35
8    1        foo 2019-02-08 19:15:05 2019-02-08 19:16:03
13   2        foo 2019-02-08 19:17:42 2019-02-08 19:18:34

Since we want per-id results, we can just groupby and agg to get the final answer:

subset.groupby('id').agg({'class_name':'count', 'created_at':'min', 'end_time':'max'})
    class_name          created_at            end_time
id                                                    
1            2 2019-02-08 19:13:35 2019-02-08 19:16:03
2            1 2019-02-08 19:17:42 2019-02-08 19:18:34
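To match the column names in the question's expected output, the aggregation can be followed by a rename (a sketch, rebuilding subset here from the three matched rows shown above so the snippet runs standalone):

```python
import pandas as pd

# The three matched starting rows from the subsetting step,
# with values copied from the question's sample data.
subset = pd.DataFrame({
    'id': [1, 1, 2],
    'class_name': ['foo', 'foo', 'foo'],
    'created_at': pd.to_datetime(['2019-02-08 19:13:35',
                                  '2019-02-08 19:15:05',
                                  '2019-02-08 19:17:42']),
    'end_time': pd.to_datetime(['2019-02-08 19:14:35',
                                '2019-02-08 19:16:03',
                                '2019-02-08 19:18:34']),
})

# Aggregate per id, then rename to the question's column names.
result = (subset.groupby('id')
                .agg({'class_name': 'count', 'created_at': 'min',
                      'end_time': 'max'})
                .rename(columns={'class_name': 'count',
                                 'created_at': 'start_time'}))
print(result)
```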

Upvotes: 2
