vestland
vestland

Reputation: 61084

Pandas: How to find a particular pattern in a dataframe column?

I'd like to find a particular pattern in a pandas dataframe column, and return the corresponding index values in order to subset the dataframe.

Here's a sample dataframe with a possible pattern:

Snippet to produce dataframe:

import pandas as pd
import numpy as np

Observations = 10
Columns = 2
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
                  columns = ['ColA','ColB'])
datelist = pd.date_range(pd.datetime(2017, 7, 7).strftime('%Y-%m-%d'),
                         periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])

pattern = [100,90,105]
print(df)

Dataframe:

            ColA  ColB
Dates                 
2017-07-07   103    92
2017-07-08    92    96
2017-07-09   107   109
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99
2017-07-13    90   104
2017-07-14    90   105
2017-07-15   109   104
2017-07-16    94    90

Here, the pattern of interest occurs in Column A on the dates 2017-07-10 to 2017-07-12, and that's what I'd like to end up with:

Desired output:

2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99

If the same pattern occurs several times, I would like to subset the dataframe the same way, and also count how many times the pattern occurs, but I hope that's more straight forward as long as I get the first step sorted out.

Thank you for any suggestions!

Upvotes: 5

Views: 19335

Answers (5)

chillichief
chillichief

Reputation: 1212

While Deena's answer is elegant, it is horribly slow for longer sequences.

Flab's solution on the other hand is possibly faster, but it only works for numeric values.

Therefore, I want to propose a different solution, that is fast and works for all data-types. For the following I assume the data-type to be a string. If it is not in your case, you should convert it first.

rolling_cat = lambda s, n: pd.Series(zip(*[s.shift(-x) for x in range(n)])).str.join(",")

indices = df.loc[rolling_cat(df["ColA"], n=len(pattern) == (",".join(pattern))].index

Here, I define a function that concatenates every the next n entries to each entry in the column. This means, that each row now contains the potential pattern in the form of a string

Then I compare these patterns to a string representation of the pattern by concatenating the each entry of the pattern.

The reason why this is much faster than Deena's approach is, that string comparison is much faster than comparing each item in a list.

Upvotes: 1

Deena
Deena

Reputation: 6213

Using the magic of list comprehensions:

[df.index[i - len(pattern)] # Get the datetime index 
 for i in range(len(pattern), len(df)) # For each 3 consequent elements 
 if all(df['ColA'][i-len(pattern):i] == pattern)] # If the pattern matched 

# [Timestamp('2017-07-10 00:00:00')]

Upvotes: 7

SayPy
SayPy

Reputation: 566

for col in df:
    index = df[col][(df[col] == pattern[0]) & (df[col].shift(-1) == pattern[1]) & (df[col].shift(-2) == pattern[2])].index
    if not index.empty: print(index)

Upvotes: 1

baloo
baloo

Reputation: 527

The shortest way is finding the index at which the pattern starts. Then you just need to select the three following rows.

In order to find these indexes, a one-liner is enough:

indexes=df[(df.ColA==pattern[0])&(df["ColA"].shift(-1)==pattern[1])&(df["ColA"].shift(-2)==pattern[2])].index

Then do as the other answer says to get the subsets that you want.

Upvotes: 1

FLab
FLab

Reputation: 7476

Here is a solution:

Check if the pattern was found in any of the columns using rolling. This will give you the last index of the group matching the pattern

matched = df.rolling(len(pattern)).apply(lambda x: all(np.equal(x, pattern)))
matched = matched.sum(axis = 1).astype(bool)   #Sum to perform boolean OR

matched
Out[129]: 
Dates
2017-07-07    False
2017-07-08    False
2017-07-09    False
2017-07-10    False
2017-07-11    False
2017-07-12     True
2017-07-13    False
2017-07-14    False
2017-07-15    False
2017-07-16    False
dtype: bool

For each match, add the indexes of the complete pattern:

idx_matched = np.where(matched)[0]
subset = [range(match-len(pattern)+1, match+1) for match in idx_matched]

Get all the patterns:

result = pd.concat([df.iloc[subs,:] for subs in subset], axis = 0)

result
Out[128]: 
            ColA  ColB
Dates                 
2017-07-10   100    91
2017-07-11    90   107
2017-07-12   105    99

Upvotes: 6

Related Questions