Reputation: 61084
I'd like to find a particular pattern in a pandas dataframe column, and return the corresponding index values in order to subset the dataframe.
Here's a sample dataframe with a possible pattern:
Snippet to produce dataframe:
import pandas as pd
import numpy as np
Observations = 10
Columns = 2
np.random.seed(123)
df = pd.DataFrame(np.random.randint(90,110,size=(Observations, Columns)),
columns = ['ColA','ColB'])
datelist = pd.date_range(pd.datetime(2017, 7, 7).strftime('%Y-%m-%d'),
periods=Observations).tolist()
df['Dates'] = datelist
df = df.set_index(['Dates'])
pattern = [100,90,105]
print(df)
Dataframe:
ColA ColB
Dates
2017-07-07 103 92
2017-07-08 92 96
2017-07-09 107 109
2017-07-10 100 91
2017-07-11 90 107
2017-07-12 105 99
2017-07-13 90 104
2017-07-14 90 105
2017-07-15 109 104
2017-07-16 94 90
Here, the pattern of interest occurs in Column A
on the dates 2017-07-10
to 2017-07-12
, and that's what I'd like to end up with:
Desired output:
2017-07-10 100 91
2017-07-11 90 107
2017-07-12 105 99
If the same pattern occurs several times, I would like to subset the dataframe the same way, and also count how many times the pattern occurs, but I hope that's more straight forward as long as I get the first step sorted out.
Thank you for any suggestions!
Upvotes: 5
Views: 19335
Reputation: 1212
While Deena's answer is elegant, it is horribly slow for longer sequences.
Flab's solution on the other hand is possibly faster, but it only works for numeric values.
Therefore, I want to propose a different solution, that is fast and works for all data-types. For the following I assume the data-type to be a string. If it is not in your case, you should convert it first.
rolling_cat = lambda s, n: pd.Series(zip(*[s.shift(-x) for x in range(n)])).str.join(",")
indices = df.loc[rolling_cat(df["ColA"], n=len(pattern) == (",".join(pattern))].index
Here, I define a function that concatenates every the next n entries to each entry in the column. This means, that each row now contains the potential pattern in the form of a string
Then I compare these patterns to a string representation of the pattern by concatenating the each entry of the pattern.
The reason why this is much faster than Deena's approach is, that string comparison is much faster than comparing each item in a list.
Upvotes: 1
Reputation: 6213
Using the magic of list comprehensions:
[df.index[i - len(pattern)] # Get the datetime index
for i in range(len(pattern), len(df)) # For each 3 consequent elements
if all(df['ColA'][i-len(pattern):i] == pattern)] # If the pattern matched
# [Timestamp('2017-07-10 00:00:00')]
Upvotes: 7
Reputation: 566
for col in df:
index = df[col][(df[col] == pattern[0]) & (df[col].shift(-1) == pattern[1]) & (df[col].shift(-2) == pattern[2])].index
if not index.empty: print(index)
Upvotes: 1
Reputation: 527
The shortest way is finding the index at which the pattern starts. Then you just need to select the three following rows.
In order to find these indexes, a one-liner is enough:
indexes=df[(df.ColA==pattern[0])&(df["ColA"].shift(-1)==pattern[1])&(df["ColA"].shift(-2)==pattern[2])].index
Then do as the other answer says to get the subsets that you want.
Upvotes: 1
Reputation: 7476
Here is a solution:
Check if the pattern was found in any of the columns using rolling. This will give you the last index of the group matching the pattern
matched = df.rolling(len(pattern)).apply(lambda x: all(np.equal(x, pattern)))
matched = matched.sum(axis = 1).astype(bool) #Sum to perform boolean OR
matched
Out[129]:
Dates
2017-07-07 False
2017-07-08 False
2017-07-09 False
2017-07-10 False
2017-07-11 False
2017-07-12 True
2017-07-13 False
2017-07-14 False
2017-07-15 False
2017-07-16 False
dtype: bool
For each match, add the indexes of the complete pattern:
idx_matched = np.where(matched)[0]
subset = [range(match-len(pattern)+1, match+1) for match in idx_matched]
Get all the patterns:
result = pd.concat([df.iloc[subs,:] for subs in subset], axis = 0)
result
Out[128]:
ColA ColB
Dates
2017-07-10 100 91
2017-07-11 90 107
2017-07-12 105 99
Upvotes: 6