user8495738

Reputation: 147

Python Pandas: Find a pattern in a DataFrame

I have the following DataFrame (1.2 million rows):

df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})

Now I am trying to find sequences. Each "beginn" should be matched with the first "end" whose distance, based on column B, is at least 40. For the provided DataFrame that would mean: [image of the expected sequences]

Your help is highly appreciated.

Upvotes: 1

Views: 2234

Answers (1)

I will assume that the output you want is a list of sequences, each with its starting and ending value. The second sequence you identify in your picture has a distance lower than 40, so I also assumed that was an error.

import pandas as pd
from collections import namedtuple
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})

sequence_list = []
Sequence = namedtuple('Sequence', ['beginn', 'end'])

beginn_flag = False
beginn_value = 0
for i, row in df_test_2.iterrows():
    state = row['A']
    value = row['B']

    if not beginn_flag and state == 'beginn':
        beginn_flag = True
        beginn_value = value 
    elif beginn_flag and state == 'end':
        if value >= beginn_value + 40:
            new_seq = Sequence(beginn_value, value)
            sequence_list.append(new_seq)
            beginn_flag = False

print(sequence_list)

This code outputs the following:

[Sequence(beginn=10, end=50), Sequence(beginn=70, end=110)]

Two sequences, one starting at 10 and ending at 50 and the other one starting at 70 and ending at 110.
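Since the question mentions about 1.2 million rows, `iterrows()` may be slow, because it builds a Series object for every row. Below is a minimal sketch of the same matching rule that iterates over the raw column arrays instead. The function name `find_sequences` and the `min_distance` parameter are my own names for illustration, and I have not benchmarked it on data of that size.

```python
import pandas as pd


def find_sequences(df, min_distance=40):
    """Return (beginn, end) pairs using the same rule as the loop above.

    Hypothetical helper: name and signature are assumptions for this sketch.
    """
    sequences = []
    beginn_value = None  # None means no "beginn" is currently waiting for an "end"

    # Iterating over plain arrays avoids the per-row Series that iterrows() builds.
    for state, value in zip(df["A"].to_numpy(), df["B"].to_numpy()):
        if beginn_value is None:
            if state == "beginn":
                beginn_value = value
        elif state == "end" and value >= beginn_value + min_distance:
            sequences.append((int(beginn_value), int(value)))
            beginn_value = None  # wait for the next "beginn"

    return sequences


df_test_2 = pd.DataFrame({
    "A": ["end", "beginn", "end", "end", "beginn", "beginn",
          "end", "end", "end", "beginn", "end"],
    "B": [1, 10, 50, 60, 70, 80, 90, 100, 110, 111, 112],
})

result = find_sequences(df_test_2)
print(result)  # [(10, 50), (70, 110)]
```

As in the original loop, a second "beginn" that appears while one is still open is ignored, and an "end" that is closer than `min_distance` is skipped.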

Upvotes: 2
