Reputation: 147
I have the following DataFrame (1.2 million rows):
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})
Now I am trying to find sequences. Each "beginn" should be matched with the first "end" that occurs where the distance, based on column B, is at least 40.
For the provided Dataframe that would mean:
Your help is highly appreciated.
Upvotes: 1
Views: 2234
Reputation: 913
I will assume that as your output you want a list of sequences, each with its starting and ending value. The second sequence that you identify in your picture has a distance lower than 40, so I also assumed that this was an error.
import pandas as pd
from collections import namedtuple

df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})

sequence_list = []
Sequence = namedtuple('Sequence', ['beginn', 'end'])
beginn_flag = False  # True while a "beginn" is waiting for its matching "end"
beginn_value = 0

for i, row in df_test_2.iterrows():
    state = row['A']
    value = row['B']
    if not beginn_flag and state == 'beginn':
        # Open a new sequence; later "beginn" rows are ignored until it closes
        beginn_flag = True
        beginn_value = value
    elif beginn_flag and state == 'end':
        # Close the sequence only at the first "end" that is at least 40 away
        if value >= beginn_value + 40:
            new_seq = Sequence(beginn_value, value)
            sequence_list.append(new_seq)
            beginn_flag = False

print(sequence_list)
This code outputs the following:
[Sequence(beginn=10, end=50), Sequence(beginn=70, end=110)]
Two sequences, one starting at 10 and ending at 50 and the other one starting at 70 and ending at 110.
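Since the real DataFrame has 1.2 million rows, `iterrows()` may turn out to be slow, because it builds a Series object for every row. The sketch below is one possible variant of the same matching logic that iterates over the two columns directly. The function name `match_sequences` and the `min_distance` parameter are my own illustrative choices, not anything from pandas, and the data is written out as plain lists so that the example runs on its own.

```python
from collections import namedtuple

Sequence = namedtuple("Sequence", ["beginn", "end"])


def match_sequences(states, values, min_distance=40):
    """Pair each open 'beginn' with the first 'end' at least min_distance away.

    Hypothetical helper: same rule as the loop above, but it walks two
    plain iterables instead of calling DataFrame.iterrows().
    """
    sequences = []
    beginn_value = None  # None means no 'beginn' is currently waiting
    for state, value in zip(states, values):
        if beginn_value is None:
            if state == "beginn":
                beginn_value = value
        elif state == "end" and value >= beginn_value + min_distance:
            sequences.append(Sequence(beginn_value, value))
            beginn_value = None
    return sequences


# Same data as df_test_2, copied into plain lists
states = ["end", "beginn", "end", "end", "beginn", "beginn",
          "end", "end", "end", "beginn", "end"]
values = [1, 10, 50, 60, 70, 80, 90, 100, 110, 111, 112]

result = match_sequences(states, values)
print(result)
```

With the DataFrame from the question, the call would be `match_sequences(df_test_2["A"].to_numpy(), df_test_2["B"].to_numpy())`. I have not timed this on 1.2 million rows, so treat any speed-up as something to measure rather than a guarantee.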
Upvotes: 2