Reputation: 2046
I'm trying to make a program that finds consecutive rows that meet some conditions. For example, if there's a dataframe that looks like this:
df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,14,15],
index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
columns=['value'])
>>> df
value
0 1
1 1
2 2
3 -13
4 -4
5 -5
6 6
7 17
8 8
9 9
10 -10
11 -11
12 -12
13 -13
14 -14
15 15
I want it to return a dataframe that shows rows that meet the conditions below:
1) the order has to be (positive rows)
and (negative rows)
, not the other way around.
2) each positive or negative group of rows has to have at least 3 rows
3) positive and negatives groups have to be adjacent to each other
posIdx, negIdx, posLength, negLength
0 2 3 3 3 # (1,1,2) (-13,-4,-5)
1 9 10 4 5 # (6,17,8,9) (-10,-11,-12,-13,-14)
Are there any simple ways to do this using python or pandas commands?
Upvotes: 4
Views: 7249
Reputation: 28644
This is just an alternative, and I did not benchmark this speed:
First, create a 'sign' column, indicating if a number is positive or negative.
Second, create a 'check' column as well, to indicate at what row, the change from positive to negative, or negative to positive occurred. If it is a -1, it implies a change from +ve to -ve; the reverse implies +1.
Next step, get the indices, where check is -1(neg_ids) and +1(pos_ids)
I use functions from more-itertools to intersperse the neg_ids and pos_ids. The aim is to get those chunks of rows that are altogether positive or negative.
Next phase is to run a for loop that uses the iloc function for each tuple created in the outcome variable, and find out if all the values in the 'value' column is positive or negative. Depending on the sign, we assign the results to keys in a 'K' dictionary. Note that posIdx will be the last row in that chunk (for wholly positive values), while for negIdx it will be the first row in the negative chunk. iloc does a start: end-1, so posIdx will be a end-1, while for negIdx, start does not need any addition or subtraction.
Last phase is to read the data into a dataframe
df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,-14,15],
index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
columns=['value'])
df['sign'] = np.where(df.value.lt(0),0,1)
df['check'] = df.sign.sub(df.sign.shift().fillna(0))
neg_ids = df.loc[df.check==-1].index.tolist()
pos_ids = df.loc[df.check==1].index.tolist()
from more_itertools import interleave_longest, windowed
outcome = list(interleave_longest(pos_ids,neg_ids))
outcome = list(windowed(outcome,2))
print(outcome)
[(0, 3), (3, 6), (6, 10), (10, 15)]
from collections import defaultdict
K = defaultdict(list)
for start, end in outcome:
checker = df.iloc[start:end,0]
if checker.ge(0).all() and checker.shape[0]>2:
K['posIdx'].append(end-1)
K['posLength'].append(checker.shape[0])
elif checker.lt(0).all() and checker.shape[0]>2:
K['negIdx'].append(start)
K['negLength'].append(checker.shape[0])
pd.DataFrame(K)
posIdx posLength negIdx negLength
0 2 3 3 3
1 9 4 10 5
Upvotes: 0
Reputation: 862641
I create helper columns for easy verify solution:
#column for negative and positive
df['sign'] = np.where(df['value'] < 0, 'neg','pos')
#consecutive groups
df['g'] = df['sign'].ne(df['sign'].shift()).cumsum()
#removed groups with length more like 2
df = df[df['g'].map(df['g'].value_counts()).gt(2)]
#tested if order `pos-neg` of groups, if not removed groups
m1 = df['sign'].eq('pos') & df['sign'].shift(-1).eq('neg')
m2 = df['sign'].eq('neg') & df['sign'].shift().eq('pos')
groups = df.loc[m1 | m2, 'g']
df = df[df['g'].isin(groups)].copy()
df['pairs'] = (df['sign'].ne(df['sign'].shift()) & df['sign'].eq('pos')).cumsum()
print (df)
value sign g pairs
0 1 pos 1 1
1 1 pos 1 1
2 2 pos 1 1
3 -13 neg 2 1
4 -4 neg 2 1
5 -5 neg 2 1
6 6 pos 3 2
7 17 pos 3 2
8 8 pos 3 2
9 9 pos 3 2
10 -10 neg 4 2
11 -11 neg 4 2
12 -12 neg 4 2
13 -13 neg 4 2
Last aggregate GroupBy.first
for all groups and counts by GroupBy.size
and named aggregation (pandas 0.25+), sorting columns and flatten MultiIndex, last correct Idx_pos
for subtract 1
:
df1 = (df.reset_index()
.groupby(['pairs','g', 'sign'])
.agg(Idx=('index','first'), Length=('sign','size'))
.reset_index(level=1, drop=True)
.unstack()
.sort_index(axis=1, level=[0,1], ascending=[True, False])
)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1['Idx_pos'] = df1['Idx_neg'] - 1
print (df1)
Idx_pos Idx_neg Length_pos Length_neg
pairs
1 2 3 3 3
2 9 10 4 4
Upvotes: 8