Reputation: 341
Suppose I have a pandas dataframe like this:
first second third
1 2 2 1
2 2 1 0
3 3 4 5
4 4 6 3
5 5 4 3
6 8 8 4
7 3 4 2
8 5 6 6
and could be created with the code:
dataframe = pd.DataFrame(
{
'first': [2, 2, 3, 4, 5, 8, 3, 5],
'second': [2, 1, 4, 6, 4, 8, 4, 6],
'third': [1, 0, 5, 3, 3, 4, 2, 6]
}
)
I want to select the rows in which the value of the second column is more than the value of the first column and at the same time the values in the third column are less than the values in the second column for k consecutive rows where the last row of these k consecutive rows is exactly before the row in which the value of the second column is more than the value of the first column, and k could be any integer between 2 and 4 (closed interval).
So, the output should be rows:
3, 7, 8
To get the above-mentioned result using conditional row selection in pandas, I know I should write a code like this:
dataframe[(dataframe['first'] < dataframe['second']) & (second_condition)].index
But I don't know what to write for the second_condition which I have explained above. Can anyone help me with this?
Upvotes: 2
Views: 502
Reputation: 93161
# First condition is easy
cond1 = df["second"] > df["first"]
# Since the second condition compare the second and third column, let's compute
# the result before hand for convenience
s = df["third"] < df["second"]
# Now we are gonna run a rolling window over `s`. What we want is that the
# previous `k` rows of `s` are all True.
# A rolling window always ends on the current row but you want it to end on the
# previous row. So we will increase the window size by 1 and exclude the last
# element from comparison.
all_true = lambda arr: arr[:-1].all()
cond2_with_k_equal_2 = s.rolling(3).apply(all_true, raw=True)
cond2_with_k_equal_3 = s.rolling(4).apply(all_true, raw=True)
cond2_with_k_equal_4 = s.rolling(5).apply(all_true, raw=True)
cond2 = cond2_with_k_equal_2 | cond2_with_k_equal_3 | cond2_with_k_equal_4
# Or you can consolidate them into a loop
cond2 = pd.Series(False, df.index)
for k in range(2,5):
cond2 |= s.rolling(k+1).apply(all_true, raw=True)
# Get the result
df[cond1 & cond2]
Upvotes: 1
Reputation: 1850
I will center my answer in the second part of your question. You need to use shift
function to compare. It allows you to shift rows.
Assuming your k
is fixed at 2, you should do something like this:
import pandas as pd
df = pd.DataFrame(
{
'first': [2, 2, 3, 4, 5, 8, 3, 5],
'second': [2, 1, 4, 6, 4, 8, 4, 6],
'third': [1, 0, 5, 3, 3, 4, 2, 6]
}
)
# this is the line
df[(df['third'] < df['second'].shift(1)) & (df['third'] < df['second'].shift(2))]
What's going on?
Start comparing 'third'
with previous value of 'second'
by shifting one row, and then shift it two places in a second condition.
Note this only works for fixed values of k
. What if k
is variable?
In such case, you need to write your condition dynamically. The following code assumes that condition must be met for all values of n
in [1,k]
k = 2 # pick any k > 1
df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index
first, we check using the shift
trick, which are the rows that meet your criteria for every value of n in [1, k]:
In [1]: [df['third'] < df['second'].shift(n) for n in range(1, k+1)]
out[1]:
[0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
dtype: bool,
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool]
then, we concatenate them to create a single dataframe, with a column for each of the k values.
In [2]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)])
Out[2]:
0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool
finally, we pick to use as index all rows that meets the criteria for any column (i.e. value of n). So: if it is true for any n, we will return it:
In [3]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)
Out[3]:
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
Then, all you need to do is to project over your original dataframe and pick up the index.
In [3]: df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index
Out[3]: Int64Index([0, 2], dtype='int64')
If the criteria must be met for all the values n in [1, k], then replace .any
with .all
.
Upvotes: 1
Reputation: 71689
The trick here is to calculate the rolling
sum
on a boolean mask to find out the number of values in k
previous rows where third
column is less than the second
column
k = 2
m1 = df['second'].gt(df['first'])
m2 = df['third'].lt(df['second']).shift(fill_value=0).rolling(k).sum().eq(k)
print(df[m1 & m2])
first second third
3 3 4 5
7 3 4 2
8 5 6 6
Upvotes: 3