Mahdi
Mahdi

Reputation: 341

Checking the length of a part of a dataframe in conditional row selection in pandas

Suppose I have a pandas dataframe like this:

    first   second  third
1     2       2       1
2     2       1       0
3     3       4       5
4     4       6       3
5     5       4       3
6     8       8       4
7     3       4       2
8     5       6       6

and could be created with the code:

dataframe = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5], 
        'second': [2, 1, 4, 6, 4, 8, 4, 6], 
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)

I want to select the rows in which the value of the second column is more than the value of the first column and at the same time the values in the third column are less than the values in the second column for k consecutive rows where the last row of these k consecutive rows is exactly before the row in which the value of the second column is more than the value of the first column, and k could be any integer between 2 and 4 (closed interval).

So, the output should be rows: 3, 7, 8

To get the above-mentioned result using conditional row selection in pandas, I know I should write a code like this:

dataframe[(dataframe['first'] < dataframe['second']) & (second_condition)].index

But I don't know what to write for the second_condition which I have explained above. Can anyone help me with this?

Upvotes: 2

Views: 502

Answers (3)

Code Different
Code Different

Reputation: 93161

# First condition is easy
cond1 = df["second"] > df["first"]

# Since the second condition compare the second and third column, let's compute
# the result before hand for convenience
s = df["third"] < df["second"]

# Now we are gonna run a rolling window over `s`. What we want is that the
# previous `k` rows of `s` are all True.
# A rolling window always ends on the current row but you want it to end on the
# previous row. So we will increase the window size by 1 and exclude the last
# element from comparison.
all_true = lambda arr: arr[:-1].all()

cond2_with_k_equal_2 = s.rolling(3).apply(all_true, raw=True)
cond2_with_k_equal_3 = s.rolling(4).apply(all_true, raw=True)
cond2_with_k_equal_4 = s.rolling(5).apply(all_true, raw=True)
cond2 = cond2_with_k_equal_2 | cond2_with_k_equal_3 | cond2_with_k_equal_4

# Or you can consolidate them into a loop
cond2 = pd.Series(False, df.index)
for k in range(2,5):
    cond2 |= s.rolling(k+1).apply(all_true, raw=True)

# Get the result
df[cond1 & cond2]

Upvotes: 1

Rodrigo Laguna
Rodrigo Laguna

Reputation: 1850

I will center my answer in the second part of your question. You need to use shift function to compare. It allows you to shift rows.

Assuming your k is fixed at 2, you should do something like this:

import pandas as pd

df = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5], 
        'second': [2, 1, 4, 6, 4, 8, 4, 6], 
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)

# this is the line
df[(df['third'] < df['second'].shift(1)) & (df['third'] < df['second'].shift(2))]

What's going on?

Start comparing 'third' with previous value of 'second' by shifting one row, and then shift it two places in a second condition.

Note this only works for fixed values of k. What if k is variable?

In such case, you need to write your condition dynamically. The following code assumes that condition must be met for all values of n in [1,k]

k = 2  # pick any k > 1

df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index

What's going on here?: Long answer

first, we check using the shift trick, which are the rows that meet your criteria for every value of n in [1, k]:

In [1]: [df['third'] < df['second'].shift(n) for n in range(1, k+1)]
out[1]: 
[0    False
 1     True
 2    False
 3     True
 4     True
 5    False
 6     True
 7    False
 dtype: bool,
 0    False
 1    False
 2    False
 3    False
 4     True
 5     True
 6     True
 7     True
 dtype: bool]

then, we concatenate them to create a single dataframe, with a column for each of the k values.

In [2]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)])
Out[2]: 
0    False
1     True
2    False
3     True
4     True
5    False
6     True
7    False
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
dtype: bool

finally, we pick to use as index all rows that meets the criteria for any column (i.e. value of n). So: if it is true for any n, we will return it:

In [3]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)
Out[3]: 
0    False
1     True
2    False
3     True
4     True
5     True
6     True
7     True
dtype: bool

Then, all you need to do is to project over your original dataframe and pick up the index.

In [3]: df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index
Out[3]: Int64Index([0, 2], dtype='int64')

Final note

If the criteria must be met for all the values n in [1, k], then replace .any with .all.

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71689

The trick here is to calculate the rolling sum on a boolean mask to find out the number of values in k previous rows where third column is less than the second column

k = 2
m1 = df['second'].gt(df['first'])
m2 = df['third'].lt(df['second']).shift(fill_value=0).rolling(k).sum().eq(k)

print(df[m1 & m2])

   first  second  third
3      3       4      5
7      3       4      2
8      5       6      6

Upvotes: 3

Related Questions