abhi

Reputation: 377

filter and get rows between the conditions in a dataframe

My DataFrame looks something like this:


+----------------------------------+---------+
|               Col1               |  Col2   |
+----------------------------------+---------+
| Start A                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End A                            |    6    |
| value 6                          |    3    |
| value 7                          |    4    |
| value 8                          |    5    |
| Start B                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End B                            |    6    |
| value 6                          |    3    |
| value 7                          |    4    |
| value 8                          |    5    |
| Start C                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End C                            |    6    |
+----------------------------------+---------+

What I am trying to achieve: wherever a row containing the substring "Start" is followed by a row containing the substring "End", I want those two rows and all the rows between them. The expected result is:


+----------------------------------+---------+
|               Col1               |  Col2   |
+----------------------------------+---------+
| Start A                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End A                            |    6    |
| Start B                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End B                            |    6    |
| Start C                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End C                            |    6    |
+----------------------------------+---------+

I tried the code from the question "How to filter dataframe columns between two rows that contain specific string in column?":

m = df['Col1'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]

But this returns only the first Start/End pair, and it requires the exact strings rather than a substring match.

output:

+----------------------------------+---------+
|               Col1               |  Col2   |
+----------------------------------+---------+
| Start A                          |    1    |
| value 1                          |    2    |
| value 2                          |    3    |
| value 3                          |    4    |
| value 5                          |    5    |
| End A                            |    6    |
+----------------------------------+---------+

Upvotes: 3

Views: 42

Answers (3)

mozway

Reputation: 260745

The answer you linked to was designed to work with a single pair of Start/End.

A more generic variant checks the parity of the running count of Start/End markers (assuming Start and End strictly alternate):

m1 = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)

# boolean indexing
out = df[m1|m1.shift()]
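
As a self-contained sketch of the parity idea, the snippet below rebuilds the sample frame from the question and applies the same mask. The names `block`, `gap`, `rows`, `is_marker` and `inside` are only illustrative, and `shift(fill_value=False)` is my own substitution so that the shifted mask stays boolean instead of picking up a NaN in the first row:

```python
import pandas as pd

# Rebuild the sample frame from the question (24 rows, default RangeIndex).
block = [("Start {}", 1), ("value 1", 2), ("value 2", 3),
         ("value 3", 4), ("value 5", 5), ("End {}", 6)]
gap = [("value 6", 3), ("value 7", 4), ("value 8", 5)]
rows = []
for letter in "ABC":
    rows += [(name.format(letter), val) for name, val in block]
    if letter != "C":
        rows += gap
df = pd.DataFrame(rows, columns=["Col1", "Col2"])

# True on every Start/End marker row.
is_marker = df["Col1"].str.match(r"Start|End")

# Odd running count: a Start has been seen but not yet its End.
inside = is_marker.cumsum().mod(2).eq(1)

# OR with the shifted mask so that each End row is kept as well.
out = df[inside | inside.shift(fill_value=False)]
print(out.index.tolist())
```

This still relies on the strict Start/End alternation mentioned above; a stray marker would flip the parity for every row after it.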

Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. Unlike the nice answer of @Quang, however, this does not check that the A/B/C letter after Start matches the one after End:

# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)

# set flags and ffill
m1 = s.replace({'Start': True, 'End': False}).ffill()

# boolean slicing
out = df[m1|m1.shift()]

Output:

       Col1  Col2
0   Start A     1
1   value 1     2
2   value 2     3
3   value 3     4
4   value 5     5
5     End A     6
9   Start B     1
10  value 1     2
11  value 2     3
12  value 3     4
13  value 5     5
14    End B     6
18  Start C     1
19  value 1     2
20  value 2     3
21  value 3     4
22  value 5     5
23    End C     6
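
The flag variant can also be sketched without mapping the markers to booleans first, which avoids forward-filling a mixed object column. This is an assumption-laden rewrite rather than the exact code above: it forward-fills the extracted marker strings and then compares against "Start". The names `flags` and `inside` are illustrative:

```python
import pandas as pd

# Rebuild the sample frame from the question (24 rows, default RangeIndex).
block = [("Start {}", 1), ("value 1", 2), ("value 2", 3),
         ("value 3", 4), ("value 5", 5), ("End {}", 6)]
gap = [("value 6", 3), ("value 7", 4), ("value 8", 5)]
rows = []
for letter in "ABC":
    rows += [(name.format(letter), val) for name, val in block]
    if letter != "C":
        rows += gap
df = pd.DataFrame(rows, columns=["Col1", "Col2"])

# "Start" or "End" on marker rows, NaN everywhere else.
flags = df["Col1"].str.extract(r"^(Start|End)", expand=False)

# Carry the last seen marker forward; a row is inside when that marker is "Start".
inside = flags.ffill().eq("Start")

# OR with the shifted mask so that each End row is kept as well.
out = df[inside | inside.shift(fill_value=False)]
print(out.index.tolist())
```

Rows before the first Start have no marker to carry forward, so the comparison is False there and they are dropped.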

Upvotes: 3

sammywemmy

Reputation: 28644

One option is with an interval index:

Get the positions of the starts and ends:

starts = df.Col1.str.startswith("Start").to_numpy().nonzero()[0]
ends = df.Col1.str.startswith("End").to_numpy().nonzero()[0]

Build an interval index, and get matches where the index lies between Start and End:

intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
intervals = intervals.get_indexer(df.index)

Filter the original dataframe with the intervals, where intervals are not less than 0:

df.loc[intervals >= 0]
       Col1  Col2
0   Start A     1
1   value 1     2
2   value 2     3
3   value 3     4
4   value 5     5
5     End A     6
9   Start B     1
10  value 1     2
11  value 2     3
12  value 3     4
13  value 5     5
14    End B     6
18  Start C     1
19  value 1     2
20  value 2     3
21  value 3     4
22  value 5     5
23    End C     6
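
Put together as one runnable sketch, with two additions of my own that are not part of the steps above: a sanity check that every Start has a later End, and a lookup by row position (`np.arange(len(df))`) instead of `df.index`, so the filter should not depend on the frame having a default index. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (24 rows, default RangeIndex).
block = [("Start {}", 1), ("value 1", 2), ("value 2", 3),
         ("value 3", 4), ("value 5", 5), ("End {}", 6)]
gap = [("value 6", 3), ("value 7", 4), ("value 8", 5)]
rows = []
for letter in "ABC":
    rows += [(name.format(letter), val) for name, val in block]
    if letter != "C":
        rows += gap
df = pd.DataFrame(rows, columns=["Col1", "Col2"])

# Row positions of the Start and End markers.
starts = df["Col1"].str.startswith("Start").to_numpy().nonzero()[0]
ends = df["Col1"].str.startswith("End").to_numpy().nonzero()[0]

# Assumed precondition: markers come in matched Start-then-End pairs.
if len(starts) != len(ends) or (starts > ends).any():
    raise ValueError("Start/End markers are not properly paired")

# One closed interval per pair; get_indexer returns -1 for positions outside all of them.
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed="both")
hit = intervals.get_indexer(np.arange(len(df)))

out = df[hit >= 0]
print(out.index.tolist())
```

Note that `get_indexer` requires the intervals not to overlap, which holds as long as the Start/End pairs are not nested.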

Upvotes: 2

Quang Hoang

Reputation: 150745

Let's try:

# extract the label after `Start`/`End`
# (`(?:Start|End)` is an alternation; `[Start|End]` would be a character class)
groups = df['Col1'].str.extract(r'^(?:Start|End) (.*)', expand=False)

# keep rows with equal forward fill and backward fill 
df[groups.bfill() == groups.ffill()]

Output:

       Col1  Col2
0   Start A     1
1   value 1     2
2   value 2     3
3   value 3     4
4   value 5     5
5     End A     6
9   Start B     1
10  value 1     2
11  value 2     3
12  value 3     4
13  value 5     5
14    End B     6
18  Start C     1
19  value 1     2
20  value 2     3
21  value 3     4
22  value 5     5
23    End C     6
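
For anyone who wants to run this end to end, here is a minimal sketch on the sample frame from the question. It assumes the anchored pattern `^(?:Start|End) (.*)` for the label extraction; the names `groups` and `mask` are illustrative:

```python
import pandas as pd

# Rebuild the sample frame from the question (24 rows, default RangeIndex).
block = [("Start {}", 1), ("value 1", 2), ("value 2", 3),
         ("value 3", 4), ("value 5", 5), ("End {}", 6)]
gap = [("value 6", 3), ("value 7", 4), ("value 8", 5)]
rows = []
for letter in "ABC":
    rows += [(name.format(letter), val) for name, val in block]
    if letter != "C":
        rows += gap
df = pd.DataFrame(rows, columns=["Col1", "Col2"])

# Label (A/B/C) on marker rows, NaN on all other rows.
groups = df["Col1"].str.extract(r"^(?:Start|End) (.*)", expand=False)

# Inside a pair the next label (bfill) equals the previous label (ffill);
# between "End A" and "Start B" they differ, so those rows are dropped.
mask = groups.bfill() == groups.ffill()

out = df[mask]
print(out.index.tolist())
```

Because the comparison uses the label itself, a Start is only paired with an End that carries the same letter.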

Upvotes: 3
