Reputation: 377
My DataFrame looks something like this:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
What I am trying to acheive is if substring start and end is present I want the rows between them. Expected Result is:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
I tried the code from this How to filter dataframe columns between two rows that contain specific string in column?
m = df['To'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]
But this only returns the first set of start and end, also it expects the exact string.
output:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
+----------------------------------+---------+
Upvotes: 3
Views: 42
Reputation: 260745
The answer you linked to was designed to work with a single pair of Start/End.
A more generic variant of it would be to check for the parity of the group (assuming strictly alternating Start/End):
m1 = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)
# boolean indexing
out = df[m1|m1.shift()]
Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. This wouldn't however consider the A/B/C letter after the Start/End like the nice answer of @Quang does:
# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)
# set flags and ffill
m1 = s.replace({'Start': True, 'End': False}).ffill()
# boolean slicing
out = df[m1|m1.shift()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
Upvotes: 3
Reputation: 28644
One option is with an interval index:
Get the positions of the starts and ends:
starts = df.Col1.str.startswith("Start").to_numpy().nonzero()[0]
ends = df.Col1.str.startswith("End").to_numpy().nonzero()[0]
Build an interval index, and get matches where the index lies between Start
and End
:
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
intervals = intervals.get_indexer(df.index)
Filter the original dataframe with the intervals, where intervals are not less than 0:
df.loc[intervals >= 0]
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
Upvotes: 2
Reputation: 150745
Let's try:
# extract the label after `Start/End`
groups = df['Col1'].str.extract('[Start|End] (.*)', expand=False)
# keep rows with equal forward fill and backward fill
df[groups.bfill() == groups.ffill()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
Upvotes: 3