pandas How do I delete rows from an ordered dataframe until a specific value appears?

Question

How do I delete rows in Python/ pandas for each column A until the first C value appears in column B?

input

index   | A   | B   |
--------|-----|-----|
0       | X   | A   |  
1       | X   | B   |
2       | X   | C   |  
3       | X   | A   |
4       | X   | C   |
5       | X   | ... |  
6       | Y   | A   |  
7       | Y   | B   |
8       | Y   | C   |  
9       | Y   | A   |
10      | Y   | C   |
11      | Y   | ... |

desired output

index   | A   | B   |
--------|-----|-----|
2       | X   | C   |  
3       | X   | A   |
4       | X   | C   |
5       | X   | ... |  
8       | Y   | C   |  
9       | Y   | A   |
10      | Y   | C   |
11      | Y   | ... |

DSM · Accepted Answer

Assuming that you either don't have a second X cluster after the Y cluster (or that you consider all X clusters the same, even if they're discontiguous):

One way would be to compare the column B with C (getting a boolean series of False and True), then groupby the A column. The cumulative maximum of these will be like False, False, True, True, True where we start getting Trues the first time we see C, and we can use that to index into df. For example:

In [19]: df.loc[(df["B"] == "C").groupby(df["A"]).cummax()]
Out[19]: 
    A    B
2   X    C
3   X    A
4   X    C
5   X  ...
8   Y    C
9   Y    A
10  Y    C
11  Y  ...

because

In [20]: df["B"] == "C"
Out[20]: 
0     False
1     False
2      True
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10     True
11    False
Name: B, dtype: bool

and

In [21]: (df["B"] == "C").groupby(df["A"]).cummax()
Out[21]: 
0     False
1     False
2      True
3      True
4      True
5      True
6     False
7     False
8      True
9      True
10     True
11     True
Name: B, dtype: bool

pandas How do I delete rows from an ordered dataframe until a specific value appears?

Answers (1)

Related Questions