Reputation: 109
How do I delete rows in Python/ pandas for each column A until the first C value appears in column B?
input
index | A | B |
--------|-----|-----|
0 | X | A |
1 | X | B |
2 | X | C |
3 | X | A |
4 | X | C |
5 | X | ... |
6 | Y | A |
7 | Y | B |
8 | Y | C |
9 | Y | A |
10 | Y | C |
11 | Y | ... |
desired output
index | A | B |
--------|-----|-----|
2 | X | C |
3 | X | A |
4 | X | C |
5 | X | ... |
8 | Y | C |
9 | Y | A |
10 | Y | C |
11 | Y | ... |
Upvotes: 0
Views: 83
Reputation: 353059
Assuming that you either don't have a second X cluster after the Y cluster (or that you consider all X clusters the same, even if they're discontiguous):
One way would be to compare the column B with C (getting a boolean series of False and True), then groupby the A column. The cumulative maximum of these will be like False, False, True, True, True where we start getting Trues the first time we see C, and we can use that to index into df. For example:
In [19]: df.loc[(df["B"] == "C").groupby(df["A"]).cummax()]
Out[19]:
A B
2 X C
3 X A
4 X C
5 X ...
8 Y C
9 Y A
10 Y C
11 Y ...
because
In [20]: df["B"] == "C"
Out[20]:
0 False
1 False
2 True
3 False
4 True
5 False
6 False
7 False
8 True
9 False
10 True
11 False
Name: B, dtype: bool
and
In [21]: (df["B"] == "C").groupby(df["A"]).cummax()
Out[21]:
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
9 True
10 True
11 True
Name: B, dtype: bool
Upvotes: 3