Reputation: 735
I have a Pandas dataframe as below. What I am trying to do is check whether a station has variable yyy and any other variable on the same day (as in the case of station1). If so, I need to delete the whole row containing yyy.
Currently I am doing this using iterrows(): I loop to find the days on which this variable appears, change the variable to something like "delete me", build a new dataframe from the result (because I couldn't get an in-place replacement to work), and filter the new dataframe to get rid of the unwanted rows. This works for now because my dataframes are small, but it is not likely to scale.
Question: This seems like a very "non-Pandas" way to do this. Is there a better method of deleting the unwanted rows?
               dateuse   station  variable1
0  2012-08-12 00:00:00  station1        xxx
1  2012-08-12 00:00:00  station1        yyy
2  2012-08-23 00:00:00  station2        aaa
3  2012-08-23 00:00:00  station3        bbb
4  2012-08-25 00:00:00  station4        ccc
5  2012-08-25 00:00:00  station4        ccc
6  2012-08-25 00:00:00  station4        ccc
Upvotes: 3
Views: 3382
Reputation: 352999
I might index using a boolean array. We want to delete rows (if I understand what you're after, anyway!) which have yyy and belong to a dateuse/station combination that contains more than one row.
We can use transform to broadcast the size of each dateuse/station group up to the length of the dataframe, and then flag the rows in groups which have length > 1. Then we can & this with a mask of where the yyy values are, and keep everything that is not flagged by both.
>>> multiple = df.groupby(["dateuse", "station"])["variable1"].transform(len) > 1
>>> must_be_isolated = df["variable1"] == "yyy"
>>> df[~(multiple & must_be_isolated)]
               dateuse   station  variable1
0  2012-08-12 00:00:00  station1        xxx
2  2012-08-23 00:00:00  station2        aaa
3  2012-08-23 00:00:00  station3        bbb
4  2012-08-25 00:00:00  station4        ccc
5  2012-08-25 00:00:00  station4        ccc
6  2012-08-25 00:00:00  station4        ccc
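For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The dataframe is rebuilt by hand from the sample in the question, and the names `groups`, `mixed`, `is_yyy`, `result` and `result_strict` are mine, not part of the original session. The `nunique` variant is an assumption about the intent: it only drops yyy when the group really contains a different variable, whereas the size-based mask would also drop a yyy that is merely repeated on the same day.

```python
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame(
    {
        "dateuse": pd.to_datetime(
            ["2012-08-12", "2012-08-12", "2012-08-23", "2012-08-23",
             "2012-08-25", "2012-08-25", "2012-08-25"]
        ),
        "station": ["station1", "station1", "station2", "station3",
                    "station4", "station4", "station4"],
        "variable1": ["xxx", "yyy", "aaa", "bbb", "ccc", "ccc", "ccc"],
    }
)

groups = df.groupby(["dateuse", "station"])["variable1"]

# True for rows whose (dateuse, station) group holds more than one row.
multiple = groups.transform("size") > 1

# Stricter variant (an assumption about the intent): the group must hold
# more than one distinct variable, so a repeated lone "yyy" would be kept.
mixed = groups.transform("nunique") > 1

# True for the rows that hold the variable we may want to drop.
is_yyy = df["variable1"] == "yyy"

# Keep every row that is not flagged by both masks.
result = df[~(multiple & is_yyy)]
result_strict = df[~(mixed & is_yyy)]

print(result)
```

On this sample both variants drop only row 1. They differ only when a group contains nothing but repeated yyy rows.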
Upvotes: 4