Reputation: 661
I have a large time series data frame (called df), and the first 5 records look like this:
df
stn years_of_data total_minutes avg_daily TOA_daily K_daily
date
1900-01-14 AlberniElementary 4 5745 34.100 114.600 0.298
1900-01-14 AlberniWeather 6 7129 29.500 114.600 0.257
1900-01-14 Arbutus 8 11174 30.500 114.600 0.266
1900-01-14 Arrowview 7 10080 27.600 114.600 0.241
1900-01-14 Bayside 7 9745 33.800 114.600 0.295
Goal:
I am trying to remove rows where any of the strings in a list are present in the 'stn' column. So,I am basically trying to filter this dataset to not include rows containing any of the strings in following list.
Attempt:
remove_list = ['Arbutus','Bayside']
cleaned = df[df['stn'].str.contains('remove_list')]
Returns:
Out[78]:
stn years_of_data total_minutes avg_daily TOA_daily K_daily
date
Nothing!
I have tried a few combinations of quotes, brackets, and even a lambda function; though I am fairly new, so probably not using syntax properly..
Upvotes: 54
Views: 70969
Reputation: 582
I just want to add my 2 cents to this very important use case (of filtering out a list of items, indexed by string values). The argument of .isin()
method, doesn't need to be a list! It can be a pd.Series! Then you can do things like this:
df[~df['stn'].isin(another_df['stn_to_remove_column_there'])]
See what I mean? You can use this construct without a .to_list()
method.
Upvotes: 5
Reputation: 485
Had a similar question, found this old thread, I think there are other ways to get the same result. My issue with @EdChum's solution for my particular application is that I don't have a list that will be matched exactly. If you have the same issue, .isin
isn't meant for that application.
Instead, you can also try a few options, including a numpy.where:
removelist = ['ayside','rrowview']
df['flagCol'] = numpy.where(df.stn.str.contains('|'.join(remove_list)),1,0)
Note that this solution doesn't actually remove the matching rows, just flags them. You can copy/slice/drop as you like.
This solution would be useful in the case that you don't know, for example, if the station names are capitalized or not and don't want to go through standardizing text beforehand. numpy.where
is usually pretty fast as well, probably not much different from .isin
.
Upvotes: 24
Reputation: 394051
Use isin:
cleaned = df[~df['stn'].isin(remove_list)]
In [7]:
remove_list = ['Arbutus','Bayside']
df[~df['stn'].isin(remove_list)]
Out[7]:
stn years_of_data total_minutes avg_daily \
date
1900-01-14 AlberniElementary 4 5745 34.1
1900-01-14 AlberniWeather 6 7129 29.5
1900-01-14 Arrowview 7 10080 27.6
TOA_daily K_daily
date
1900-01-14 114.6 0.298
1900-01-14 114.6 0.257
1900-01-14 114.6 0.241
Upvotes: 88