Reputation: 47
How can I improve my code so that it searches a specific column of a dataframe using a list of keywords and returns the rows that contain any of them? The current code only accepts two keywords!
contain_values = df[df['tweet'].str.contains('free','news')]
contain_values.head()
Upvotes: 0
Views: 852
Reputation: 11395
Your code currently only returns tweets that contain 'free' and ignores 'news', because the second positional argument of .str.contains() is case, not a second pattern. Let’s test it:
>>> df
tweet
0 free stuff
1 newsnewsnews
2 hello world
3 another tweet
>>> df[df['tweet'].str.contains('free', 'news')]
tweet
0 free stuff
See the documentation for .str.contains(): you can either pass a word, or a regular expression. This will work:
df[df['tweet'].str.contains('free|news|hello')]
Here I’ve added a 3rd keyword, and now the first 3 elements of my dataframe are returned:
tweet
0 free stuff
1 newsnewsnews
2 hello world
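If the keywords come from a list rather than a hand-written pattern, the same alternation can be built programmatically. This is a minimal sketch, assuming the sample dataframe shown above. The `keywords` list is hypothetical, and wrapping each keyword in `re.escape` (so that characters such as `+` or `.` are matched literally) is my own addition rather than part of the original answer:

```python
import re

import pandas as pd

# Sample data mirroring the dataframe used in this answer.
df = pd.DataFrame(
    {'tweet': ['free stuff', 'newsnewsnews', 'hello world', 'another tweet']}
)

# Hypothetical keyword list; escape each entry so it is treated as literal text.
keywords = ['free', 'news', 'hello']
pattern = '|'.join(re.escape(k) for k in keywords)

# Keep the rows whose tweet contains at least one keyword.
matches = df[df['tweet'].str.contains(pattern)]
print(matches)
```

Note that this is substring matching, which is why 'newsnewsnews' counts as a hit for 'news'.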
Upvotes: 0
Reputation: 5331
Series.str.contains takes a regular expression, per the documentation. Either construct a regular expression from your values, or use a for-loop to check the keywords one by one and then aggregate the results.
Thus (for the regular expression):
regex = '|'.join(['free', 'news'])
df['tweet'].str.contains(regex, case=False, na=False)
Note that you cannot pass a list directly to Series.str.contains; it will raise an error. You probably also want to pass case=False and na=False, which make the match case-insensitive and return False for rows where the tweet column is NaN (for example, a retweet with no comment).
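For completeness, the for-loop alternative mentioned above might look roughly like the sketch below. The sample data and the `keywords` list are made up for illustration, and passing `regex=False` (plain substring matching, so no escaping is needed) is one possible choice, not something the original answer prescribes:

```python
import pandas as pd

# Illustrative data, including a missing tweet to show the effect of na=False.
df = pd.DataFrame(
    {'tweet': ['Free stuff', 'newsnewsnews', None, 'another tweet']}
)
keywords = ['free', 'news']  # hypothetical keyword list

# Start with an all-False mask and OR in the result for each keyword.
mask = pd.Series(False, index=df.index)
for keyword in keywords:
    mask = mask | df['tweet'].str.contains(
        keyword, case=False, na=False, regex=False
    )

# Rows matching at least one keyword.
result = df[mask]
print(result)
```

The regular-expression version does the same work in a single pass over the column; the loop is mainly useful when you also want to know which keyword matched each row.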
Upvotes: 1