Reputation: 137
I am working on Twitter data and trying to find strings that contain several words at once. The following lines work for a single word and for the OR condition.
tweets_text[tweets_text.str.contains("break")] #Find strings with the word break
tweets_text[tweets_text.str.contains("break|social|media")] #Find strings with either break or social, or media
How can I find the strings that contain all three words ("break" AND "social" AND "media")?
Upvotes: 2
Views: 3447
Reputation: 138
You can also pass additional parameters, such as flags, to ignore the difference between uppercase and lowercase letters. Building on @Rutger's code (check the documentation for the other parameters):
import re
tweets_text.loc[tweets_text.str.contains("break", flags=re.IGNORECASE) & tweets_text.str.contains("social", flags=re.IGNORECASE) & tweets_text.str.contains("media", flags=re.IGNORECASE)]
In addition, you can do the same thing by combining a lambda function with all, as follows:
def find_words(data, column_name, list_of_words):
    # Keep the rows whose text in column_name contains every word, ignoring case
    contains_all = lambda row: all(word.lower() in row.lower() for word in list_of_words)
    return data.loc[data[column_name].apply(contains_all)]
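For a plain Series like the tweets_text in the question, the same idea could look roughly like the sketch below. The helper name contains_all_words and the sample data are made up for illustration, and it does a case-insensitive substring match, not a whole-word match.

```python
import pandas as pd

def contains_all_words(texts: pd.Series, words) -> pd.Series:
    """Return the entries of `texts` that contain every word (case-insensitive substring match)."""
    lowered = [w.lower() for w in words]
    # all() is True only when every word occurs somewhere in the lowercased text
    mask = texts.apply(lambda text: all(w in text.lower() for w in lowered))
    return texts[mask]

# Hypothetical sample data, not the real tweets
tweets_text = pd.Series(["Break", "break MEDIA social", "break media"])
result = contains_all_words(tweets_text, ["break", "social", "media"])
print(result)
```

Note that apply will fail on missing values (NaN), so you may need a dropna() or fillna("") first if your data has gaps.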
Upvotes: 0
Reputation: 1317
import pandas as pd
tweets_text = pd.Series(['break', 'break media social', 'break media'])
Series:
0 break
1 break media social
2 break media
Extraction:
tweets_text[tweets_text.str.contains('(?=.*break)(?=.*social)(?=.*media)')]
output:
1 break media social
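If the list of words changes often, the lookahead pattern can be built from a list instead of typed by hand. This is only a sketch: the words list and sample Series are assumptions, re.escape is there in case a word contains regex metacharacters, and the IGNORECASE and DOTALL flags are optional extras (DOTALL lets the .* span line breaks inside a tweet).

```python
import re
import pandas as pd

words = ["break", "social", "media"]  # assumed search terms

# One lookahead per word: each one must succeed from the start of the string
pattern = "".join(f"(?=.*{re.escape(w)})" for w in words)

# Hypothetical sample data
tweets_text = pd.Series(["break", "break media social", "break media"])

matches = tweets_text[
    tweets_text.str.contains(pattern, flags=re.IGNORECASE | re.DOTALL)
]
print(matches)
```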
Upvotes: 3
Reputation: 603
You can split the check into one condition per word and combine them with &:
tweets_text.loc[tweets_text.str.contains("break") & tweets_text.str.contains("social") & tweets_text.str.contains("media")]
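With more than a handful of words, chaining & by hand gets long. One way to generalize it (a sketch, with an assumed words list and sample data) is to build one boolean mask per word and AND them together with functools.reduce. Here regex=False treats each word as a literal string, case=False ignores case, and na=False counts missing tweets as non-matches.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical sample data and search terms
tweets_text = pd.Series(["break", "Break media social", "break media", None])
words = ["break", "social", "media"]

# One boolean mask per word
masks = [
    tweets_text.str.contains(w, case=False, regex=False, na=False)
    for w in words
]

# Element-wise AND of all masks: True only where every word was found
mask = reduce(operator.and_, masks)
result = tweets_text[mask]
print(result)
```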
Upvotes: 1