Reputation: 1699
I have a DataFrame with over 2 million rows. I'm trying to find the rows that contain both of two words, using a regex like:
df1 = df[df['my_column'].str.contains(r'(?=.*first_word)(?=.*second_word)')]
However, when I run this in a Jupyter notebook, it either takes over a minute to return the rows or it crashes the kernel and I have to try again.
Is there a more efficient way for me to return rows in a dataframe that contain both words?
Upvotes: 1
Views: 289
Reputation: 18631
Use
df['my_column'].apply(lambda x: all(l in x for l in ['first_word', 'second_word']) )
This builds a boolean mask that is True for every row where all the words in the list occur in my_column, without an awkward regex. Pass the mask to df[...] to get the matching rows.
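As a rough sketch of how this could be used end to end, the snippet below applies the mask to a small made-up frame (the sample rows are assumptions, not data from the question) and guards against non-string values such as NaN, which would otherwise raise a TypeError inside the lambda. It also shows an alternative that may be worth timing on the real data: one literal `str.contains(..., regex=False)` test per word, combined with `&`. Both do plain substring matching, as the original lookahead regex does.

```python
import pandas as pd

# Toy data standing in for the 2-million-row frame (illustrative only);
# the column name 'my_column' comes from the question.
df = pd.DataFrame({
    "my_column": [
        "first_word then second_word",
        "only first_word here",
        "second_word before first_word",
        None,  # missing value, to show the guard below
    ]
})

words = ["first_word", "second_word"]

# apply-based mask from the answer, with an isinstance guard so that
# missing values yield False instead of raising a TypeError
mask_apply = df["my_column"].apply(
    lambda x: isinstance(x, str) and all(w in x for w in words)
)

# Alternative: one literal (non-regex) substring test per word, AND-ed together
mask_contains = pd.Series(True, index=df.index)
for w in words:
    mask_contains &= df["my_column"].str.contains(w, regex=False, na=False)

# Rows that contain both words
df1 = df[mask_apply]
```

Which variant is faster likely depends on the data, so it is worth benchmarking both on a sample of the real column before committing to one.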
Upvotes: 1