Finding multiple exact string matches in a dataframe column using PANDAS

Question

I have a million-entry dataset that contains observations typed by humans to indicate certain 'operational' outcomes. To create some categories, I need to look at this column and extract certain EXACT expressions that are most commonly used. They can appear at the start, end or middle of the string, and may or may not be abbreviated.

I have constructed the following example:

import pandas as pd

data = {'file': ['1','2','3','4','5','6'],
        'observations': ['text one address', 'text 2 some',
                         'text home 3', 'notified text 4',
                         'text 5 add','text 6 homer']}

df = pd.DataFrame(data=data)

I am trying to use pandas to isolate and extract, say, 'home', 'not' and 'address'. I have tried the following approach (the '|'.join idea is taken from another answer on this site):

conditions = ['home','not','address']
test = df[df['observations'].str.contains('|'.join(conditions))]

str.contains won't work because it picks up row 6, 'text 6 homer', as it contains 'home' (the real case is even worse, because the abbreviations include things like 'ho').
str.match won't work because it only matches at the start of the string, so it picks up 'notified' and misses everything else.
str.fullmatch won't work because the whole string has to match the pattern, and these are long sentences.
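For reference, the three behaviours described above can be reproduced on the example frame. This is only a sketch: the names pattern, obs, contains_hits, match_hits and fullmatch_hits are illustrative, and str.fullmatch assumes a pandas version that provides it (1.1 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['1', '2', '3', '4', '5', '6'],
    'observations': ['text one address', 'text 2 some',
                     'text home 3', 'notified text 4',
                     'text 5 add', 'text 6 homer'],
})
conditions = ['home', 'not', 'address']
pattern = '|'.join(conditions)  # plain alternation, no word boundaries

obs = df['observations']

# Substring search anywhere in the string: also hits 'homer' and 'notified'
contains_hits = obs[obs.str.contains(pattern)].tolist()

# Anchored at the start of the string only: hits 'notified text 4'
match_hits = obs[obs.str.match(pattern)].tolist()

# The entire string must equal one of the terms: hits nothing here
fullmatch_hits = obs[obs.str.fullmatch(pattern)].tolist()

print(contains_hits)
print(match_hits)
print(fullmatch_hits)
```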

Help appreciated...

Corralien · Accepted Answer

Is this what you expect?

>>> df[df['observations'].str.contains(fr"\b(?:{'|'.join(conditions)})\b")]

  file      observations
0    1  text one address
2    3       text home 3

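\b asserts a position at a word boundary: (^\w|\w$|\W\w|\w\W)

(?:...) is a non-capturing group. It groups the alternatives so that both \b anchors apply to every term, and it avoids the match-groups warning that str.contains raises for capturing groups.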
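If the goal is also to extract which term matched, and not only to filter rows, the same pattern can be passed to str.findall. The sketch below is an extension and not part of the original answer: it escapes each term with re.escape in case an abbreviation contains regex metacharacters, and the names pattern, mask and matched are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'file': ['1', '2', '3', '4', '5', '6'],
    'observations': ['text one address', 'text 2 some',
                     'text home 3', 'notified text 4',
                     'text 5 add', 'text 6 homer'],
})
conditions = ['home', 'not', 'address']

# Escape each term, then wrap the alternation in word boundaries
pattern = r"\b(?:" + "|".join(map(re.escape, conditions)) + r")\b"

# Boolean mask: True where at least one whole-word term occurs
mask = df['observations'].str.contains(pattern)

# List of the whole-word terms found in each row (empty list if none)
matched = df['observations'].str.findall(pattern)

print(df[mask])
print(matched.tolist())
```

Note that \b relies on word characters at the edges of each term, so an abbreviation that ends in punctuation (a trailing '.', for example) would need a different boundary condition.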
