Reputation: 180
I have a million-entry dataset that contains observations typed by humans to indicate certain 'operational' outcomes. To create some categories, I need to look at this column and extract certain EXACT expressions that are most commonly used. They can appear at the start, end, or middle of the string, and may or may not be abbreviated.
I have constructed the following example:
import pandas as pd

data = {'file': ['1', '2', '3', '4', '5', '6'],
        'observations': ['text one address', 'text 2 some',
                         'text home 3', 'notified text 4',
                         'text 5 add', 'text 6 homer']}
df = pd.DataFrame(data=data)
I am trying to use pandas to see if I can isolate and extract, say, 'home', 'not' and 'address'.
I have tried the following approach (the '|'.join idea is taken from another answer on this site):
conditions = ['home','not','address']
test = df[df['observations'].str.contains('|'.join(conditions))]
str.contains won't work because it picks up file 6, 'text 6 homer', as it contains 'home' (in the real case it's even worse, because with abbreviations there are terms like 'ho', for example).
str.match won't work because it will pick up 'notified'.
str.fullmatch won't work because it can only look for exact strings, and these are long sentences...
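The three behaviours described above can be reproduced with a minimal sketch on the example data. The variable names `contains_hits`, `match_hits` and `fullmatch_hits` are made up for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['1', '2', '3', '4', '5', '6'],
    'observations': ['text one address', 'text 2 some',
                     'text home 3', 'notified text 4',
                     'text 5 add', 'text 6 homer'],
})
conditions = ['home', 'not', 'address']
pattern = '|'.join(conditions)  # 'home|not|address'
obs = df['observations']

# contains: matches the pattern anywhere, including inside longer words
contains_hits = df.loc[obs.str.contains(pattern), 'file'].tolist()

# match: anchored at the start of the string only
match_hits = df.loc[obs.str.match(pattern), 'file'].tolist()

# fullmatch: the whole string must equal one of the terms
fullmatch_hits = df.loc[obs.str.fullmatch(pattern), 'file'].tolist()

print(contains_hits, match_hits, fullmatch_hits)
```

On this data, str.contains also returns files 4 and 6, str.match returns only file 4, and str.fullmatch returns nothing.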
Help appreciated...
Upvotes: 0
Views: 1954
Reputation: 120409
Is this what you expect?
>>> df[df['observations'].str.contains(fr"\b(?:{'|'.join(conditions)})\b")]
  file      observations
0    1  text one address
2    3       text home 3
\b: assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
(?:...): non-capturing group
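If the goal is also to see which expression matched in each row, one possible extension is to reuse the same word-boundary pattern with str.findall. This is only a sketch: the re.escape call is an added precaution (an assumption, not part of the pattern above) in case a term contains regex metacharacters, and the column name `matched` is hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'file': ['1', '2', '3', '4', '5', '6'],
    'observations': ['text one address', 'text 2 some',
                     'text home 3', 'notified text 4',
                     'text 5 add', 'text 6 homer'],
})
conditions = ['home', 'not', 'address']

# Escape each term so characters such as '.' or '+' are matched literally.
alternation = '|'.join(re.escape(term) for term in conditions)
pattern = fr"\b(?:{alternation})\b"

# findall returns, per row, the list of whole-word terms found
# (an empty list when nothing matches).
df['matched'] = df['observations'].str.findall(pattern)

# Keep only the rows where at least one term was found.
hits = df[df['matched'].str.len() > 0]
print(hits)
```

Because \b only matches whole words, abbreviations such as 'add' are not caught by 'address'; each abbreviation would have to be listed in conditions as its own term.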
Upvotes: 4