Reputation: 859
I have a dataset that looks like this:
ID Symptoms
1 ear, fever
2 hearing loss
3 hurt ear
4 spear wound
5 bad hearing
6 earring cut
I want to flag only the records where "ear" appears. So for example, the output would look like this:
ID Symptoms Ear
1 ear, fever 1
2 hearing loss 0
3 hurt ear 1
4 spear wound 0
5 bad hearing 0
6 earring cut 0
I've played around with some code with little success:
Issue: this code flags anything containing the text "ear", including "hearing" and "spear":
LABS_TAT.loc[:, "Ear"] = np.where(LABS_TAT["Symptoms"].str.contains("ear", case=False), 1, 0)
Notice the space after "ear ": this code would not flag the record "hurt ear".
LABS_TAT.loc[:, "Ear"] = np.where(LABS_TAT["Symptoms"].str.contains("ear ", case=False), 1, 0)
Notice the space before " ear": this code would not flag the record "ear, fever".
LABS_TAT.loc[:, "Ear"] = np.where(LABS_TAT["Symptoms"].str.contains(" ear", case=False), 1, 0)
How can I fix my code so that it flags only the records containing the standalone word "ear"? I feel like there is a simple answer, but I'm still somewhat of a newb to Python.
Upvotes: 1
Views: 326
Reputation: 41277
Since .contains() takes a regex pattern, this should be as easy as .contains(r"\bear\b", case=False).
\b matches at a word boundary (it is a zero-width position, not an actual character), so "ear" only matches when it is not part of a longer word. You can read more about regular expressions in the Python standard library documentation.
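For reference, here is a minimal sketch of that pattern dropped into the asker's np.where line. The DataFrame is rebuilt by hand from the sample rows in the question, and the intermediate variable mask is just a name picked for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; LABS_TAT is the asker's variable name.
LABS_TAT = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Symptoms": ["ear, fever", "hearing loss", "hurt ear",
                 "spear wound", "bad hearing", "earring cut"],
})

# \b on both sides means "ear" must stand alone as a word,
# so "hearing", "spear" and "earring" are not matched.
mask = LABS_TAT["Symptoms"].str.contains(r"\bear\b", case=False)
LABS_TAT["Ear"] = np.where(mask, 1, 0)

print(LABS_TAT["Ear"].tolist())  # [1, 0, 1, 0, 0, 0]
```

The comma in "ear, fever" counts as a non-word character, which is why that row is still flagged without any special handling of punctuation.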
Upvotes: 1
Reputation: 71689
Use Series.str.contains with a regex pattern:
df['Ear'] = df['Symptoms'].str.contains(r'(?i)\bear\b').astype(int)
Result:
   ID      Symptoms  Ear
0   1    ear, fever    1
1   2  hearing loss    0
2   3      hurt ear    1
3   4   spear wound    0
4   5   bad hearing    0
5   6   earring cut    0
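One thing to watch for if the real data has blanks: a missing value in Symptoms can come through .str.contains as NaN, and the cast to int may then fail depending on the pandas version. The sketch below assumes that situation. The rows are made up for illustration (they are not from the question), and it passes na=False so that missing entries are treated as non-matches.

```python
import pandas as pd

# Hypothetical rows for illustration: one missing entry, one upper-case match.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Symptoms": ["ear, fever", "hearing loss", None, "Hurt EAR"],
})

# (?i) makes the pattern case-insensitive; na=False maps missing
# values to False so the result is purely boolean before the cast.
df["Ear"] = df["Symptoms"].str.contains(r"(?i)\bear\b", na=False).astype(int)

print(df["Ear"].tolist())  # [1, 0, 0, 1]
```

Whether a blank should count as 0 or stay missing is a judgment call for the dataset; na=False is just the simplest choice when a 0/1 flag is all that is needed.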
Upvotes: 1