Reputation: 17164
I would like to extract the word like this:
a dog ==> dog
some dogs ==> dog
dogmatic ==> None
There is a similar question, "Extract substring from text in a pandas DataFrame as new column", but it does not fulfill my requirements.
From this dataframe:
df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
'C likes cats.', 'D likes cat!',
'E is educated',
'F is catholic',
'G likes cat, he has three of them.',
'H likes cat; he has four of them.',
'I adore !!cats!!',
'x is dogmatic',
'x is eating hotdogs.',
'x likes dogs, he has three of them.',
'x likes dogs; he has four of them.',
'x adores **dogs**'
]})
How can I get the correct output, shown in the label column below?
    comment                               label  EXTRACT
0   A likes cat                           cat    cat
1   B likes Cats                          cat    cat
2   C likes cats.                         cat    cat
3   D likes cat!                          cat    cat
4   E is educated                         None   cat
5   F is catholic                         None   cat
6   G likes cat, he has three of them.    cat    cat
7   H likes cat; he has four of them.     cat    cat
8   I adore !!cats!!                      cat    cat
9   x is dogmatic                         None   dog
10  x is eating hotdogs.                  None   dog
11  x likes dogs, he has three of them.   dog    dog
12  x likes dogs; he has four of them.    dog    dog
13  x adores **dogs**                     dog    dog
Upvotes: 0
Views: 189
Reputation: 423
What you are trying to achieve is extracting the label of your sentence. That is a natural language processing problem rather than a pure programming problem.
Approaches:
Lemmatizer solution - I used some preprocessing code from another answer in this question
import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # fetch the WordNet data before the lemmatizer is used
lemma = WordNetLemmatizer()
df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
'C likes cats.', 'D likes cat!',
'E is educated',
'F is catholic',
'G likes cat, he has three of them.',
'H likes cat; he has four of them.',
'I adore !!cats!!',
'x is dogmatic',
'x is eating hotdogs.',
'x likes dogs, he has three of them.',
'x likes dogs; he has four of them.',
'x adores **dogs**'
]})
word_list = ["cat", "dog"] # words (and all variations) that you wish to check for
word_list = list(map(lemma.lemmatize, word_list))
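# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df["label"] = df["comment"].str.lower().str.replace('[^a-zA-Z]', ' ', regex=True).apply(lambda x: [lemma.lemmatize(word) for word in x.split()])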
df["label"] = df["label"].apply(lambda x: [i for i in word_list if i in x])
df["label"] = df["label"].apply(lambda x: None if not x else x)
print(df)
Upvotes: 2
Reputation: 1233
import pandas as pd
df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
'C likes cats.', 'D likes cat!',
'E is educated',
'F is catholic',
'G likes cat, he has three of them.',
'H likes cat; he has four of them.',
'I adore !!cats!!',
'x is dogmatic',
'x is eating hotdogs.',
'x likes dogs, he has three of them.',
'x likes dogs; he has four of them.',
'x adores **dogs**'
]})
word_list = ["cat", "cats", "dog", "dogs"] # words (and all variations) that you wish to check for
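# strip punctuation (regex=True is required on pandas 2.0+), split into words, keep the listed words
df["label"] = df["comment"].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.split().apply(lambda x: [i for i in word_list if i in x])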
df["label"] = df["label"].apply(lambda x: None if not x else x)
# the column holds lists (or None), so take the first hit and drop a trailing plural "s"
df["label"] = df["label"].apply(lambda x: x[0].rstrip("s") if x else None)
Then that gives you:
df
comment label
0 A likes cat cat
1 B likes Cats cat
2 C likes cats. cat
3 D likes cat! cat
4 E is educated None
5 F is catholic None
6 G likes cat, he has three of them. cat
7 H likes cat; he has four of them. cat
8 I adore !!cats!! cat
9 x is dogmatic None
10 x is eating hotdogs. None
11 x likes dogs, he has three of them. dog
12 x likes dogs; he has four of them. dog
13 x adores **dogs** dog
Upvotes: 2
Reputation: 20747
Something like this?
/^(.*?[^a-z\r\n])?((cat|dog)s?)([^a-z\r\n].*?)?$/gmi
Group \2 will contain one of: cat, dog, cats, dogs
https://regex101.com/r/Tt3MiZ/3
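For anyone who wants to try the pattern above outside regex101, here is a minimal sketch using Python's built-in re module. The function name extract_label and the sample list are made up for illustration, and mapping the /gmi flags to re.I | re.M is my assumption about the closest Python equivalent.

```python
import re

# Pattern copied from the answer above; the /gmi flags become re.I | re.M
# (the "g" flag has no direct counterpart and is not needed with search()).
PATTERN = re.compile(
    r"^(.*?[^a-z\r\n])?((cat|dog)s?)([^a-z\r\n].*?)?$",
    re.I | re.M,
)

def extract_label(comment):
    """Return 'cat' or 'dog' if it occurs as a whole word, else None."""
    match = PATTERN.search(comment)
    # Group 2 holds the word as written; group 3 is the singular stem.
    return match.group(3).lower() if match else None

comments = ["B likes Cats", "x is dogmatic", "x is eating hotdogs.", "I adore !!cats!!"]
labels = [extract_label(c) for c in comments]
```

Taking group 3 instead of group 2 and lower-casing it gives the singular label directly, so no separate step is needed to strip the plural "s".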
Upvotes: 1
Reputation: 42946
We can use str.extract with a negative lookahead, (?!...). It checks that the match is not followed by two or more letters, which rules out words such as dogmatic.
After that we use np.where with a positive lookbehind, (?<=...). The logic is as follows:
Every row in which "dog" or "cat" is directly preceded by a word character (as in hotdogs) is replaced by NaN.
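import numpy as np
words = ['cat', 'dog']
# keep a match only if it is not followed by two or more letters
df['label'] = df['comment'].str.extract('(?i)'+'('+'|'.join(words)+')(?![A-Za-z]{2,})', expand=False)
# blank out rows where the word is glued to a preceding word character
df['label'] = np.where(df['comment'].str.contains(r'(?<=\wdog)|(?<=\wcat)'), np.nan, df['label'])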
Output
comment label
0 A likes cat cat
1 B likes Cats Cat
2 C likes cats. cat
3 D likes cat! cat
4 E is educated NaN
5 F is catholic NaN
6 G likes cat, he has three of them. cat
7 H likes cat; he has four of them. cat
8 I adore !!cats!! cat
9 x is dogmatic NaN
10 x is eating hotdogs. NaN
11 x likes dogs, he has three of them. dog
12 x likes dogs; he has four of them. dog
13 x adores **dogs** dog
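The same two-step logic can be checked on single strings without pandas. This is only a sketch with the standard re module; the names extract_re, glued_re and label are mine, not part of the answer.

```python
import re

words = ["cat", "dog"]

# Step 1: the word, not followed by two or more letters (rejects "dogmatic").
extract_re = re.compile("(?i)(" + "|".join(words) + ")(?![A-Za-z]{2,})")
# Step 2: the word directly preceded by a word character (rejects "hotdogs").
glued_re = re.compile(r"(?<=\wdog)|(?<=\wcat)")

def label(comment):
    """Return the matched word as written, or None if either check fails."""
    match = extract_re.search(comment)
    if match is None or glued_re.search(comment):
        return None
    return match.group(1)
```

As in the output above, the match keeps its original case, so "B likes Cats" yields "Cat"; call .lower() on the result if you need a uniform label.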
Upvotes: 4
Reputation: 7
In this case I imagine you don't even need to use regex. Just use the equal-to operator == to test for an exact match, since you're looking for "dog", "dogs", "cat" or "cats" as an entire word. For example:
string = "he likes hotdogs"
for word in string.split():
    if word == "dogs":
        print("Yes")
    else:
        print("No")
With the string "he likes hotdogs", the loop above prints "No" for every word.
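Applied to the comments in the question, a plain equality test also has to deal with case and punctuation ("cats." is not equal to "cats"). Here is a rough sketch of one way to do that; the LABELS mapping and the function name label_by_equality are made up for illustration, and a regex is used only to split the text into words.

```python
import re

# Hypothetical lookup table: each accepted word maps to its label.
LABELS = {"cat": "cat", "cats": "cat", "dog": "dog", "dogs": "dog"}

def label_by_equality(comment):
    """Return the label of the first word that exactly equals a known word."""
    # Lower-case the text and keep only runs of letters, which drops punctuation.
    tokens = re.findall(r"[a-z]+", comment.lower())
    for token in tokens:
        if token in LABELS:  # exact, whole-word comparison
            return LABELS[token]
    return None
```

Because each token is compared as a whole, "hotdogs" and "dogmatic" never match, while "!!cats!!" and "Cats" do.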
Upvotes: -1
Reputation: 1
You can compile regex for cat, cats, dog and dogs.
import re
# match cat, cats, dog or dogs as whole words only, ignoring case
regex = re.compile(r'\b(?:cat|dog)s?\b', re.I)
data = ['A likes cat', 'B likes Cats',
'C likes cats.', 'D likes cat!',
'E is educated',
'F is catholic',
'G likes cat, he has three of them.',
'H likes cat; he has four of them.',
'I adore !!cats!!',
'x is dogmatic',
'x is eating hotdogs.',
'x likes dogs, he has three of them.',
'x likes dogs; he has four of them.',
'x adores **dogs**'
]
for i in data:
    t = regex.search(i)
    print(t)
Upvotes: -2