BhishanPoudel

Reputation: 17164

How to extract single word (not larger word containing it) in pandas dataframe?

I would like to extract the word like this:

a dog ==> dog
some dogs ==> dog
dogmatic ==> None

There is a similar question: Extract substring from text in a pandas DataFrame as new column

But it does not fulfill my requirements.

From this dataframe:

df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                               'C likes cats.', 'D likes cat!', 
                               'E is educated',
                              'F is catholic',
                              'G likes cat, he has three of them.',
                              'H likes cat; he has four of them.',
                              'I adore !!cats!!',
                              'x is dogmatic',
                              'x is eating hotdogs.',
                              'x likes dogs, he has three of them.',
                              'x likes dogs; he has four of them.',
                              'x adores **dogs**'
                              ]})

How can I get the correct output?

                            comment      label EXTRACT
0                           A likes cat   cat     cat
1                          B likes Cats   cat     cat
2                         C likes cats.   cat     cat
3                          D likes cat!   cat     cat
4                         E is educated  None     cat
5                         F is catholic  None     cat
6    G likes cat, he has three of them.   cat     cat
7     H likes cat; he has four of them.   cat     cat
8                      I adore !!cats!!   cat     cat
9                         x is dogmatic  None     dog
10                 x is eating hotdogs.  None     dog
11  x likes dogs, he has three of them.   dog     dog
12   x likes dogs; he has four of them.   dog     dog
13                    x adores **dogs**   dog     dog

NOTE: The column EXTRACT gives the wrong answer; I need the result shown in the column label.

Upvotes: 0

Views: 189

Answers (6)

user2672299

Reputation: 423

What you are trying to achieve is extracting the label of your sentence. This is a natural language processing problem, not a programming problem.

Approaches:

  1. Use a stemmer/lemmatizer. You could match the output of the stemmer against your stemmed class name list. This will most likely not give you a high enough accuracy.
  2. Train a machine learning classifier on your topics/labels.

Lemmatizer solution - I used some preprocessing code from another answer in this question

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')    # the lemmatizer needs the WordNet corpus
lemma = WordNetLemmatizer()


df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                           'C likes cats.', 'D likes cat!', 
                           'E is educated',
                          'F is catholic',
                          'G likes cat, he has three of them.',
                          'H likes cat; he has four of them.',
                          'I adore !!cats!!',
                          'x is dogmatic',
                          'x is eating hotdogs.',
                          'x likes dogs, he has three of them.',
                          'x likes dogs; he has four of them.',
                          'x adores **dogs**'
                          ]})

word_list = ["cat",  "dog"]    # words (and all variations) that you wish to check for
word_list = list(map(lemma.lemmatize, word_list))


df["label"] = df["comment"].str.lower().str.replace(r'[^a-zA-Z]', ' ', regex=True).apply(lambda x: [lemma.lemmatize(word) for word in x.split()])
df["label"] = df["label"].apply(lambda x: [i for i in word_list if i in x])

df["label"] = df["label"].apply(lambda x: None if not x else x)
print(df)

Upvotes: 2

Ted

Reputation: 1233

import pandas as pd

df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                           'C likes cats.', 'D likes cat!', 
                           'E is educated',
                          'F is catholic',
                          'G likes cat, he has three of them.',
                          'H likes cat; he has four of them.',
                          'I adore !!cats!!',
                          'x is dogmatic',
                          'x is eating hotdogs.',
                          'x likes dogs, he has three of them.',
                          'x likes dogs; he has four of them.',
                          'x adores **dogs**'
                          ]})

word_list = ["cat", "cats", "dog", "dogs"]    # words (and all variations) that you wish to check for

# lowercase, drop punctuation, split into words, keep the words found in word_list
df["label"] = df["comment"].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.split().apply(lambda x: [i for i in word_list if i in x])
# take the first match and drop a trailing plural "s"; None when nothing matched
df["label"] = df["label"].apply(lambda x: x[0].rstrip("s") if x else None)

Then that gives you:

df
    comment                             label
0   A likes cat                         cat
1   B likes Cats                        cat
2   C likes cats.                       cat
3   D likes cat!                        cat
4   E is educated                       None
5   F is catholic                       None
6   G likes cat, he has three of them.  cat
7   H likes cat; he has four of them.   cat
8   I adore !!cats!!                    cat
9   x is dogmatic                       None
10  x is eating hotdogs.                None
11  x likes dogs, he has three of them. dog
12  x likes dogs; he has four of them.  dog
13  x adores **dogs**                   dog

Upvotes: 2

MonkeyZeus

Reputation: 20747

Something like this?

/^(.*?[^a-z\r\n])?((cat|dog)s?)([^a-z\r\n].*?)?$/gmi

\2 will contain one of: cat, dog, cats, dogs

https://regex101.com/r/Tt3MiZ/3
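
As a rough sketch, the same pattern can be tried from Python with the re module. The function name extract_label is made up for illustration, and the /m and /i flags are assumed to map to re.MULTILINE and re.IGNORECASE. Group 3 holds the bare word without the optional plural "s", so it is used here instead of \2:

```python
import re

# Same pattern as above; /m -> re.MULTILINE, /i -> re.IGNORECASE.
PATTERN = re.compile(
    r"^(.*?[^a-z\r\n])?((cat|dog)s?)([^a-z\r\n].*?)?$",
    re.MULTILINE | re.IGNORECASE,
)

def extract_label(comment):
    """Return 'cat' or 'dog' when it appears as a whole word (optionally plural), else None."""
    match = PATTERN.search(comment)
    # Group 2 is the matched word including the plural; group 3 is the singular form.
    return match.group(3).lower() if match else None

for text in ["B likes Cats", "F is catholic", "x is eating hotdogs.", "x adores **dogs**"]:
    print(text, "->", extract_label(text))
```

With pandas, a function like this could be applied row by row through df['comment'].apply(extract_label).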

Upvotes: 1

Erfan

Reputation: 42946

We can use str.extract with a negative lookahead, (?!...). It rejects a match that is followed by two or more letters, which rules out words such as dogmatic.

After that we use np.where with a positive lookbehind, (?<=...). The logic is as follows:

All rows in which "dog" or "cat" is directly preceded by a word character (as in hotdogs) are replaced by NaN.

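import numpy as np

words = ['cat', 'dog']

# keep the word unless two or more letters follow it (e.g. "dogmatic")
df['label'] = df['comment'].str.extract(r'(?i)(' + '|'.join(words) + r')(?![A-Za-z]{2,})', expand=False)
# blank out rows where the word is preceded by a word character (e.g. "hotdogs")
df['label'] = np.where(df['comment'].str.contains(r'(?<=\wdog)|(?<=\wcat)'), np.nan, df['label'])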

Output

                                comment label
0                           A likes cat   cat
1                          B likes Cats   Cat
2                         C likes cats.   cat
3                          D likes cat!   cat
4                         E is educated   NaN
5                         F is catholic   NaN
6    G likes cat, he has three of them.   cat
7     H likes cat; he has four of them.   cat
8                      I adore !!cats!!   cat
9                         x is dogmatic   NaN
10                 x is eating hotdogs.   NaN
11  x likes dogs, he has three of them.   dog
12   x likes dogs; he has four of them.   dog
13                    x adores **dogs**   dog
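
The two lookaround checks can also be folded into a single pattern. The sketch below is a variant, not the code from this answer: it uses plain re, assumes the only inflection to accept is an optional plural "s", and the helper name extract_word is hypothetical.

```python
import re

words = ["cat", "dog"]

# (?<![A-Za-z])  negative lookbehind: no letter directly before the word ("hotdogs" fails)
# s?             optional plural
# (?![A-Za-z])   negative lookahead: no letter directly after ("dogmatic", "catholic" fail)
pattern = re.compile(
    r"(?<![A-Za-z])(" + "|".join(map(re.escape, words)) + r")s?(?![A-Za-z])",
    re.IGNORECASE,
)

def extract_word(comment):
    """Return the lowercased listed word if it stands alone in the comment, else None."""
    match = pattern.search(comment)
    return match.group(1).lower() if match else None

comments = ["B likes Cats", "E is educated", "x is dogmatic",
            "x is eating hotdogs.", "x adores **dogs**"]
labels = [extract_word(c) for c in comments]
print(labels)  # ['cat', None, None, None, 'dog']
```

In pandas, the pattern string (pattern.pattern) could be passed to Series.str.extract with flags=re.IGNORECASE. The extracted text keeps its original case, so a .str.lower() afterwards would be needed to turn "Cat" into "cat".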

Upvotes: 4

robscurity

Reputation: 7

In this case I imagine you don't even need a regex. Split the string into words and use the equality operator == to test for an exact match, since you're looking for "dog", "dogs", "cat" or "cats" as the entire word. For example:

string = "he likes hotdogs"
for word in string.split():    # iterate over words, not characters
    if word == "dogs":
        print("Yes")
    else:
        print("No")

Because the string is "he likes hotdogs", the loop prints "No" for every word.
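
A plain == comparison does not cope with capital letters or attached punctuation such as "cats." or "**dogs**". A minimal sketch of one way to handle that without a regex follows; the mapping TOKEN_TO_LABEL and the function label_for are illustrative names, and the list of accepted tokens is an assumption.

```python
import string

# Assumed mapping from each accepted token to the label it should produce.
TOKEN_TO_LABEL = {"cat": "cat", "cats": "cat", "dog": "dog", "dogs": "dog"}

def label_for(comment):
    """Return the label of the first whole-word match in the comment, else None."""
    for token in comment.lower().split():
        # Strip leading/trailing punctuation so "cats." and "**dogs**" compare equal
        # to "cats" and "dogs"; letters inside the token are left untouched.
        word = token.strip(string.punctuation)
        if word in TOKEN_TO_LABEL:
            return TOKEN_TO_LABEL[word]
    return None

print(label_for("G likes cat, he has three of them."))
print(label_for("x is eating hotdogs."))
```

Since "hotdogs" and "dogmatic" never equal one of the listed tokens, they fall through to None.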

Upvotes: -1

P K

Reputation: 1

You can compile a single regex that covers cat, cats, dog and dogs as whole words.

import re
regex = re.compile(r'\b(?:cat|dog)s?\b', re.I)    # \b keeps "catholic" and "hotdogs" from matching
data = ['A likes cat', 'B likes Cats',
                           'C likes cats.', 'D likes cat!', 
                           'E is educated',
                          'F is catholic',
                          'G likes cat, he has three of them.',
                          'H likes cat; he has four of them.',
                          'I adore !!cats!!',
                          'x is dogmatic',
                          'x is eating hotdogs.',
                          'x likes dogs, he has three of them.',
                          'x likes dogs; he has four of them.',
                          'x adores **dogs**'
                          ]
for i in data:
    t = regex.search(i)
    print(t.group(0) if t else None)    # matched text, or None when nothing matches

Upvotes: -2
