jeangelj

Reputation: 4498

python pandas get rid of plural "s" in words to prepare for word count

I have the following python pandas dataframe:

Question_ID | Customer_ID | Answer
    1           234         The team worked very hard ...
    2           234         All the teams have been working together ...

I am going to use my code to count words in the answer column. But beforehand, I want to take out the "s" from the word "teams", so that in the example above I count team: 2 instead of team:1 and teams:1.

How can I do this for all words?

Upvotes: 1

Views: 1655

Answers (3)

piRSquared

Reputation: 294488

Use str.replace with a regular expression to remove the trailing 's' from any word of three or more letters that ends in 's'.

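# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal by default
df.Answer.str.replace(r'(\w{2,})s\b', r'\1', regex=True)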

0                  The team worked very hard ...
1    All the team have been working together ...
Name: Answer, dtype: object

'{2,}' requires at least two word characters before the final 's', so two-letter words such as 'is' are left alone. You can change it to '{3,}' to skip 'its' as well.
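To make the difference between '{2,}' and '{3,}' concrete, here is a small sketch that uses Python's built-in re module rather than pandas, with a made-up sample sentence. The names STRIP_2, STRIP_3, and text are only for illustration.

```python
import re

# Same pattern as in the answer: 2+ word characters followed by a final "s".
STRIP_2 = re.compile(r'(\w{2,})s\b')
# Stricter variant: 3+ word characters before the final "s".
STRIP_3 = re.compile(r'(\w{3,})s\b')

# Made-up sample sentence (an assumption, not from the question's data).
text = "All the teams think this is its best work"

loose = STRIP_2.sub(r'\1', text)
strict = STRIP_3.sub(r'\1', text)

print(loose)   # All the team think thi is it best work
print(strict)  # All the team think thi is its best work
```

Note that both variants still turn 'this' into 'thi': the pattern cannot tell a plural from a word that merely ends in 's', so expect some false positives in the word count.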

Upvotes: 1

Little Bobby Tables

Reputation: 4742

Try NLTK (the Natural Language Toolkit), specifically its stemming and lemmatization tools. I have never personally used it, but here you can try it out.

Here is an example with some tricky plurals:

its it's his quizzes fishes maths mathematics

becomes

it it ' s hi quizz fish math mathemat

You can see it deals with "his" (and "mathematics") poorly, but then again you could have lots of abbreviated "hellos". This is the nature of the beast.

Upvotes: 0

DYZ

Reputation: 57085

You need to use a tokenizer (for breaking a sentence into words) and a lemmatizer (for standardizing word forms), both provided by the Natural Language Toolkit (nltk):

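import nltk
# Requires the WordNet corpus: run nltk.download('wordnet') once beforehand.
sentence = 'All the teams have been working together'
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(word) for word in nltk.wordpunct_tokenize(sentence)]
# ['All', 'the', 'team', 'have', 'been', 'working', 'together']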
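To connect this back to the word count the question asks for, here is a rough sketch of counting words after normalizing each one. The helper names count_words and strip_plural are hypothetical; strip_plural is only a crude stand-in so the sketch runs without NLTK or its data, and the tokenization is a simple regex rather than nltk.wordpunct_tokenize.

```python
import re
from collections import Counter

def count_words(answers, normalize=lambda w: w):
    """Count lowercase word tokens across all answers after normalizing each one."""
    counts = Counter()
    for answer in answers:
        # Simple tokenization: runs of letters and apostrophes.
        for token in re.findall(r"[a-z']+", answer.lower()):
            counts[normalize(token)] += 1
    return counts

# The two answers from the question, without the trailing "...".
answers = [
    "The team worked very hard",
    "All the teams have been working together",
]

# Hypothetical stand-in normalizer (an assumption, not part of NLTK):
# drops a final "s" from words longer than three letters.
def strip_plural(word):
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

counts = count_words(answers, normalize=strip_plural)
print(counts["team"])  # 2
```

With NLTK installed, something like count_words(df['Answer'], normalize=wnl.lemmatize) should work the same way, since iterating over a pandas Series yields its values.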

Upvotes: 7
