itsbrycehere

Reputation: 49

Creating a function to count parts of speech for each instance in a pandas DataFrame

I've used NLTK to pos_tag sentences in a pandas DataFrame from an old Yelp competition. Each row now holds a list of (word, POS) tuples. I'd like to count the parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features, no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pandas feature.

The head is here, as a tsv: https://pastebin.com/FnnBq9rf

Upvotes: 1

Views: 1295

Answers (3)

Isurie

Reputation: 320

For example, given a DataFrame df, the noun count for each row of the "reviews" column can be saved to a new "noun_count" column with this code:

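from nltk import pos_tag, word_tokenize

def NounCount(x):
    # Tokenize and tag the raw text, then count tags starting with "NN" (NN, NNS, NNP, NNPS)
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)

df.to_csv('./dataset.csv')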
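
If the column already holds tagged tuples, as in the question, all tags can be tallied in one pass instead of re-tagging the text. Here is a minimal sketch using collections.Counter. The helper name tag_counts and the example row are made up for illustration, and the tags are hand-written rather than real tagger output.

```python
from collections import Counter

def tag_counts(tagged):
    """Return a Counter mapping each POS tag to how often it appears."""
    return Counter(pos for _word, pos in tagged)

# Hand-written example row; not real tagger output.
tagged_review = [
    ("the", "DT"), ("food", "NN"), ("was", "VBD"), ("great", "JJ"),
    ("and", "CC"), ("the", "DT"), ("staff", "NN"), ("smiled", "VBD"),
]

counts = tag_counts(tagged_review)
print(counts["NN"])   # nouns tagged exactly NN
print(counts["VBD"])  # past-tense verbs
```

On a DataFrame this could probably be applied per row with something like df["pos_tag"].apply(tag_counts), assuming the tagged column is named pos_tag. A Counter returns 0 for tags that never occur, so missing tags need no special handling.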

Upvotes: 1

itsbrycehere

Reputation: 49

Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!

def NounCounter(x):
    # x is one row's list of (word, POS) tuples
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
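
If only the counts are needed, the intermediate list can be skipped and the tag prefix passed in as a parameter, so one function covers nouns, verbs, and so on. This is a rough sketch, not a tested solution for the Yelp data. The helper name count_tags and the two rows are hypothetical, and the tags are hand-written.

```python
import pandas as pd

def count_tags(tagged, prefix):
    """Count (word, tag) tuples whose tag starts with `prefix`."""
    return sum(1 for _word, pos in tagged if pos.startswith(prefix))

# Hand-written rows standing in for the tagged column; not real tagger output.
df = pd.DataFrame({
    "pos_tag": [
        [("food", "NN"), ("was", "VBD"), ("great", "JJ")],
        [("we", "PRP"), ("ate", "VBD"), ("tacos", "NNS"), ("and", "CC"), ("chips", "NNS")],
    ]
})

# Extra keyword arguments given to Series.apply are passed through to the function.
df["noun_count"] = df["pos_tag"].apply(count_tags, prefix="NN")
df["verb_count"] = df["pos_tag"].apply(count_tags, prefix="VB")
print(df[["noun_count", "verb_count"]])
```

One caveat: if the column was read back from a TSV, each cell may be a string rather than a list of tuples, in which case it would need to be parsed first (for example with ast.literal_eval). That is an assumption about the data, not something confirmed in the thread.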

Upvotes: 1

Yilun Zhang

Reputation: 9018

There are a number of ways to do that. One very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether each word is a verb, and then count the number of 1's.

Assume you have something like this (please correct me if it's not, as you didn't provide an example):

import pandas as pd
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])

You can do something like this to map the Series and sum the count:

a.map(lambda x: 1 if x[1] == "verb" else 0).sum()

This returns 2.


I grabbed a sentence from the link you shared:

import nltk
import pandas as pd

text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2
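
As far as I know, the Penn Treebank tags that nltk.pos_tag produces by default do not separate forms of "to be" from other verbs (the count of 2 above presumably covers both "took" and "was"). So counting being verbs specifically would mean checking the word as well as the tag. Here is a minimal sketch under that assumption. BE_FORMS and count_being_verbs are names invented for illustration, and the tags are hand-written for the sentence above rather than copied from real tagger output.

```python
# Forms of "to be"; an illustrative list, adjust as needed.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}

def count_being_verbs(tagged):
    """Count tuples tagged as a verb (VB*) whose word is a form of "to be"."""
    return sum(
        1
        for word, pos in tagged
        if pos.startswith("VB") and word.lower() in BE_FORMS
    )

# Hand-written tags for the example sentence; not real tagger output.
tagged = [
    ("My", "PRP$"), ("wife", "NN"), ("took", "VBD"), ("me", "PRP"),
    ("here", "RB"), ("on", "IN"), ("my", "PRP$"), ("birthday", "NN"),
    ("for", "IN"), ("breakfast", "NN"), ("and", "CC"), ("it", "PRP"),
    ("was", "VBD"), ("excellent", "JJ"), (".", "."),
]

being_count = count_being_verbs(tagged)
print(being_count)
```

Requiring a VB* tag as well as a matching word avoids counting "being" when it is used as a noun. Contractions such as "'s" are not covered by this list and would need separate handling.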

Upvotes: 0
