Radix

Reputation: 264

Remove keywords which are not bigram or trigram (Yake)

I am using Yake (Yet Another Keyword Extractor) to extract keywords from a dataframe. I want to extract only bigrams and trigrams, but Yake only lets you set a maximum n-gram size, not a minimum. How would you remove the unigrams?

Example df.head(0):

Text: 'oui , yes , i mumbled , the linguistic transition now in limbo .'

Keywords: '[('oui', 0.04491197687864554), ('linguistic transition', 0.09700399286574239), ('mumbled', 0.15831692877998726)]'

I want to remove 'oui' and 'mumbled', along with their scores, from the Keywords column.

Thank you for your time!

Upvotes: 3

Views: 619

Answers (2)

Conso

Reputation: 88

If your problem is that the keywords list contains some unigrams, you can simply filter out the keywords that contain no space and build a new list. Here is an example:

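# Sample Yake output from the question: a list of (keyword, score) tuples
keywords = [('oui', 0.04491197687864554), ('linguistic transition', 0.09700399286574239), ('mumbled', 0.15831692877998726)]

# Keep only keywords that contain a space, i.e. n-grams with n >= 2
keywords_without_unigrams = []
for kw in keywords:
    if ' ' in kw[0]:
        keywords_without_unigrams.append(kw)

# Print the remaining (keyword, score) tuples
for kw in keywords_without_unigrams:
    print(kw)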

Upvotes: 3

Raspberry PyCharm

Reputation: 104

If you need to handle the unigram case from Yake, just pass the output through a filter that adds an n-gram to the result list only if there is a space in the first element of its tuple, or if calling str.split() on that element yields more than one sub-element. If you're using a function and applying it to the dataframe, include this step in that function.
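A minimal sketch of such a filter function, in case it helps. The name filter_ngrams and the min_n/max_n parameters are made up for illustration and are not part of Yake. It also assumes the Keywords cell may be stored as the string form of the list (as the quotes in the question suggest), in which case it is parsed first:

```python
import ast

def filter_ngrams(keywords, min_n=2, max_n=3):
    """Keep only (keyword, score) tuples whose keyword has min_n..max_n words."""
    # Assumption: the cell may hold the list as a string; parse it if so.
    if isinstance(keywords, str):
        keywords = ast.literal_eval(keywords)
    # str.split() with no arguments splits on any run of whitespace.
    return [(kw, score) for kw, score in keywords
            if min_n <= len(kw.split()) <= max_n]

# The Keywords value from the question
row = "[('oui', 0.04491197687864554), ('linguistic transition', 0.09700399286574239), ('mumbled', 0.15831692877998726)]"
result = filter_ngrams(row)
print(result)  # [('linguistic transition', 0.09700399286574239)]
```

With pandas, this could then be applied per row with something like df['Keywords'] = df['Keywords'].apply(filter_ngrams), assuming the column is named Keywords as in the question.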

Upvotes: 1
