Remove keywords which are not bigram or trigram (Yake)

Question

I am using Yake (Yet Another Keyword Extractor) to extract keywords from a dataframe. I want to extract only bigrams and trigrams, but Yake only lets you set a maximum n-gram size, not a minimum. How would you remove the unigrams?

Example df.head(0):

Text: 'oui , yes , i mumbled , the linguistic transition now in limbo .'

Keywords: '[('oui', 0.04491197687864554), ('linguistic transition', 0.09700399286574239), ('mumbled', 0.15831692877998726)]'

I want to remove oui and mumbled, along with their scores, from the Keywords column.

Thank you for your time!

Conso · Accepted Answer

If your problem is that the keywords list contains some unigrams, you can simply filter out the entries whose keyword contains no space and build a new list. Here is an example:

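# keywords is the list of (keyword, score) tuples returned by Yake
keywords_without_unigrams = []
for kw in keywords:
    # kw[0] is the keyword text; a space means it has two or more words
    if ' ' in kw[0]:
        keywords_without_unigrams.append(kw)

for kw in keywords_without_unigrams:
    print(kw)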
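
Since the question is about a dataframe column, here is a minimal sketch of the same filter wrapped in a function that could be applied to each row. The function name keep_bigrams_trigrams and the min_n/max_n parameters are my own, not part of Yake. It counts words instead of checking for a space, so it would also drop anything longer than a trigram, and it assumes a cell may hold either a list of (keyword, score) tuples or the string form of such a list.

```python
import ast


def keep_bigrams_trigrams(keywords, min_n=2, max_n=3):
    """Keep only the (keyword, score) pairs whose keyword has min_n..max_n words."""
    # Assumption: a cell stored as text (e.g. after a CSV round trip)
    # is a Python literal and can be parsed back into a list of tuples.
    if isinstance(keywords, str):
        keywords = ast.literal_eval(keywords)
    return [
        (phrase, score)
        for phrase, score in keywords
        if min_n <= len(phrase.split()) <= max_n
    ]


# The keywords from the example row in the question.
example = [
    ('oui', 0.04491197687864554),
    ('linguistic transition', 0.09700399286574239),
    ('mumbled', 0.15831692877998726),
]
filtered = keep_bigrams_trigrams(example)
print(filtered)  # [('linguistic transition', 0.09700399286574239)]
```

If the column is named Keywords, something like df['Keywords'] = df['Keywords'].apply(keep_bigrams_trigrams) should apply it to every row.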
