Reputation: 264
I am using Yake (Yet Another Keyword Extractor) to extract keywords from a dataframe. I want to extract only bigrams and trigrams, but Yake only lets you set a maximum n-gram size, not a minimum. How would you remove the unigrams?
Example df.head(0):
Text: 'oui , yes , i mumbled , the linguistic transition now in limbo .'
Keywords: '[('oui', 0.04491197687864554), ('linguistic transition', 0.09700399286574239), ('mumbled', 0.15831692877998726)]'
I want to remove 'oui' and 'mumbled', along with their scores, from the Keywords column.
Thank you for your time!
Upvotes: 3
Views: 619
Reputation: 88
If your problem is that the keywords list contains some unigrams, you can simply filter out the keywords that contain no space and build a new list. Here is an example:
    keywords_without_unigrams = []
    for kw in keywords:
        if ' ' in kw[0]:
            keywords_without_unigrams.append(kw)
    for kw in keywords_without_unigrams:
        print(kw)
Upvotes: 3
Reputation: 104
If you need to handle the unigram case from Yake, just pass the output through a filter that adds an n-gram to the result list only if there is a space in the first element of its tuple, or if calling str.split() on that element yields more than one item. If you're using a function and applying it to the dataframe, include this step in that function.
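A minimal sketch of such a filter, using the str.split() variant so that both a minimum and a maximum word count can be set. The function name keep_multiword and its min_n/max_n parameters are made up for illustration and are not part of Yake. The string-parsing step is an assumption based on the quoted example in the question, which suggests the column may hold the list as text.

```python
import ast

def keep_multiword(keywords, min_n=2, max_n=3):
    """Keep only (keyword, score) pairs whose keyword has min_n..max_n words."""
    # Assumption: the cell may hold the list's string representation,
    # so parse it back into a real list of tuples first.
    if isinstance(keywords, str):
        keywords = ast.literal_eval(keywords)
    # Count whitespace-separated words in the keyword (first tuple element).
    return [(kw, score) for kw, score in keywords
            if min_n <= len(kw.split()) <= max_n]

# Sample data taken from the question.
sample = [('oui', 0.04491197687864554),
          ('linguistic transition', 0.09700399286574239),
          ('mumbled', 0.15831692877998726)]
filtered = keep_multiword(sample)
print(filtered)  # [('linguistic transition', 0.09700399286574239)]
```

With pandas, this could then be applied to the whole column, for example df['Keywords'] = df['Keywords'].apply(keep_multiword), assuming the column is named Keywords as in the question.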
Upvotes: 1