Reputation: 23
Having a brain-fart here, probably. I'm building a list of keywords from generators, and I'm having trouble removing duplicates from it the usual way, with set:
import spacy
import textacy
nlp = spacy.load("en_core_web_lg")
text = ('''The Western compulsion to hedonism has made us lust for money just to show that we have it. Possessions do not make a man—man makes possessions. The Western compulsion to hedonism has made us lust for money just to show that we have it. Possessions do not make a man—man makes possessions.''')
doc = nlp(text)
keywords = list(textacy.extract.ngrams(doc, 1, filter_stops=True, filter_punct=True, filter_nums=False))
keywords += list(textacy.extract.ngrams(doc, 2, filter_stops=True, filter_punct=True, filter_nums=False))
print(list(set(keywords)))
the result contains duplicates:
[man, lust, makes possessions, man, Possessions, makes possessions, man makes, hedonism, man, money, compulsion, Western compulsion, man, possessions, man makes, compulsion, Possessions, Western compulsion, possessions, Western, makes, makes, lust, hedonism, Western, money]
Upvotes: 1
Views: 581
Reputation: 1776
That's because the items in your list aren't strings, so they aren't actually duplicates.
>>> type(keywords[0])
spacy.tokens.span.Span
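Spans compare and hash by their position in the doc, not by their surface text alone, so two spans that read the same are still distinct to a set. A minimal sketch (using a blank English pipeline so no model download is needed):
>>> import spacy
>>> nlp = spacy.blank("en")             # tokenizer-only pipeline
>>> doc = nlp("money talks and money walks")
>>> first, second = doc[0:1], doc[3:4]  # two spans that both read "money"
>>> first.text == second.text
True
>>> first == second
False
>>> len({first, second})
2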
To keep only one span per word, you can use a dictionary comprehension with the spans' string representations as the unique keys, and then pull the spacy objects back out with .values():
uniques = list({repr(keyword): keyword for keyword in keywords}.values())
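Since dicts preserve insertion order (Python 3.7+), this keeps the first span seen for each distinct string. An equivalent sketch keyed on the span's .text attribute, which reads a bit more directly than going through repr:
# dedupe keyed on each span's text; the first span per string wins
uniques = list({keyword.text: keyword for keyword in keywords}.values())
If you also want the dedupe to be case-insensitive, key on keyword.text.lower() instead.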
Upvotes: 2