Reputation: 179
First experience with NLP here. I have about half a million tweets. I'm trying to use spaCy to remove stop words, lemmatize, etc., and then pass the processed text to a classification model. Because of the size of the data I need multiprocessing to do this at a reasonable speed, but I can't figure out what to do with the generator object once I have it.
Here I load spaCy and pass the data through the standard pipeline:
import spacy

nlp = spacy.load('en')
tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']

spacy_tweets = []
for tweet in tweets:
    doc_tweet = nlp.pipe(tweet, batch_size=10, n_threads=3)
    spacy_tweets.append(doc_tweet)
Now I'd like to take the Doc objects spaCy creates and then process the text with something like this:
def spacy_tokenizer(tweet):
    # 'stopwords' and 'punctuations' are defined elsewhere
    tweet = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet]
    tweet = [tok for tok in tweet if (tok not in stopwords and tok not in punctuations)]
    return tweet
But this doesn't work because spaCy returns generator objects when using the .pipe() method. So when I do this:
for tweet in spacy_tweets:
    print(tweet)
It prints the generator. Okay, I get that. But when I do this:
for tweet in spacy_tweets[0]:
    print(tweet)
I would expect it to print the Doc object or the text of the tweet in the generator, but it doesn't do that. Instead it prints each character out individually.
Am I approaching this wrong or is there something I need to do in order to retrieve the Doc objects from the generator objects so I can use the spaCy attributes for lemmatizing etc.?
Upvotes: 2
Views: 2656
Reputation: 10139
I think you are using the nlp.pipe command incorrectly.
nlp.pipe is for parallelization, which means that it processes tweets simultaneously. So, instead of giving the nlp.pipe command a single tweet as an argument, you should pass it the list of tweets. (This also explains the characters you saw: a string is itself an iterable, so nlp.pipe(tweet) treats each character of the tweet as a separate text and yields one Doc per character.)
The following code seems to achieve your goal:
import spacy

nlp = spacy.load('en')
tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']

spacy_tweets = nlp.pipe(tweets, batch_size=10, n_threads=3)
for tweet in spacy_tweets:
    for token in tweet:
        print(token.text, token.pos_)
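If you then want to run each Doc through your spacy_tokenizer and collect the results for your classifier, a minimal sketch might look like this (assuming spaCy 2.x, and defining the stopwords and punctuations your function refers to from spaCy's built-in stop list and string.punctuation):

import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en')
stopwords = STOP_WORDS
punctuations = string.punctuation

tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']

def spacy_tokenizer(tweet):
    # lemmatize; spaCy 2.x lemmatizes pronouns to the placeholder "-PRON-"
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_
              for tok in tweet]
    return [tok for tok in tokens if tok not in stopwords and tok not in punctuations]

# nlp.pipe yields Doc objects lazily; consuming the generator in a list
# comprehension streams all tweets through the pipeline in batches
processed = [spacy_tokenizer(doc) for doc in nlp.pipe(tweets, batch_size=10, n_threads=3)]
print(processed)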
Hope it helps!
Upvotes: 1