
Reputation: 179

How do I use generator objects in spaCy?

First experience with NLP here. I have about half a million tweets. I'm trying to use spaCy to remove stop words, lemmatize, etc., and then pass the processed text to a classification model. Because of the size of the data I need multiprocessing to do this at a reasonable speed, but I can't figure out what to do with the generator object once I have it.

Here I load spacy and pass the data through the standard pipeline:

import spacy

nlp = spacy.load('en')

tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']
spacy_tweets = []
for tweet in tweets:
    doc_tweet = nlp.pipe(tweet, batch_size=10, n_threads=3)
    spacy_tweets.append(doc_tweet)

Now I'd like to take the Doc objects spaCy creates and then process the text with something like this:

def spacy_tokenizer(tweet):
    # Lemmatize each token; spaCy lemmatizes pronouns to "-PRON-", so keep those as lowercased text
    tweet = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet]
    # Drop stop words and punctuation
    tweet = [tok for tok in tweet if tok not in stopwords and tok not in punctuations]
    return tweet

But this doesn't work because spaCy returns generator objects when using the .pipe() method. So when I do this:

for tweet in spacy_tweets:
    print(tweet)

It prints the generator. Okay, I get that. But when I do this:

for tweet in spacy_tweets[0]:
    print(tweet)

I would expect it to print the Doc object or the text of the tweet in the generator, but it doesn't do that. Instead it prints each character out individually.

Am I approaching this wrong, or is there something I need to do to retrieve the Doc objects from the generator objects so I can use the spaCy attributes for lemmatizing, etc.?

Upvotes: 2

Views: 2656

Answers (1)

gdaras

Reputation: 10139

I think you are using the nlp.pipe command incorrectly.

nlp.pipe is meant for parallelization: it processes several texts simultaneously. So instead of passing a single tweet to nlp.pipe, you should pass it the whole list of tweets. A Python string is itself an iterable of characters, so when you called nlp.pipe(tweet) on one tweet, spaCy treated each character as a separate text, which is why your loop printed individual characters.
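
You can see the difference in isolation (a minimal sketch, assuming the same 'en' model as in your question):

import spacy
nlp = spacy.load('en')

# A string is an iterable of characters, so each character
# becomes its own Doc.
docs = list(nlp.pipe('Hi'))
print(len(docs))    # 2

# A list of strings yields one Doc per string.
docs = list(nlp.pipe(['Hi']))
print(len(docs))    # 1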

The following code seems to achieve your goal:

import spacy
nlp = spacy.load('en')

tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']

# Pass the whole list: pipe() yields one Doc per tweet
spacy_tweets = nlp.pipe(tweets, batch_size=10, n_threads=3)

for tweet in spacy_tweets:
    for token in tweet:
        print(token.text, token.pos_)
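
Also note that nlp.pipe returns a generator, so you can only iterate over it once; wrap it in list() if you need the Doc objects again afterwards. Putting it together with your spacy_tokenizer, here is a sketch that assumes spaCy 2.x (where the English stop words live in spacy.lang.en.stop_words and pronoun lemmas come back as "-PRON-"):

import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en')

stopwords = STOP_WORDS
punctuations = set(string.punctuation)

def spacy_tokenizer(tweet):
    # Lemmatize each token, except pronouns, which spaCy lemmatizes to "-PRON-"
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_
              for tok in tweet]
    # Drop stop words and punctuation
    return [tok for tok in tokens if tok not in stopwords and tok not in punctuations]

tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']

# list() materializes the generator so the Docs can be reused;
# with 500k tweets you may prefer to keep iterating lazily to save memory.
docs = list(nlp.pipe(tweets, batch_size=10, n_threads=3))
processed = [spacy_tokenizer(doc) for doc in docs]
print(processed)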

Hope it helps!

Upvotes: 1
