TR517

Reputation: 341

How to speed up spaCy lemmatization?

I'm using spaCy (version 2.0.11) for lemmatization in the first step of my NLP pipeline, but unfortunately it's taking a very long time. It is clearly the slowest part of my processing pipeline, and I want to know if there are improvements I could be making. I am running the pipeline as:

nlp.pipe(docs_generator, batch_size=200, n_threads=6, disable=['ner'])

on an 8-core machine, and I have verified that the machine is using all the cores.

On a corpus of about 3 million short texts totaling almost 2 GB, it takes close to 24 hours to lemmatize and write to disk. Is that reasonable?

I have tried disabling other parts of the processing pipeline (parser, tagger) and found that doing so broke the lemmatization.

Are there any parts of the default processing pipeline that are not required for lemmatization besides named entity recognition?

Are there other ways of speeding up the spaCy lemmatization process?

Aside:

It also appears that the documentation doesn't list all the operations in the processing pipeline. At the top of the spaCy Language class we have:

factories = {
    'tokenizer': lambda nlp: nlp.Defaults.create_tokenizer(nlp),
    'tensorizer': lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
    'tagger': lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
    'parser': lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
    'ner': lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
    'similarity': lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
    'textcat': lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
    'sbd': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'sentencizer': lambda nlp, **cfg: SentenceSegmenter(nlp.vocab, **cfg),
    'merge_noun_chunks': lambda nlp, **cfg: merge_noun_chunks,
    'merge_entities': lambda nlp, **cfg: merge_entities
}

which includes some items not covered in the documentation here: https://spacy.io/usage/processing-pipelines

Since they are not covered, I don't really know which of them may be disabled, nor what their dependencies are.
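For what it's worth, the components a loaded model actually exposes (and which can therefore be passed to disable) can be listed at runtime; the output below is what I'd expect from a default English model:

>>> import spacy
>>> nlp = spacy.load('en')
>>> nlp.pipe_names   # components after the tokenizer, e.g. for the default English model
['tagger', 'parser', 'ner']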

Upvotes: 13

Views: 7281

Answers (2)

TR517

Reputation: 341

I found out that you can disable the parser portion of the spaCy pipeline as well, as long as you add the sentence segmenter. It's not blazingly fast, but it is definitely an improvement: in my tests the runtime is about 1/3 of what it was before (when I was only disabling 'ner'). Here is what I have now:

nlp = spacy.load('en', disable=['ner', 'parser'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))
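
In case it helps, a rough sketch of how this fits into the nlp.pipe call from the question (assuming docs_generator yields plain-text strings, as there):

import spacy

nlp = spacy.load('en', disable=['ner', 'parser'])
# the sentencizer supplies sentence boundaries in place of the parser
nlp.add_pipe(nlp.create_pipe('sentencizer'))

for doc in nlp.pipe(docs_generator, batch_size=200):
    lemmas = [token.lemma_ for token in doc]
    # ... write lemmas to disk here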

Upvotes: 11

DhruvPathak

Reputation: 43265

  • One quick and impactful optimization would be memoization, using suitable in-memory structures or in-memory databases (Python dicts, or Redis/Memcached).
  • The lemmatized form of a word, given its context such as its part-of-speech tag, is constant and does not change, so there is no need to spend computing power on it again and again.
  • There would be LOADS of repetition in your 3-million-text corpus, and memoization would cut the time down hugely (a rough sketch follows the example below).

Example:

>>> import spacy
>>> nlp = spacy.load('en')
>>> txt1 = u"he saw the dragon, he saw the forest, and used a saw to cut the tree, and then threw the saw in the river." 
>>> [(x.text,x.pos_,x.lemma_) for x in nlp(txt1)]
[(u'he', u'PRON', u'-PRON-'), (u'saw', u'VERB', u'see'), (u'the', u'DET', u'the'), (u'dragon', u'NOUN', u'dragon'), (u',', u'PUNCT', u','), (u'he', u'PRON', u'-PRON-'), (u'saw', u'VERB', u'see'), (u'the', u'DET', u'the'), (u'forest', u'NOUN', u'forest'), (u',', u'PUNCT', u','), (u'and', u'CCONJ', u'and'), (u'used', u'VERB', u'use'), (u'a', u'DET', u'a'), (u'saw', u'NOUN', u'saw'), (u'to', u'PART', u'to'), (u'cut', u'VERB', u'cut'), (u'the', u'DET', u'the'), (u'tree', u'NOUN', u'tree'), (u',', u'PUNCT', u','), (u'and', u'CCONJ', u'and'), (u'then', u'ADV', u'then'), (u'threw', u'VERB', u'throw'), (u'the', u'DET', u'the'), (u'saw', u'NOUN', u'saw'), (u'in', u'ADP', u'in'), (u'the', u'DET', u'the'), (u'river', u'NOUN', u'river'), (u'.', u'PUNCT', u'.')]

As you can see, for a given word and POS tag the lemmatized form is constant.
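
A rough sketch of the memoization idea, with a plain dict as the cache and assuming the repeated units are whole short texts:

import spacy

nlp = spacy.load('en', disable=['ner'])
lemma_cache = {}  # raw text -> list of (text, POS, lemma) tuples

def lemmatize(text):
    """Run spaCy only the first time a given text is seen; reuse the cached result afterwards."""
    if text not in lemma_cache:
        lemma_cache[text] = [(t.text, t.pos_, t.lemma_) for t in nlp(text)]
    return lemma_cache[text]

If the cache needs to be shared across processes or runs, the dict can be swapped for Redis/Memcached, keyed by the text (or a hash of it).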

Upvotes: 0
