formicaman

Reputation: 1357

spaCy PhraseMatcher running out of memory/utilizing 100% CPU

I am trying to create a PhraseMatcher with 20 million patterns. For example:

import random
import string
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

terms = [''.join(random.choices(string.ascii_uppercase, k=4)) for i in range(20000000)]
nlp = English()
matcher_large = PhraseMatcher(nlp.vocab, attr='LOWER')
terms_large = list(nlp.tokenizer.pipe(terms))
matcher_large.add('Terms', None, *terms_large)

This causes the kernel to die in Jupyter, or the process to get killed when run from the terminal, and it runs at 100% CPU the whole time. Is there a less memory-intensive way to create this matcher? I thought about creating matchers in chunks, but I don't want to end up with hundreds of matchers.

Upvotes: 1

Views: 209

Answers (1)

aab

Reputation: 11484

It's true that the PhraseMatcher may not be the best choice for this many patterns, but you can add the patterns incrementally rather than building a huge list up front and passing an equally huge number of args to a single add call:

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for doc in nlp.tokenizer.pipe(terms):
    matcher.add("Terms", [doc])  # newer add() API: patterns passed as a list

Jupyter notebooks often have a relatively low default memory limit, which is probably what you're running into.

Upvotes: 1
