Reputation: 1357
I am trying to create a PhraseMatcher with 20 million patterns. For example:
import random
import string
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

terms = [''.join(random.choices(string.ascii_uppercase, k=4)) for i in range(20000000)]
nlp = English()
matcher_large = PhraseMatcher(nlp.vocab, attr='LOWER')
terms_large = list(nlp.tokenizer.pipe(terms))
matcher_large.add('Terms', None, *terms_large)
This is causing the kernel to die in Jupyter, or the process to get killed in the terminal. It was also running at 100% CPU. Is there a less memory-intensive way to create this matcher? I thought about creating matchers in chunks, but I don't want to end up with hundreds of matchers.
Upvotes: 1
Views: 209
Reputation: 11484
It's true that the PhraseMatcher may not be the best choice for this many patterns, but you can add the patterns incrementally rather than building a huge list up front and passing a likewise huge number of arguments to a single add call:
for doc in nlp.tokenizer.pipe(terms):
    matcher_large.add("Terms", [doc])  # newer add() API: key plus a list of Doc patterns
Jupyter notebooks often have a relatively low default memory limit, which is probably what you're running into.
Upvotes: 1