Reputation: 1357
I am trying to create a PhraseMatcher with 20 million patterns. For example:
import random
import string
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

terms = [''.join(random.choices(string.ascii_uppercase, k=4)) for i in range(20000000)]
nlp = English()
matcher_large = PhraseMatcher(nlp.vocab, attr='LOWER')
terms_large = list(nlp.tokenizer.pipe(terms))
matcher_large.add('Terms', None, *terms_large)
This is causing the kernel to die in Jupyter, or the process to get killed in the terminal. It was also running at 100% CPU. Is there a less memory-intensive way to create this matcher? I thought about creating matchers in chunks, but I don't want to end up with hundreds of matchers.
Upvotes: 1
Views: 209
Reputation: 11484
It's true that the PhraseMatcher may not be the best choice for this many patterns, but you can add the patterns incrementally rather than building a huge list up front and passing a likewise huge number of arguments to a single add call:
for doc in nlp.tokenizer.pipe(terms):
    matcher_large.add("Terms", [doc])  # newer add() API: key plus a list of Doc patterns
Jupyter notebooks often have a relatively low default memory limit, which is probably what you're running into.
Upvotes: 1