Reputation: 34499
I have a very simple list comprehension I would like to parallelize:
nlp = spacy.load(model)
texts = sorted(X['text'])
# TODO: Parallelize
docs = [nlp(text) for text in texts]
However, when I try using Pool from the multiprocessing module like so:
docs = Pool().map(nlp, texts)
it gives me the following error:
Traceback (most recent call last):
File "main.py", line 117, in <module>
main()
File "main.py", line 99, in main
docs = parse_docs(X)
File "main.py", line 81, in parse_docs
docs = Pool().map(nlp, texts)
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 608, in get
raise self._value
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 385, in _handle_tasks
put(task)
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'
Is it possible to do this parallel computation without having to make the objects picklable? I'm open to solutions tied to third-party libraries such as joblib.
Edit: I also tried
docs = Pool().map(nlp.__call__, texts)
and that didn't work either.
Upvotes: 0
Views: 1990
Reputation: 6156
A workaround could be the following:
texts = ["Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season.",
"The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title.",
"The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.",
"As this was the 50th Super Bowl, the league emphasized the"]
import multiprocessing as mp
import spacy

def init():
    global nlp
    nlp = spacy.load('en')

def func(text):
    return nlp(text)

if __name__ == '__main__':  # required on Windows, where worker processes are spawned
    with mp.Pool(initializer=init) as pool:
        docs = pool.map(func, texts)
Printing the tokens of each parsed doc:
for doc in docs:
    print([w.text for w in doc])
outputs
['Super', 'Bowl', '50', 'was', 'an', 'American', 'football', 'game', 'to', 'determine', 'the', 'champion', 'of', 'the', 'National', 'Football', 'League', '(', 'NFL', ')', 'for', 'the', '2015', 'season', '.']
['The', 'American', 'Football', 'Conference', '(', 'AFC', ')', 'champion', 'Denver', 'Broncos', 'defeated', 'the', 'National', 'Football', 'Conference', '(', 'NFC', ')', 'champion', 'Carolina', 'Panthers', '24–10', 'to', 'earn', 'their', 'third', 'Super', 'Bowl', 'title', '.']
['The', 'game', 'was', 'played', 'on', 'February', '7', ',', '2016', ',', 'at', 'Levi', "'s", 'Stadium', 'in', 'the', 'San', 'Francisco', 'Bay', 'Area', 'at', 'Santa', 'Clara', ',', 'California', '.']
['As', 'this', 'was', 'the', '50th', 'Super', 'Bowl', ',', 'the', 'league', 'emphasized', 'the']
Upvotes: 0
Reputation: 5784
Most likely not. You're probably trying to share something that is unsafe to share across processes at a lower level, e.g. something holding open file descriptors. There's some discussion here on why it's not picklable, and they vaguely say it's for something like that reason. Why not load nlp separately in each process?
There's more here too; it seems to be a general issue with spacy that they're working on resolving: https://github.com/explosion/spaCy/issues/1045
Upvotes: 1