Reputation: 359
Before preprocessing my data with spaCy, I typically have my data stored in a Pandas Series. Since I'd like to preserve the index of each document before serializing my Docs, I decided to use an extension attribute. However, I noticed a dramatic increase in memory usage until my system ran out of memory. I'm not sure what I might be doing wrong.
After initializing the Language class, I register the extension with Doc.set_extension("idx", default=None). I then run nlp.pipe on my text and set idx on each Doc:
def stream_text_series(series):
    # Pair each text with its Series index so the index travels
    # through nlp.pipe alongside the resulting Doc.
    data = ((text, {"idx": str(idx)}) for idx, text in series.items())
    for doc, context in nlp.pipe(data, as_tuples=True):
        doc._.idx = context["idx"]
        yield doc
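For reference, the registration happens once up front, roughly like this (the model name is just a placeholder, any pipeline would do):

import spacy
from spacy.tokens import Doc

# Register the extension once, before creating any Docs;
# re-registering the same name raises an error unless force=True.
if not Doc.has_extension("idx"):
    Doc.set_extension("idx", default=None)

# "en_core_web_sm" is only an example pipeline here.
nlp = spacy.load("en_core_web_sm")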
When saving my data as a DocBin, I create it with store_user_data=True so that my extension is serialized:
def convert_text_series_to_docs_and_serialize(series):
    # store_user_data=True is needed so custom extension data
    # ends up in the serialized bytes.
    doc_bin = DocBin(store_user_data=True)
    for doc in stream_text_series(series):
        doc_bin.add(doc)
    return doc_bin.to_bytes()
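For context, this is a rough sketch of how I read the bytes back (assuming the idx extension is already registered and the same nlp object is in scope):

from spacy.tokens import DocBin

# Deserialize and check that the custom idx attribute survived
# the round trip; the extension must be registered beforehand.
doc_bin = DocBin(store_user_data=True).from_bytes(
    convert_text_series_to_docs_and_serialize(series)
)
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc._.idx, doc.text[:50])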
Question: Am I implementing the extension feature incorrectly? Any thoughts on how I might proceed? Any suggestions are more than welcome!
Upvotes: 0
Views: 68