falsum

Reputation: 359

Memory usage when using spaCy Doc extensions

Issue

Before preprocessing my data with spaCy, I typically have it stored in a Pandas Series. Since I'd like to preserve the index of each document before serializing my Docs, I decided to use an extension attribute. However, I noticed a dramatic increase in memory usage until my system ran out of memory. I'm not sure what I might be doing wrong.

Here is how I add the extension: after initializing the Language class, I register it with Doc.set_extension("idx", default=None), then run nlp.pipe over my texts and set idx on each Doc:

    def stream_text_series(nlp, series):
        # Pair each text with its Series index so the index survives nlp.pipe
        data = ((text, {"idx": str(idx)}) for idx, text in series.items())
        for doc, context in nlp.pipe(data, as_tuples=True):
            doc._.idx = context["idx"]
            yield doc
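
For reference, the surrounding setup looks roughly like this (a minimal sketch; the en_core_web_sm model and the sample Series are assumptions, not my actual data):

    import pandas as pd
    import spacy
    from spacy.tokens import Doc, DocBin

    # Register the extension once, before any Docs are created
    if not Doc.has_extension("idx"):
        Doc.set_extension("idx", default=None)

    nlp = spacy.load("en_core_web_sm")  # assumption: any pipeline would do
    series = pd.Series(["first text", "second text"], index=[10, 42])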

When saving my data as a DocBin, I create the DocBin with store_user_data=True so that my extension values are serialized:

    def convert_text_series_to_docs_and_serialize(nlp, series):
        # store_user_data=True serializes doc.user_data, which holds extension values
        doc_bin = DocBin(store_user_data=True)
        for doc in stream_text_series(nlp, series):
            doc_bin.add(doc)
        return doc_bin.to_bytes()
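
To show the round trip I'm aiming for (again a sketch; deserialize_and_check is a hypothetical helper, not part of my code), the bytes can be loaded back with DocBin.from_bytes, and with store_user_data=True the idx values should be restored from the stored user_data:

    def deserialize_and_check(nlp, data_bytes):
        # get_docs rebuilds Docs against the pipeline's vocab;
        # the idx extension comes back via the serialized user_data
        doc_bin = DocBin().from_bytes(data_bytes)
        for doc in doc_bin.get_docs(nlp.vocab):
            print(doc._.idx, doc.text)

    data_bytes = convert_text_series_to_docs_and_serialize(nlp, series)
    deserialize_and_check(nlp, data_bytes)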

Question: Am I implementing the extension feature incorrectly? Any thoughts on how I might proceed? Any suggestions are more than welcome!

Further details

Upvotes: 0

Views: 68

Answers (0)
