Reputation: 359
Before preprocessing my data with spaCy, I typically have my data stored in a Pandas Series. Since I'd like to preserve the index of each document before serializing my Docs, I decided to use an extension attribute. However, I noticed a dramatic increase in memory usage until my system ran out of memory. I'm not sure what I might be doing wrong.
After initializing the Language class, I register the extension with Doc.set_extension("idx", default=None). I then run nlp.pipe on my text and set idx on each Doc:
def stream_text_series(series):
    # Pair each text with its Series index so the index travels
    # through nlp.pipe alongside the resulting Doc.
    data = ((text, {"idx": str(idx)}) for idx, text in series.items())
    for doc, context in nlp.pipe(data, as_tuples=True):
        doc._.idx = context["idx"]
        yield doc
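For reference, the registration happens once up front, roughly like this (the model name is just a placeholder, any pipeline would do):

import spacy
from spacy.tokens import Doc

# Register the extension once, before creating any Docs;
# re-registering the same name raises an error unless force=True.
if not Doc.has_extension("idx"):
    Doc.set_extension("idx", default=None)

# "en_core_web_sm" is only an example pipeline here.
nlp = spacy.load("en_core_web_sm")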
When saving my data as a DocBin, I create it with store_user_data=True so that my extension is serialized:
def convert_text_series_to_docs_and_serialize(series):
    # store_user_data=True is needed so custom extension data
    # ends up in the serialized bytes.
    doc_bin = DocBin(store_user_data=True)
    for doc in stream_text_series(series):
        doc_bin.add(doc)
    return doc_bin.to_bytes()
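For context, this is a rough sketch of how I read the bytes back (assuming the idx extension is already registered and the same nlp object is in scope):

from spacy.tokens import DocBin

# Deserialize and check that the custom idx attribute survived
# the round trip; the extension must be registered beforehand.
doc_bin = DocBin(store_user_data=True).from_bytes(
    convert_text_series_to_docs_and_serialize(series)
)
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc._.idx, doc.text[:50])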
Question: Am I implementing the extension feature incorrectly? Any thoughts on how I might proceed? Any suggestions are more than welcome!
Upvotes: 0
Views: 68