Aus_10

Reputation: 770

Is there a good way of saving a Spacy doc in a Pandas dataframe

I'm in the process of figuring this out, but I wanted to document it on Stack Overflow since it wasn't easily searchable. (Also, hopefully someone can answer this before I do.)

df.loc[:,'corpus_spacy_doc'] = df['text_corpus'].apply(lambda cell: nlp(cell))

So now I can do all sorts of NLP stuff with corpus_spacy_doc, which is great. But I would like a good way of saving the state of this dataframe, since df.to_csv() obviously won't work. I've been looking to see if this is possible with parquet, but I don't think it is.

As of right now, it seems my best solution is to use spaCy's method of serializing the list of docs (https://spacy.io/usage/saving-loading) and load it back into a pandas dataframe later.
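For reference, the serialization approach from that page can be sketched with spaCy's DocBin, storing the docs and the rest of the frame side by side (a minimal sketch using a blank pipeline and placeholder filenames):

```python
import os
import tempfile

import pandas as pd
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a trained pipeline like en_core_web_sm works the same way
df = pd.DataFrame({"text_corpus": ["First document.", "Second document."]})
df["corpus_spacy_doc"] = df["text_corpus"].apply(nlp)

with tempfile.TemporaryDirectory() as tmp:
    # Serialize the doc column with DocBin, and the rest of the frame separately
    doc_bin = DocBin()
    for doc in df["corpus_spacy_doc"]:
        doc_bin.add(doc)
    doc_bin.to_disk(os.path.join(tmp, "docs.spacy"))
    df.drop(columns="corpus_spacy_doc").to_csv(os.path.join(tmp, "frame.csv"), index=False)

    # Later: reload and reattach the docs in the same row order
    df2 = pd.read_csv(os.path.join(tmp, "frame.csv"))
    docs = DocBin().from_disk(os.path.join(tmp, "docs.spacy")).get_docs(nlp.vocab)
    df2["corpus_spacy_doc"] = list(docs)
```

Note that DocBin only stores the docs, so keeping the row order consistent between the two files is on you.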

To summarize, I want a pythonic way of doing something like

df.to_something(fname = fname)

Has anyone else gone through this or have a good answer?


Upvotes: 1

Views: 649

Answers (2)

Melcfrn

Reputation: 7

I'm not sure I understand this paragraph, and possibly my solution is the same:

As of right now it seems my best solution is using the spacy method of serializing the list of docs (https://spacy.io/usage/saving-loading) and loading with pandas dataframe later.

But if not, you can convert each doc to bytes (https://spacy.io/api/doc#to_bytes) so that your dataframe can be saved to parquet:

# Replace each Doc with its serialized bytes so the column is parquet-friendly
df['corpus_spacy_doc'] = df['corpus_spacy_doc'].apply(lambda x: x.to_bytes())
df.to_parquet(path, engine="pyarrow")

Upvotes: 0

Aus_10

Reputation: 770

So this was pretty easy: for what I'm doing, it's solved with a regular df.to_pickle().
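For the record, the round trip looks like this (a minimal sketch with a blank pipeline; pickle requires the same spaCy and pandas versions when loading):

```python
import os
import tempfile

import pandas as pd
import spacy

nlp = spacy.blank("en")
df = pd.DataFrame({"text_corpus": ["one doc", "another doc"]})
df["corpus_spacy_doc"] = df["text_corpus"].apply(nlp)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")

    # Pickle preserves arbitrary Python objects, including the Doc column
    df.to_pickle(path)
    df2 = pd.read_pickle(path)
```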

Upvotes: 1
