Reputation: 770
I'm in the process of figuring this out, but I wanted to document it on Stack Overflow since it wasn't easily searchable. (Also, hopefully someone can answer this before I do.)
df.loc[:,'corpus_spacy_doc'] = df['text_corpus'].apply(lambda cell: nlp(cell))
So now I can do all sorts of NLP work on corpus_spacy_doc, which is great. But I'd like a good way of saving the state of this DataFrame, since df.to_csv() obviously won't work. I've been looking into whether this is possible with Parquet, but I don't think it is.
As of right now, my best option seems to be spaCy's own method of serializing the list of docs (https://spacy.io/usage/saving-loading) and loading them back into a pandas DataFrame later.
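For reference, the spaCy serialization approach mentioned above looks roughly like this sketch. It uses `DocBin`, spaCy's container for efficiently serializing a collection of docs; the blank pipeline, sample texts, and variable names are just illustrative assumptions.

```python
import pandas as pd
import spacy
from spacy.tokens import DocBin

# Blank pipeline keeps the sketch self-contained (no model download needed)
nlp = spacy.blank("en")
df = pd.DataFrame({"text_corpus": ["First document.", "Second document."]})
df["corpus_spacy_doc"] = df["text_corpus"].apply(nlp)

# Pack the docs into a DocBin and serialize (to_bytes here; to_disk also works)
doc_bin = DocBin(store_user_data=True)
for doc in df["corpus_spacy_doc"]:
    doc_bin.add(doc)
data = doc_bin.to_bytes()

# Later: restore the docs (a shared Vocab is required) and reattach the column
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
df2 = pd.DataFrame({"text_corpus": df["text_corpus"]})
df2["corpus_spacy_doc"] = restored
```

Note that the texts themselves still need to be stored alongside (or re-derived from) the docs, since `DocBin` only carries the docs, not the rest of the DataFrame.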
To summarize, I now want a pythonic way of doing something like
df.to_something(fname = fname)
Has anyone else gone through this or have a good answer?
Upvotes: 1
Views: 649
Reputation: 7
I'm not sure I understand this paragraph, and my solution may be the same thing:
As of right now it seems my best solution is using the spacy method of serializing the list of docs (https://spacy.io/usage/saving-loading) and loading with pandas dataframe later.
But if not, you can convert each Doc to bytes (https://spacy.io/api/doc#to_bytes) so that the DataFrame can be saved to Parquet:
df['corpus_spacy_doc'] = df['corpus_spacy_doc'].apply(lambda x: x.to_bytes())
df.to_parquet(path, engine="pyarrow")
Upvotes: 0
Reputation: 770
So this turned out to be pretty easy: for my purposes it's solved by a regular df.to_pickle().
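For completeness, a minimal sketch of that round trip. spaCy Doc objects are picklable, so the whole DataFrame, docs and all, can go through to_pickle/read_pickle; the blank pipeline, sample data, and temp path are assumptions.

```python
import os
import tempfile

import pandas as pd
import spacy

nlp = spacy.blank("en")
df = pd.DataFrame({"text_corpus": ["one doc", "another doc"]})
df["corpus_spacy_doc"] = df["text_corpus"].apply(nlp)

# Pickle the entire DataFrame, Doc column included
path = os.path.join(tempfile.mkdtemp(), "df.pkl")
df.to_pickle(path)

# Read it back; the Doc objects are reconstructed automatically
df2 = pd.read_pickle(path)
```

The usual pickle caveat applies: the file is tied to the pandas/spaCy versions that wrote it, so it's fine for checkpointing your own work but not ideal as a long-term or shared storage format.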
Upvotes: 1