Célia Bayet

Reputation: 65

Abnormally long time to do .to_csv

I've run into a problem I never had before.

I'm just trying to save a dataframe as a CSV with .to_csv(), but after hours it is still running.

My dataframe is all the posts from Stack Overflow for the last year and their associated tags. I used a neural network, SentenceBERT, to embed each post as a vector. The vector size for each post is 768.
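For context, the embedding step looks roughly like this (a minimal sketch; the model name and column names are assumptions, not my exact code):

import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("posts.csv")  # hypothetical input: one post per row

model = SentenceTransformer("all-mpnet-base-v2")  # outputs 768-dimensional vectors
embeddings = model.encode(df["body"].tolist(), show_progress_bar=True)

# one column per embedding dimension, kept alongside the tags
emb_df = pd.DataFrame(embeddings, columns=[f"dim_{i}" for i in range(768)])
df = pd.concat([df[["tags"]], emb_df], axis=1)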

So my final dataframe looks like this: [screenshot of the dataframe]

With 1,194,445 rows.

Is it because it's too big? If so, are there any other solutions to save this dataframe as a CSV?

Thanks!

Upvotes: 0

Views: 172

Answers (1)

AKX

Reputation: 169124

A text CSV file with 1.2 million rows, each containing, say, 512 bytes of other data and a 768-item embedding in text format (assuming each number takes about 12 bytes to print out, delimiters included)

>>> (768*12 + 512) * 1194445
11619560960

will be about 11 gigabytes. Writing that will take a while, and reading it in will take another long while.

For data like this, use a binary format, e.g. pickle via to_pickle() (or something more advanced if you feel like it).
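A minimal sketch of that (file names are placeholders; the Parquet option assumes pyarrow or fastparquet is installed):

import pandas as pd

# df is the dataframe from the question

# pickle: fast binary round-trip, pandas/Python-specific
df.to_pickle("posts_embedded.pkl")
df = pd.read_pickle("posts_embedded.pkl")

# Parquet: compact columnar format, also readable from other tools
df.to_parquet("posts_embedded.parquet")
df = pd.read_parquet("posts_embedded.parquet")

Reading a binary file of this size back in is typically far faster than re-parsing 11 GB of text into floats.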

Upvotes: 1
