Reputation: 142
Below I am showing a few entries of my dataframe. Each of my dataframes has millions of rows.
import pandas as pd
data = [{'stamp':'12/31/2020 9:35:42 AM', 'value': 21.99, 'trigger': True},
{'stamp':'12/31/2020 10:35:42 AM', 'value': 22.443, 'trigger': False},
{'stamp':'12/31/2020 11:35:42 AM', 'value': 19.00, 'trigger': False},
{'stamp':'12/31/2020 9:45:42 AM', 'value': 45.02, 'trigger': False},
{'stamp':'12/31/2020 9:55:42 AM', 'value': 48, 'trigger': False},
{'stamp':'12/31/2020 11:35:42 AM', 'value': 48.99, 'trigger': False}]
df = pd.DataFrame(data)
Below are a few ways I can save it:
df.to_parquet('df.parquet', compression = 'gzip')
df.to_csv('df.csv')
I don't see much improvement with to_parquet as compared to to_csv. I wish to minimize the file size on the hard drive. Is there any way to do this?
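For reference, this is a minimal way to compare the resulting sizes on disk (just os.path.getsize on the two files written above):
import os
for name in ['df.parquet', 'df.csv']:
    print(name, os.path.getsize(name), 'bytes')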
Upvotes: 1
Views: 5555
Reputation: 2541
Parquet gives you compression over a column when that column has (for example) many continuous runs of the same value (see the wiki for more). From your example data only trigger shows a sign of that, but the improvement may not be large, because it was not the column taking up the most space in the first place.
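As a rough sketch of this column-wise effect (the column names repeated and unique and the file names are just illustrative, not from your data), you can write each kind of column to its own Parquet file and compare sizes:
import os
import numpy as np
import pandas as pd

n = 1_000_000
demo = pd.DataFrame({
    'repeated': np.repeat([True, False], n // 2),      # long runs of the same value
    'unique': [f'12/31/2020 {i}' for i in range(n)],   # mostly distinct strings
})
demo[['repeated']].to_parquet('repeated.parquet', compression='gzip')
demo[['unique']].to_parquet('unique.parquet', compression='gzip')
print(os.path.getsize('repeated.parquet'), os.path.getsize('unique.parquet'))
The repeated column compresses to a tiny file, while the unique string column dominates the size, which is why converting stamp helps more than relying on trigger.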
Saving an integer is cheaper than saving a long string, so you may consider changing your stamp from str into a timestamp value, which is an int, by doing this:
import numpy as np
df['stamp'] = pd.to_datetime(df['stamp']).values.astype(np.int64) // 10**9
We divide by 10**9 because your stamp appears to be precise only to the second, rather than the nanosecond, which is the default resolution.
But you will need to convert it back to the readable datetime form the next time you read the saved data, by
df['stamp'] = pd.to_datetime(df['stamp'] * 10**9)
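Putting it together, a minimal sketch of the full round trip (assuming the df from your question with stamp still a string; the file name df_small.parquet is just an example):
import numpy as np
import pandas as pd

# convert the string stamp to integer seconds before saving
df['stamp'] = pd.to_datetime(df['stamp']).values.astype(np.int64) // 10**9
df.to_parquet('df_small.parquet', compression='gzip')

# on the next read, restore the readable datetime form
df2 = pd.read_parquet('df_small.parquet')
df2['stamp'] = pd.to_datetime(df2['stamp'] * 10**9)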
Upvotes: 1