user13744439

Reputation: 142

How to compress pandas dataframe

Below I am showing a few entries of my dataframe. Each of my dataframes has millions of rows.

import pandas as pd

data = [{'stamp':'12/31/2020 9:35:42 AM', 'value': 21.99, 'trigger': True}, 
        {'stamp':'12/31/2020 10:35:42 AM', 'value': 22.443, 'trigger': False}, 
        {'stamp':'12/31/2020 11:35:42 AM', 'value': 19.00, 'trigger': False}, 
        {'stamp':'12/31/2020 9:45:42 AM', 'value': 45.02, 'trigger': False}, 
        {'stamp':'12/31/2020 9:55:42 AM', 'value': 48, 'trigger': False}, 
        {'stamp':'12/31/2020 11:35:42 AM', 'value': 48.99, 'trigger': False}]
df = pd.DataFrame(data)

Below are a few ways I can save it:

df.to_parquet('df.parquet', compression = 'gzip')
df.to_csv('df.csv')

I don't see much improvement with to_parquet compared to to_csv. I wish to minimize the file size on disk. Is there any way to do this?

Upvotes: 1

Views: 5555

Answers (1)

Raymond Kwok

Reputation: 2541

Parquet compresses a column well when that column has, for example, many long runs of the same value (see the Wikipedia article on run-length encoding for more). In your example data only trigger shows a sign of that, and the improvement may not be large because that column was not the one taking up the most space in the first place.
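If you want to check what compression actually buys you, comparing the on-disk sizes directly is the quickest test. Below is a minimal sketch using the example data from your question (it assumes a parquet engine such as pyarrow or fastparquet is installed). Note that on a frame this small parquet's own metadata overhead can dominate, so the numbers only become meaningful on your real, million-row data.

import os
import pandas as pd

# Example frame from the question; with millions of rows the gap between
# formats becomes far more visible than it is here.
data = [{'stamp': '12/31/2020 9:35:42 AM', 'value': 21.99, 'trigger': True},
        {'stamp': '12/31/2020 10:35:42 AM', 'value': 22.443, 'trigger': False}]
df = pd.DataFrame(data)

df.to_csv('df.csv', index=False)                 # plain-text baseline
df.to_parquet('df.parquet', compression='gzip')  # columnar storage + gzip

for path in ('df.csv', 'df.parquet'):
    print(path, os.path.getsize(path), 'bytes')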

Saving an integer is cheaper than saving a long string, so you may consider changing your stamp column from str to an integer timestamp value, like this:

import numpy as np

df['stamp'] = pd.to_datetime(df['stamp']).values.astype(np.int64) // 10**9

We divide by 10**9 because your stamps appear to be precise only to the second, whereas pandas stores datetimes in nanoseconds by default.
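To see how much the integer representation saves in the first place, you can compare the in-memory footprint of the two forms. This is a minimal sketch; the exact byte counts will vary by platform and pandas version.

import numpy as np
import pandas as pd

stamps = pd.Series(['12/31/2020 9:35:42 AM'] * 1000)

# Each Python string costs tens of bytes per row; an int64 is a flat 8 bytes.
ints = pd.Series(pd.to_datetime(stamps).values.astype(np.int64) // 10**9)

print('as strings:', stamps.memory_usage(deep=True), 'bytes')
print('as int64:  ', ints.memory_usage(deep=True), 'bytes')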

When you read the saved data back, you will need to convert the column to a readable datetime again:

df['stamp'] = pd.to_datetime(df['stamp'] * 10**9)  # back to nanoseconds, then datetime
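Putting it together, a round trip might look like the sketch below (the filename is illustrative). pd.to_datetime(..., unit='s') is an equivalent way to undo the // 10**9 step in a single call.

import numpy as np
import pandas as pd

df = pd.DataFrame({'stamp': ['12/31/2020 9:35:42 AM', '12/31/2020 10:35:42 AM'],
                   'value': [21.99, 22.443],
                   'trigger': [True, False]})

# Write: shrink stamp to integer seconds before saving.
df['stamp'] = pd.to_datetime(df['stamp']).values.astype(np.int64) // 10**9
df.to_parquet('df.parquet', compression='gzip')

# Read: unit='s' undoes the // 10**9 step, equivalent to multiplying by
# 10**9 and relying on the nanosecond default.
df2 = pd.read_parquet('df.parquet')
df2['stamp'] = pd.to_datetime(df2['stamp'], unit='s')
print(df2)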

Upvotes: 1
