Write pandas DataFrame to a gzip csv without a timestamp on the archive

Question

Writing a pandas DataFrame to a gzip-compressed CSV adds the timestamp to the archive:

import pandas as pd
df = pd.DataFrame({'a': [1]})
df.to_csv('df.csv.gz', compression='gzip')
# Timestamp is the large number per https://unix.stackexchange.com/a/79546/88807.

I'd like to write it without the timestamp, such that two subsequent exports of the same DataFrame are identical:

df.to_csv('df2.csv.gz', compression='gzip')
import filecmp
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# False

Sam Mason · Accepted Answer

After looking through the pandas code for CSV writing, the best I can suggest is to use the gzip module directly. That way you can set the mtime argument yourself, which seems to be what you want:

import pandas as pd
from gzip import GzipFile
from io import TextIOWrapper

def to_gzip_csv_no_timestamp(df, f, **kwargs):
    # Write pandas DataFrame to a .csv.gz file, without a timestamp in the archive
    # header, using GzipFile and TextIOWrapper.
    #
    # Args:
    #     df: pandas DataFrame.
    #     f: Output filename string, e.g. ending in .csv.gz.
    #     **kwargs: Other keyword arguments passed to to_csv().
    #
    # Returns:
    #     Nothing.
    with TextIOWrapper(GzipFile(f, 'w', mtime=0), encoding='utf-8') as fd:
        df.to_csv(fd, **kwargs)

to_gzip_csv_no_timestamp(df, 'df.csv.gz')
to_gzip_csv_no_timestamp(df, 'df2.csv.gz')

filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# True

This outperforms the two-step subprocess approach (to_gzip_csv_no_timestamp_subprocess, not shown here) for this tiny dataset:

%timeit to_gzip_csv_no_timestamp(df, 'df.csv')
693 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit to_gzip_csv_no_timestamp_subprocess(df, 'df.csv')
10.2 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm using a TextIOWrapper() to handle converting strings to bytes as Pandas does, but you could also do this if you know you're not going to be saving much data:

with GzipFile('df.csv.gz', 'w', mtime=0) as fd:
    fd.write(df.to_csv().encode('utf-8'))
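
If the whole CSV fits in memory anyway, another option may be gzip.compress, which accepts an mtime argument on Python 3.8 and later. This is only a sketch of that alternative, not something pandas does for you; gzip_csv_bytes is a made-up helper name, and it takes the CSV text (for example the result of df.to_csv()) rather than the DataFrame itself:

```python
import gzip


def gzip_csv_bytes(csv_text):
    """Compress CSV text to gzip bytes with the header timestamp set to 0.

    Hypothetical helper; assumes Python 3.8+ for the mtime argument.
    """
    return gzip.compress(csv_text.encode('utf-8'), mtime=0)


# Example: two compressions of the same text give the same bytes.
first = gzip_csv_bytes('a\n1\n')
second = gzip_csv_bytes('a\n1\n')
print(first == second)
```

The resulting bytes can then be written with open(path, 'wb'). Since no file name is involved in the compression step, none is stored in the gzip header.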

Note that gzip -lv df.csv.gz still shows the "current time", but it's just pulling this from the inode's mtime. Dumping with hexdump -C shows the value is saved in the file, and changing the file's mtime (with touch -mt 0711171533 df.csv.gz) causes gzip to display a different value.
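
As a rough Python equivalent of the hexdump check, the stored timestamp can be read straight from the gzip header: bytes 4 to 7 hold the MTIME field as a little-endian unsigned 32-bit integer, with 0 meaning "no timestamp". The helper name read_gzip_header_mtime is an assumption made for this sketch:

```python
import struct


def read_gzip_header_mtime(path):
    """Return the MTIME field stored in a gzip file's header (0 means unset).

    Hypothetical helper for inspection only; it reads the first member's header.
    """
    with open(path, 'rb') as fh:
        header = fh.read(10)  # fixed-size part of the gzip header
    if len(header) < 10 or header[:2] != b'\x1f\x8b':
        raise ValueError(f'{path!r} does not look like a gzip file')
    # Bytes 4-7: MTIME, little-endian unsigned 32-bit seconds since the epoch.
    (mtime,) = struct.unpack('<I', header[4:8])
    return mtime
```

For a file written with mtime=0 this should return 0 regardless of what the filesystem reports as the file's modification time.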

Also note that the original "filename" is stored in the gzipped file too, so you'll need to write to the same name (or override this as well) to make the output deterministic.
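
One way to override the stored name, sketched here under assumptions, is to open the output file yourself and pass it to GzipFile as fileobj together with filename=''. With an empty name, GzipFile leaves the file-name field out of the header. The function name to_gzip_csv_deterministic is made up for this example, and newline='' is an extra choice of mine to stop TextIOWrapper from translating line endings:

```python
from gzip import GzipFile
from io import TextIOWrapper


def to_gzip_csv_deterministic(df, path, **kwargs):
    """Write df as gzip-compressed CSV with no timestamp or file name in the header.

    Hypothetical helper; df is anything with a pandas-style to_csv(file_handle).
    """
    with open(path, 'wb') as raw:
        # filename='' keeps the FNAME field out of the header; mtime=0 zeroes MTIME.
        with GzipFile(filename='', mode='wb', fileobj=raw, mtime=0) as gz:
            # newline='' disables newline translation on write (an assumption
            # made here so the bytes do not depend on the platform default).
            with TextIOWrapper(gz, encoding='utf-8', newline='') as text:
                df.to_csv(text, **kwargs)
```

With this variant, writing the same DataFrame to df.csv.gz and df2.csv.gz should give files that filecmp.cmp(..., shallow=False) reports as equal, since neither the timestamp nor the name ends up in the archive.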
