Max Ghenis

Reputation: 15813

Write pandas DataFrame to a gzip csv without a timestamp on the archive

Writing a pandas DataFrame to a gzip-compressed CSV embeds a timestamp in the archive header:

import pandas as pd
df = pd.DataFrame({'a': [1]})
df.to_csv('df.csv.gz', compression='gzip')
# Timestamp is the large number per https://unix.stackexchange.com/a/79546/88807.
!<df.csv.gz dd bs=4 skip=1 count=1 | od -t d4
# 1+0 records in
# 1+0 records out
# 4 bytes copied, 5.6233e-05 s, 71.1 kB/s
# 0000000  1546978755
# 0000004
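
The same field can also be read from Python instead of with dd and od. This is a minimal sketch, assuming the standard gzip header layout from RFC 1952 (a little-endian 32-bit MTIME at byte offset 4). The helper name read_gzip_mtime is made up for illustration:

```python
import struct

def read_gzip_mtime(path):
    """Return the MTIME field (seconds since the epoch) from a gzip header."""
    with open(path, 'rb') as fh:
        header = fh.read(8)
    # Bytes 0-1 are the magic number, 2 the method, 3 the flags, 4-7 MTIME.
    if header[:2] != b'\x1f\x8b':
        raise ValueError('not a gzip file: %r' % path)
    return struct.unpack('<I', header[4:8])[0]
```

Calling read_gzip_mtime('df.csv.gz') on the file written above should return the same large number that od prints.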

I'd like to write it without the timestamp, such that two subsequent exports of the same DataFrame are identical:

df.to_csv('df2.csv.gz', compression='gzip')
import filecmp
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# False

Upvotes: 0

Views: 569

Answers (2)

Sam Mason

Reputation: 16184

After looking through the pandas code for CSV writing, the best I can suggest is to use the gzip module directly. That way you can set the mtime argument yourself, which seems to be what you want:

import pandas as pd
from gzip import GzipFile
from io import TextIOWrapper

def to_gzip_csv_no_timestamp(df, f, **kwargs):
    # Write pandas DataFrame to a .csv.gz file, without a timestamp in the archive
    # header, using GzipFile and TextIOWrapper.
    #
    # Args:
    #     df: pandas DataFrame.
    #     f: Output filename, e.g. ending in .csv.gz. It is written as given.
    #     **kwargs: Other keyword arguments passed to to_csv().
    #
    # Returns:
    #     Nothing.
    with TextIOWrapper(GzipFile(f, 'w', mtime=0), encoding='utf-8') as fd:
        df.to_csv(fd, **kwargs)

to_gzip_csv_no_timestamp(df, 'df.csv.gz')
to_gzip_csv_no_timestamp(df, 'df2.csv.gz')

import filecmp
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# True

This outperforms the two-step subprocess approach below for this tiny dataset:

%timeit to_gzip_csv_no_timestamp(df, 'df.csv')
693 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit to_gzip_csv_no_timestamp_subprocess(df, 'df.csv')
10.2 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm using a TextIOWrapper() to handle converting strings to bytes as Pandas does, but you could also do this if you know you're not going to be saving much data:

with GzipFile('df.csv.gz', 'w', mtime=0) as fd:
    fd.write(df.to_csv().encode('utf-8'))

Note that gzip -lv df.csv.gz still shows the "current time", but it's just pulling this from the inode's mtime. Dumping the file with hexdump -C shows the value that is actually saved in the file, and changing the file's mtime (with touch -mt 0711171533 df.csv.gz) causes gzip to display a different value.
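
The difference between the two timestamps can be checked from Python as well. This is only a sketch: it assumes the RFC 1952 header layout (MTIME at bytes 4-7, little-endian), uses os.utime in place of touch, and writes a throwaway file in a temporary directory. The header_mtime helper is hypothetical:

```python
import gzip
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'df.csv.gz')
with gzip.GzipFile(path, 'wb', mtime=0) as gz:
    gz.write(b',a\n0,1\n')

def header_mtime(p):
    # Read the 4-byte MTIME field stored inside the gzip header.
    with open(p, 'rb') as fh:
        return struct.unpack('<I', fh.read(8)[4:8])[0]

before = header_mtime(path)
# Change only the filesystem timestamp, as `touch -mt` would.
os.utime(path, (1_000_000_000, 1_000_000_000))
after = header_mtime(path)

print(before, after)                # the value stored in the file, read twice
print(int(os.stat(path).st_mtime))  # the filesystem (inode) timestamp
```

The header value stays whatever was passed as mtime, while the inode timestamp follows os.utime, which is why gzip -lv can report a time that is not stored in the file.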

Also note that the original file name is stored in the gzipped file too, so you'll need to write to the same name (or override the stored name as well) to make the output deterministic.
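
One way to override the stored name is to open the output file yourself and hand it to GzipFile with an empty filename. This is a sketch rather than anything pandas does itself; the function name write_gzip_reproducible is made up, and it takes the CSV text as a string so that it only depends on the standard library:

```python
import gzip

def write_gzip_reproducible(text, path, encoding='utf-8'):
    """Gzip `text` to `path` with no timestamp and no stored file name."""
    with open(path, 'wb') as raw:
        # filename='' leaves the name out of the header; mtime=0 zeroes the timestamp.
        with gzip.GzipFile(filename='', fileobj=raw, mode='wb', mtime=0) as gz:
            gz.write(text.encode(encoding))
```

With the DataFrame from the question this would be called as write_gzip_reproducible(df.to_csv(), 'df.csv.gz'). Like the shorter GzipFile snippet above, it builds the whole CSV in memory first, so it suits small exports best.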

Upvotes: 1

Max Ghenis

Reputation: 15813

You can export as an uncompressed CSV and then call gzip with the -n flag to avoid timestamping (this is also an instruction to not save the file name in the archive):

import subprocess

def to_gzip_csv_no_timestamp_subprocess(df, f, **kwargs):
    # Write pandas DataFrame to a .csv.gz file, without a timestamp in the archive
    # header.
    # Args:
    #     df: pandas DataFrame.
    #     f: Filename string ending in .csv (not .csv.gz).
    #     **kwargs: Other keyword arguments passed to to_csv().
    # Returns:
    #     Nothing.
    df.to_csv(f, **kwargs)
    # -n omits the name and timestamp from the header; -f overwrites an existing .gz.
    subprocess.check_call(['gzip', '-nf', f])

to_gzip_csv_no_timestamp_subprocess(df, 'df.csv')
to_gzip_csv_no_timestamp_subprocess(df, 'df2.csv')
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# True

Upvotes: 0
