Reputation: 15813
Writing a pandas DataFrame to a gzip-compressed CSV adds the current timestamp to the archive header:
import pandas as pd
df = pd.DataFrame({'a': [1]})
df.to_csv('df.csv.gz', compression='gzip')
# Timestamp is the large number per https://unix.stackexchange.com/a/79546/88807.
!<df.csv.gz dd bs=4 skip=1 count=1 | od -t d4
# 1+0 records in
# 1+0 records out
# 4 bytes copied, 5.6233e-05 s, 71.1 kB/s
# 0000000 1546978755
# 0000004
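For reference, the same check in plain Python (a minimal sketch assuming the df.csv.gz written above; per RFC 1952 the MTIME field is the 4-byte little-endian integer at offset 4 of the gzip header):
import struct
# The fixed gzip header is 10 bytes; MTIME sits at offsets 4-7, little-endian.
with open('df.csv.gz', 'rb') as fh:
    header = fh.read(10)
mtime, = struct.unpack('<I', header[4:8])
print(mtime)  # e.g. 1546978755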
I'd like to write it without the timestamp, such that two subsequent exports of the same DataFrame are identical:
df.to_csv('df2.csv.gz', compression='gzip')
import filecmp
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# False
Upvotes: 0
Views: 569
Reputation: 16184
After looking through the Pandas code for CSV writing, the best I can suggest is to use the gzip module directly. That way you can set the mtime attribute explicitly, which seems to be what you want:
import pandas as pd
from gzip import GzipFile
from io import TextIOWrapper
def to_gzip_csv_no_timestamp(df, f, **kwargs):
    """Write a pandas DataFrame to a .csv.gz file, without a timestamp in the
    archive header, using GzipFile and TextIOWrapper.

    Args:
        df: pandas DataFrame.
        f: Output filename; the gzipped data is written to exactly this path.
        **kwargs: Other keyword arguments passed to to_csv().

    Returns:
        Nothing.
    """
    with TextIOWrapper(GzipFile(f, 'w', mtime=0), encoding='utf-8') as fd:
        df.to_csv(fd, **kwargs)
to_gzip_csv_no_timestamp(df, 'df.csv.gz')
to_gzip_csv_no_timestamp(df, 'df2.csv.gz')
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# True
This outperforms the two-step subprocess approach below for this tiny dataset:
%timeit to_gzip_csv_no_timestamp(df, 'df.csv')
693 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_gzip_csv_no_timestamp_subprocess(df, 'df.csv')
10.2 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm using a TextIOWrapper() to handle converting strings to bytes as Pandas does, but you could also do this if you know you're not going to be saving much data:
with GzipFile('df.csv.gz', 'w', mtime=0) as fd:
    fd.write(df.to_csv().encode('utf-8'))
Note that gzip -lv df.csv.gz still shows the "current time", but it's just pulling this from the inode's mtime. Dumping the file with hexdump -C shows the value actually saved in the archive, and changing the file's mtime (with touch -mt 0711171533 df.csv.gz) causes gzip to display a different value.
Also note that the original filename is also stored in the gzip header, so you'll need to write to the same name (or also override this) to make the output deterministic.
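If you don't want the output to depend on the file name at all, one option (a sketch only; the helper name and output paths are illustrative, and it reuses df, GzipFile, TextIOWrapper and filecmp from above) is to open the target file yourself and pass filename='' so that GzipFile writes no FNAME field:
def to_gzip_csv_no_timestamp_no_name(df, f, **kwargs):
    # An explicit empty filename plus an already-open file object stops
    # GzipFile from writing the FNAME header field; mtime=0 zeroes MTIME.
    with open(f, 'wb') as raw:
        with TextIOWrapper(GzipFile(filename='', mode='wb', fileobj=raw, mtime=0),
                           encoding='utf-8') as fd:
            df.to_csv(fd, **kwargs)

to_gzip_csv_no_timestamp_no_name(df, 'df.csv.gz')
to_gzip_csv_no_timestamp_no_name(df, 'another_name.csv.gz')
filecmp.cmp('df.csv.gz', 'another_name.csv.gz')
# True (the header stores neither a name nor a timestamp)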
Upvotes: 1
Reputation: 15813
You can export an uncompressed CSV and then call gzip with the -n flag, which tells it not to store the timestamp (or the original file name) in the archive:
import subprocess

def to_gzip_csv_no_timestamp_subprocess(df, f, **kwargs):
    """Write a pandas DataFrame to a .csv.gz file, without a timestamp in the
    archive header.

    Args:
        df: pandas DataFrame.
        f: Filename string ending in .csv (not .csv.gz).
        **kwargs: Other keyword arguments passed to to_csv().

    Returns:
        Nothing.
    """
    df.to_csv(f, **kwargs)
    # -n omits the timestamp and original file name; -f overwrites an existing .gz.
    subprocess.check_call(['gzip', '-nf', f])
to_gzip_csv_no_timestamp_subprocess(df, 'df.csv')
to_gzip_csv_no_timestamp_subprocess(df, 'df2.csv')
filecmp.cmp('df.csv.gz', 'df2.csv.gz')
# True
Upvotes: 0