Reputation: 8888
I noticed that column type for timestamp
in the parquet file generated by pandas.to_parquet
can be different depending on the version of pandas, e.g.
In [1]: pd.__version__
Out[1]: '1.0.5'
In [2]: pd.DataFrame([pd.Timestamp('2020-01-01')], columns=['a']).to_parquet('/tmp/test.parquet')
In [3]: !parquet-tools schema /tmp/test.parquet
message schema {
optional int64 a (TIMESTAMP_MILLIS);
}
In [4]: !parquet-tools head /tmp/test.parquet
a = 1577836800000
In [1]: pd.__version__
Out[1]: '1.1.2'
In [2]: pd.DataFrame([pd.Timestamp('2020-01-01')], columns=['a']).to_parquet('/tmp/test.parquet')
In [3]: !parquet-tools schema /tmp/test.parquet
message schema {
optional int64 a (TIMESTAMP_MICROS);
}
In [4]: !parquet-tools head /tmp/test.parquet
a = 1577836800000000
As seen above, pandas-1.0.5
converts the type of timestamp to be TIMESTAMP_MILLIS
while pandas-1.1.2
converts it to be TIMESTAMP_MICROS
.
I'm using pandas-1.1.2
, but I need the type to be TIMESTAMP_MILLIS
for downstream consumption of the parquet file (queried by Presto), how can do that, please?
I'm using the pyarrow engine.
Upvotes: 5
Views: 7022
Reputation: 4459
This is configurable with pyarrow
, luckily pd.to_parquet
sends any unknown kwrgs to the parquet library.
looking at pyarrow docs for ParquetWriter we find
coerce_timestamps (str, default None) – Cast timestamps a particular resolution. The defaults depends on version. For version='1.0' (the default), nanoseconds will be cast to microseconds (‘us’), and seconds to milliseconds (‘ms’) by default. For version='2.0', the original resolution is preserved and no casting is done by default. The casting might result in loss of data, in which case allow_truncated_timestamps=True can be used to suppress the raised exception. Valid values: {None, ‘ms’, ‘us’}
so that means you can force the timestamp to milliseconds with
df.to_parquet(path, coerce_timestamps="ms")
or to microseconds
df.to_parquet(path, coerce_timestamps="us")
which makes this the code that you need
pd.DataFrame([pd.Timestamp('2020-01-01')], columns=['a']).to_parquet('/tmp/test.parquet', coerce_timestamps="ms")
Take note of this part of the docs too
allow_truncated_timestamps=True
can be used to suppress the raised exception.
Upvotes: 8