zyxue
zyxue

Reputation: 8888

How to control timestamp schema in pandas.to_parquet

I noticed that column type for timestamp in the parquet file generated by pandas.to_parquet can be different depending on the version of pandas, e.g.

In [1]: pd.__version__                                                                                                 
Out[1]: '1.0.5'


In [2]: pd.DataFrame([pd.Timestamp('2020-01-01')], columns=['a']).to_parquet('/tmp/test.parquet')                      

In [3]: !parquet-tools schema /tmp/test.parquet                                                                        
message schema {
  optional int64 a (TIMESTAMP_MILLIS);
}

In [4]: !parquet-tools head /tmp/test.parquet                                                                          
a = 1577836800000
In [1]: pd.__version__
Out[1]: '1.1.2'

In [2]: pd.DataFrame([pd.Timestamp('2020-01-01')], columns=['a']).to_parquet('/tmp/test.parquet')

In [3]: !parquet-tools schema /tmp/test.parquet
message schema {
  optional int64 a (TIMESTAMP_MICROS);
}

In [4]: !parquet-tools head /tmp/test.parquet                                                                          
a = 1577836800000000

As seen above, pandas-1.0.5 converts the type of timestamp to be TIMESTAMP_MILLIS while pandas-1.1.2 converts it to be TIMESTAMP_MICROS.

I'm using pandas-1.1.2, but I need the type to be TIMESTAMP_MILLIS for downstream consumption of the parquet file (queried by Presto), how can do that, please?

I'm using the pyarrow engine.

Upvotes: 5

Views: 7022

Answers (1)

Jimmar
Jimmar

Reputation: 4459

This is configurable with pyarrow, luckily pd.to_parquet sends any unknown kwrgs to the parquet library.
looking at pyarrow docs for ParquetWriter we find

coerce_timestamps (str, default None) – Cast timestamps a particular resolution. The defaults depends on version. For version='1.0' (the default), nanoseconds will be cast to microseconds (‘us’), and seconds to milliseconds (‘ms’) by default. For version='2.0', the original resolution is preserved and no casting is done by default. The casting might result in loss of data, in which case allow_truncated_timestamps=True can be used to suppress the raised exception. Valid values: {None, ‘ms’, ‘us’}

so that means you can force the timestamp to milliseconds with

df.to_parquet(path, coerce_timestamps="ms")

or to microseconds

df.to_parquet(path, coerce_timestamps="us")

which makes this the code that you need

pd.DataFrame([pd.Timestamp('2020-01-01')], columns=['a']).to_parquet('/tmp/test.parquet', coerce_timestamps="ms")

Take note of this part of the docs too

allow_truncated_timestamps=True can be used to suppress the raised exception.

Upvotes: 8

Related Questions