ira

Reputation: 2644

handling large timestamps when converting from pyarrow.Table to pandas

I have a timestamp of 9999-12-31 23:59:59 stored in a parquet file as an int96. I read this parquet file using pyarrow.dataset and convert the resulting table into a pandas dataframe (using pyarrow.Table.to_pandas()). The conversion to a pandas dataframe turns my timestamp into 1816-03-30 05:56:07.066277376 (the pandas nanosecond timestamp presumably has a smaller range of valid dates) without any complaint about the datatype or anything else.
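
For what it's worth, the wrap-around itself seems easy to reproduce directly in pyarrow by forcing a nanosecond view of an out-of-range microsecond timestamp (a minimal sketch, assuming a recent pyarrow version):

import pyarrow as pa

# 9999-12-31 lies far outside the ~1677-2262 window that fits in 64-bit nanoseconds
arr_us = pa.array(['9999-12-31 23:59:59']).cast('timestamp[us]')

# Forcing a nanosecond view overflows int64 and silently wraps around
# (safe=False suppresses the overflow check), yielding a bogus 19th-century date
print(arr_us.cast('timestamp[ns]', safe=False))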

I then take this pandas dataframe, convert it back to a table and write it into a parquet dataset using pyarrow.dataset.write_dataset.

I am now left with different data than the data I started with, without seeing any warnings. (I found this out when I tried to create an Impala table from the parquet dataset and then couldn't query it properly.)

Is there a way to handle these large timestamps when converting from pyarrow table to pandas?

I've tried using the timestamp_as_object = True parameter as in Table.to_pandas(timestamp_as_object = True), but it doesn't seem to do anything.
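
For comparison, the flag does seem to do its job when the column is still typed as timestamp[us], which makes me think the value is already broken by the time to_pandas runs (a rough sketch, assuming a recent pyarrow version):

import pyarrow as pa

tbl = pa.Table.from_arrays([pa.array(['9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])

# With timestamp_as_object = True the values come back as plain Python datetimes,
# so the out-of-range date survives the conversion to pandas
df = tbl.to_pandas(timestamp_as_object = True)
print(df['my_timestamps'][0], type(df['my_timestamps'][0]))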

EDIT: providing a reproducible example. The problem is that pyarrow thinks the timestamps are nanoseconds when reading the file, even though they were stored as microseconds:

import pyarrow as pa
import pyarrow.dataset as ds

non_legacy_hdfs_filesystem = # connect to a filesystem here

# a table with microsecond-resolution timestamps far outside the nanosecond range
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])

# write it with the legacy int96 representation
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(use_deprecated_int96_timestamps = True, coerce_timestamps = 'us', allow_truncated_timestamps = True)
ds.write_dataset(data = my_table, base_dir = 'my_path', filesystem = non_legacy_hdfs_filesystem, format = parquet_format, file_options = write_options, partitioning = None)

# read it back: the column now comes out typed as nanoseconds
dataset = ds.dataset('my_path', filesystem = non_legacy_hdfs_filesystem)
dataset.to_table().column('my_timestamps')

Upvotes: 3

Views: 3718

Answers (1)

0x26res

Reputation: 13932

My understanding is that your data has been saved using use_deprecated_int96_timestamps=True.

import pyarrow as pa
import pyarrow.parquet as pq

# reproduce how the data was (presumably) written: legacy int96 timestamps
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
pq.write_table(my_table, '/tmp/table.pq', use_deprecated_int96_timestamps=True)

In this mode, timestamps are saved as a 96-bit integer with a (default/hardcoded) nanosecond resolution.

>>> pq.read_metadata('/tmp/table.pq').schema[0]
<ParquetColumnSchema>
  name: my_timestamps
  path: my_timestamps
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT96
  logical_type: None
  converted_type (legacy): NONE

In the latest versions of arrow/parquet, timestamps are 64-bit integers with a configurable resolution.
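
For comparison, writing the same table without the legacy flag stores the column as INT64 with an explicit microsecond logical type, which you can verify in the file metadata (a small sketch reusing my_table from above; the output path is just an example):

# Without use_deprecated_int96_timestamps the column is stored as INT64 with a
# microsecond Timestamp logical type, so the resolution round-trips intact
pq.write_table(my_table, '/tmp/table_int64.pq', coerce_timestamps='us')
print(pq.read_metadata('/tmp/table_int64.pq').schema[0])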

It should be possible to convert the legacy 96-bit nanosecond timestamps to 64-bit integers with microsecond resolution without loss of information. But unfortunately there's no option in the parquet reader that would let you do that (as far as I can tell).

You may have to raise an issue with parquet/arrow, but I think they are trying hard to deprecate 96-bit integers.

Upvotes: 2
