ira

Reputation: 2644

handling large timestamps when converting from pyarrow.Table to pandas

I have a timestamp of 9999-12-31 23:59:59 stored in a parquet file as an int96. I read this parquet file using pyarrow.dataset and convert the resulting table into a pandas dataframe (using pyarrow.Table.to_pandas()). The conversion to a pandas dataframe turns my timestamp into 1816-03-30 05:56:07.066277376 (the pandas nanosecond timestamp presumably has a smaller range of valid dates) without any complaint about the datatype or anything else.
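
For what it's worth, the wrap-around itself seems easy to reproduce directly in pyarrow by forcing a nanosecond view of an out-of-range microsecond timestamp (a minimal sketch, assuming a recent pyarrow version):

import pyarrow as pa

# 9999-12-31 lies far outside the ~1677-2262 window that fits in 64-bit nanoseconds
arr_us = pa.array(['9999-12-31 23:59:59']).cast('timestamp[us]')

# Forcing a nanosecond view overflows int64 and silently wraps around
# (safe=False suppresses the overflow check), yielding a bogus 19th-century date
print(arr_us.cast('timestamp[ns]', safe=False))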

I then take this pandas dataframe, convert it back to a table and write it into a parquet dataset using pyarrow.dataset.write_dataset.

I am now left with different data than the data I started with, without seeing any warnings. (I found this out when I tried to create an Impala table from the parquet dataset and then couldn't query it properly.)

Is there a way to handle these large timestamps when converting from pyarrow table to pandas?

I've tried using the timestamp_as_object = True parameter as in Table.to_pandas(timestamp_as_object = True), but it doesn't seem to do anything.
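
For comparison, the flag does seem to do its job when the column is still typed as timestamp[us], which makes me think the value is already broken by the time to_pandas runs (a rough sketch, assuming a recent pyarrow version):

import pyarrow as pa

tbl = pa.Table.from_arrays([pa.array(['9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])

# With timestamp_as_object = True the values come back as plain Python datetimes,
# so the out-of-range date survives the conversion to pandas
df = tbl.to_pandas(timestamp_as_object = True)
print(df['my_timestamps'][0], type(df['my_timestamps'][0]))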

EDIT: providing a reproducible example. The problem is that pyarrow thinks the timestamps are nanoseconds when reading the file, even though they were stored as microseconds:

import pyarrow as pa
import pyarrow.dataset as ds

non_legacy_hdfs_filesystem = # connect to a filesystem here

# a table with microsecond-resolution timestamps far outside the nanosecond range
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])

# write it with the legacy int96 representation
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(use_deprecated_int96_timestamps = True, coerce_timestamps = 'us', allow_truncated_timestamps = True)
ds.write_dataset(data = my_table, base_dir = 'my_path', filesystem = non_legacy_hdfs_filesystem, format = parquet_format, file_options = write_options, partitioning = None)

# read it back: the column now comes out typed as nanoseconds
dataset = ds.dataset('my_path', filesystem = non_legacy_hdfs_filesystem)
dataset.to_table().column('my_timestamps')

Upvotes: 3

Views: 3718

Answers (1)

0x26res

Reputation: 13932

My understanding is that your data has been saved using use_deprecated_int96_timestamps=True.

import pyarrow as pa
import pyarrow.parquet as pq

# reproduce how the data was (presumably) written: legacy int96 timestamps
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
pq.write_table(my_table, '/tmp/table.pq', use_deprecated_int96_timestamps=True)

In this mode, timestamps are saved as a 96-bit integer with a (default/hardcoded) nanosecond resolution.

>>> pq.read_metadata('/tmp/table.pq').schema[0]
<ParquetColumnSchema>
  name: my_timestamps
  path: my_timestamps
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT96
  logical_type: None
  converted_type (legacy): NONE

In the latest versions of arrow/parquet, timestamps are 64-bit integers with a configurable resolution.
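
For comparison, writing the same table without the legacy flag stores the column as INT64 with an explicit microsecond logical type, which you can verify in the file metadata (a small sketch reusing my_table from above; the output path is just an example):

# Without use_deprecated_int96_timestamps the column is stored as INT64 with a
# microsecond Timestamp logical type, so the resolution round-trips intact
pq.write_table(my_table, '/tmp/table_int64.pq', coerce_timestamps='us')
print(pq.read_metadata('/tmp/table_int64.pq').schema[0])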

It should be possible to convert the legacy 96-bit nanosecond timestamps to 64-bit integers with microsecond resolution without loss of information. But unfortunately there's no option in the parquet reader that would let you do that (as far as I can tell).

You may have to raise an issue with parquet/arrow, but I think they are trying hard to deprecate 96-bit integers.

Upvotes: 2
