mgab

Reputation: 3994

Is there any cross-tool way to use the Interval type in the Parquet format?

Problem

One of the logical types defined in the Parquet file format specification is Interval, meant to represent time intervals (a.k.a. durations or time deltas). Here's part of what the documentation says about it:

INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date.
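
To make the layout concrete, here is a minimal sketch of packing such a 12-byte value in Python with the standard struct module (the values are just an example):

```python
import struct

# The spec's 12-byte INTERVAL payload: three little-endian unsigned 32-bit
# integers for months, days, and milliseconds.
months, days, millis = 0, 0, 30_000  # e.g. a 30-second interval
payload = struct.pack("<III", months, days, millis)
assert len(payload) == 12
```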

However, when writing and reading Parquet files across different tools, this type seems to cause quite a few problems.

Exploration

Interoperability

I tested the interoperability between DuckDB, Pandas (with either the FastParquet or the PyArrow engine), and PySpark. That is, with each tool I created a simple dataset containing a duration value, dumped it to a Parquet file, and tried to load it back with every tool. The code is here, and these are the resulting values and types (or errors) for each combination:
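
For reference, a minimal version of one such round trip might look like the following sketch (the actual test code is linked above; file names match the schema dumps further down):

```python
import pandas as pd

# Write a duration column with one engine...
df = pd.DataFrame({"Seconds": [30], "Duration": [pd.Timedelta(seconds=30)]})
df.to_parquet("/tmp/test-Pandas-Pyarrow.parquet", engine="pyarrow")

# ...and read it back with another to compare the resulting values and dtypes.
back = pd.read_parquet("/tmp/test-Pandas-Pyarrow.parquet", engine="fastparquet")
print(back["Duration"], back["Duration"].dtype)
```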

| write \ read | DuckDB | Pandas-FastParquet | Pandas-Pyarrow | PySpark |
| --- | --- | --- | --- | --- |
| DuckDB | 0 days 00:00:30 (timedelta64[ns]) | ValueError: could not broadcast input array from shape (3,) into shape (1,) | b'\x00\x00\x00\x00\x00\x00\x00\x000u\x00\x00' (<class 'bytes'>) | Py4JJavaError: An error occurred while calling o84.parquet. |
| Pandas-FastParquet | 00:00:30 (<class 'datetime.time'>) | 0 days 00:00:30 (timedelta64[ns]) | 00:00:30 (<class 'datetime.time'>) | Py4JJavaError: An error occurred while calling o84.parquet. |
| Pandas-Pyarrow | 30000000000 (int64) | 30000000000 (int64) | 0 days 00:00:30 (timedelta64[ns]) | 30000000000 (int64) |
| PySpark | 30000000 (int64) | 30000000 (int64) | 30000000 (int64) | 0 days 00:00:30 (timedelta64[ns]) |
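
Incidentally, the raw bytes that Pandas-Pyarrow returns for the DuckDB file are exactly the 12-byte layout from the spec, just left undecoded. A quick check in Python confirms it:

```python
import struct

raw = b'\x00\x00\x00\x00\x00\x00\x00\x000u\x00\x00'
months, days, millis = struct.unpack("<III", raw)
print(months, days, millis)  # 0 0 30000  -> i.e. 30 seconds
```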

So it seems that the Parquet files written by each tool can only be read correctly by that same tool, and there is zero interoperability. Note also that datetime.time represents a time of day rather than a time interval, which would be datetime.timedelta.

Schema

Checking the schemas with the Go tool hangxie/parquet-tools (the Python ktrueda/parquet-tools gives similar results), it seems that only DuckDB tries to follow the specification, as far as I understand:

❯ parquet-tools schema /tmp/test-DuckDB.parquet
{"Tag":"name=Duckdb_schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT32, convertedtype=INT_32, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=FIXED_LEN_BYTE_ARRAY, convertedtype=INTERVAL, length=12, repetitiontype=OPTIONAL"}
]}

❯ parquet-tools schema /tmp/test-Pandas-FastParquet.parquet
{"Tag":"name=Schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=INT64, convertedtype=TIME_MICROS, repetitiontype=OPTIONAL"}
]}

❯ parquet-tools schema /tmp/test-Pandas-Pyarrow.parquet
{"Tag":"name=Schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=INT64, repetitiontype=OPTIONAL"}]}

❯ parquet-tools schema /tmp/test-PySpark.parquet
{"Tag":"name=Spark_schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=INT64, repetitiontype=OPTIONAL"}
]}
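
For what it's worth, one can at least produce the spec's 12-byte layout by hand as a fixed-size binary column; as far as I can tell, though, there is no way to attach the INTERVAL annotation from PyArrow, so the column comes out as a plain FIXED_LEN_BYTE_ARRAY. A sketch of that idea:

```python
import struct
import pyarrow as pa
import pyarrow.parquet as pq

# Pack the spec's (months, days, millis) layout into 12 bytes by hand.
payload = struct.pack("<III", 0, 0, 30_000)

# Written as fixed-size binary, this becomes FIXED_LEN_BYTE_ARRAY(12) in the
# file, but (as far as I can tell) without the INTERVAL annotation.
table = pa.table({"Duration": pa.array([payload], type=pa.binary(12))})
pq.write_table(table, "/tmp/test-manual-interval.parquet")
```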

Question

Am I missing something? Is there any plan to support the Interval type in these tools, or a deliberate decision to omit it? Otherwise, I guess issues should be created for the different tools (except DuckDB) to request implementing that part of the specification; I didn't find any existing ones.

And on a more practical level, is there any workable solution other than representing intervals with integer or float types and specifying the unit in the column name (sketched below)?
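
To make that fallback concrete, this is the kind of convention I mean (file and column names are just illustrative):

```python
import pandas as pd

# Workaround: store the interval as an integer and encode the unit in the name.
df = pd.DataFrame({"duration_ms": [30_000]})
df.to_parquet("/tmp/durations.parquet")

# Every reader then reconstructs the interval from the agreed-upon unit.
back = pd.read_parquet("/tmp/durations.parquet")
back["duration"] = pd.to_timedelta(back["duration_ms"], unit="ms")
```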

Upvotes: 1

Views: 113

Answers (0)
