One of the logical types defined in the Parquet file format specification is INTERVAL, meant to represent time intervals (a.k.a. durations, time deltas, and so on). Here's part of what the documentation says about it:
> INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date.
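For concreteness, here is a minimal sketch of how a value in that layout can be built and decoded in Python, based purely on the spec text above (the helper names are my own):

```python
import struct

def encode_interval(months: int, days: int, millis: int) -> bytes:
    # Three little-endian unsigned 32-bit integers, 12 bytes total,
    # exactly the FIXED_LEN_BYTE_ARRAY(12) layout the spec describes.
    return struct.pack("<III", months, days, millis)

def decode_interval(raw: bytes) -> tuple[int, int, int]:
    # Inverse of encode_interval: returns (months, days, millis).
    return struct.unpack("<III", raw)

# A 30-second interval: 0 months, 0 days, 30000 milliseconds.
assert encode_interval(0, 0, 30_000) == b"\x00\x00\x00\x00\x00\x00\x00\x000u\x00\x00"
```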
However, when trying to write and read Parquet files with different tools, this type seems to cause quite a few problems.
I tested the interoperability between DuckDB, Pandas (with either the FastParquet or the PyArrow engine), and PySpark. That is, with each tool I created a simple dataset containing a duration value, dumped it to a Parquet file, and then tried to load that file with every tool. The code is here; the table below lists the resulting value and type (or the error) for each combination, and a simplified sketch of the round trip follows the table.
| writer ↓ \ reader → | DuckDB | Pandas-FastParquet | Pandas-Pyarrow | PySpark |
| --- | --- | --- | --- | --- |
| DuckDB | `0 days 00:00:30` `timedelta64[ns]` | `ValueError: could not broadcast input array from shape (3,) into shape (1,)` | `b'\x00\x00\x00\x00\x00\x00\x00\x000u\x00\x00'` `<class 'bytes'>` | `Py4JJavaError: An error occurred while calling o84.parquet.` |
| Pandas-FastParquet | `00:00:30` `<class 'datetime.time'>` | `0 days 00:00:30` `timedelta64[ns]` | `00:00:30` `<class 'datetime.time'>` | `Py4JJavaError: An error occurred while calling o84.parquet.` |
| Pandas-Pyarrow | `30000000000` `int64` | `30000000000` `int64` | `0 days 00:00:30` `timedelta64[ns]` | `30000000000` `int64` |
| PySpark | `30000000` `int64` | `30000000` `int64` | `30000000` `int64` | `0 days 00:00:30` `timedelta64[ns]` |
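For reference, each cell in the table comes from a round trip along these lines (a simplified sketch of the linked code, shown here for writing with Pandas-PyArrow and reading with Pandas-FastParquet; the file path is illustrative):

```python
import pandas as pd

# A minimal dataset with a 30-second duration value.
df = pd.DataFrame({
    "Seconds": [30],
    "Duration": pd.to_timedelta([30], unit="s"),
})

# Write with one tool/engine...
df.to_parquet("/tmp/test-Pandas-Pyarrow.parquet", engine="pyarrow")

# ...then read back with another and inspect the resulting value and dtype.
back = pd.read_parquet("/tmp/test-Pandas-Pyarrow.parquet", engine="fastparquet")
print(back["Duration"].iloc[0], back["Duration"].dtype)  # per the table: 30000000000, int64
```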
So it seems that the Parquet files written by each tool can only be read back correctly by the same tool; there is zero interoperability. Note also that `datetime.time` represents a time of day rather than a time interval, which would be `datetime.timedelta`.
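Interestingly, the raw bytes that PyArrow hands back when reading the DuckDB file do decode cleanly under the spec's 12-byte layout (reusing the struct format from the sketch above):

```python
import struct

raw = b"\x00\x00\x00\x00\x00\x00\x00\x000u\x00\x00"
print(struct.unpack("<III", raw))  # (0, 0, 30000): 0 months, 0 days, 30000 ms = 30 s
```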
Checking the schemas with the Go tool hangxie/parquet-tools (the Python ktrueda/parquet-tools gives similar results), it seems that only DuckDB tries to follow the specification, as far as I understand:
❯ parquet-tools schema /tmp/test-DuckDB.parquet
{"Tag":"name=Duckdb_schema",
"Fields":[
{"Tag":"name=Seconds, type=INT32, convertedtype=INT_32, repetitiontype=OPTIONAL"},
{"Tag":"name=Duration, type=FIXED_LEN_BYTE_ARRAY, convertedtype=INTERVAL, length=12, repetitiontype=OPTIONAL"}
]}
❯ parquet-tools schema /tmp/test-Pandas-FastParquet.parquet
{"Tag":"name=Schema",
"Fields":[
{"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
{"Tag":"name=Duration, type=INT64, convertedtype=TIME_MICROS, repetitiontype=OPTIONAL"}
]}
❯ parquet-tools schema /tmp/test-Pandas-Pyarrow.parquet
{"Tag":"name=Schema",
"Fields":[
{"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
{"Tag":"name=Duration, type=INT64, repetitiontype=OPTIONAL"}]}
❯ parquet-tools schema /tmp/test-PySpark.parquet
{"Tag":"name=Spark_schema",
"Fields":[
{"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
{"Tag":"name=Duration, type=INT64, repetitiontype=OPTIONAL"}
]}
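The same files can also be inspected from Python with PyArrow; judging by the bytes value in the table above, it surfaces the INTERVAL column as a plain 12-byte binary rather than as a duration:

```python
import pyarrow.parquet as pq

# Print the Arrow-level schema PyArrow derives from the Parquet metadata.
print(pq.read_schema("/tmp/test-DuckDB.parquet"))
```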
Am I missing something? Is there any plan to support the INTERVAL type in these tools, or was there a decision to omit it? Otherwise, I guess issues should be opened against the different tools (except DuckDB) requesting that this part of the specification be implemented; I couldn't find any existing ones.
And on a more practical level, is there any solution other than representing intervals with integer or float types and specifying the unit in the column name?
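To make that concrete, the workaround I have in mind looks roughly like this (the `Duration_ns` column name is just an illustration):

```python
import pandas as pd

df = pd.DataFrame({"Duration": pd.to_timedelta([30], unit="s")})

# Write: store the interval as plain int64 nanoseconds, with the unit
# encoded in the column name instead of in the Parquet type.
pd.DataFrame({"Duration_ns": df["Duration"].astype("int64")}).to_parquet(
    "/tmp/test-workaround.parquet"
)

# Read: reinterpret the integers as timedeltas on the way back in.
back = pd.read_parquet("/tmp/test-workaround.parquet")
restored = pd.to_timedelta(back["Duration_ns"], unit="ns")
print(restored.iloc[0])  # 0 days 00:00:30
```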