Bill

Reputation: 363

parquet timestamp overflow with fastparquet/pyarrow

I have a parquet file that I am reading from S3 using fastparquet/pandas. The file has a column with the date 2022-10-06 00:00:00, but it is being read back as 1970-01-20 06:30:14.400. Please see the code, the errors, and a screenshot of the parquet file below. I am not sure why this is happening; 2022-09-01 00:00:00 seems to be fine. If I choose "pyarrow" as the engine, it fails with an exception.

pyarrow error:
    pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000

Please advise.

fastparquet error:

OverflowError: value too large
Exception ignored in: 'fastparquet.cencoding.time_shift'
OverflowError: value too large
OverflowError: value too large

code:

import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket="blah", Key="blah1")
df = pd.read_parquet(io.BytesIO(obj['Body'].read()), engine="fastparquet")

Upvotes: 0

Views: 575

Answers (1)

mdurant

Reputation: 28684

When pyarrow and fastparquet agree that the data isn't valid, I expect that must be the case. As a comment suggests, it sounds like there is confusion in the column's time units. You didn't say where the data came from, but at a wild guess, this may be because of the change in the parquet standard (roughly v1 -> v2), in which the former "converted" types were extended by new "logical" types. Newer parquet files tend to carry BOTH styles of type declaration, so there is a chance they are inconsistent.
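
One way to check is to look at how the column is actually declared in the file's metadata. The snippet below is only a sketch, reusing the placeholder bucket/key from the question; it prints both the Arrow-level schema and the raw Parquet schema, which is where an old-style "converted" type and a new-style "logical" type could disagree:

    import io

    import boto3
    import pyarrow.parquet as pq

    # Same placeholder bucket/key as in the question.
    obj = boto3.client('s3').get_object(Bucket="blah", Key="blah1")
    buf = io.BytesIO(obj['Body'].read())

    pf = pq.ParquetFile(buf)
    # Arrow-level schema: the unit the column will be read with, e.g. timestamp[us].
    print(pf.schema_arrow)
    # Raw Parquet schema: physical type plus the converted/logical annotations.
    print(pf.schema)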

In the fastparquet main branch (unreleased), there has been some work to consolidate the different ways of declaring time types. Maybe for your data it now does the right thing.
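
Until then, if you only need to see what is actually stored, you can avoid pandas' nanosecond cast entirely by reading with pyarrow and converting the timestamps to plain Python datetime objects. Again just a sketch with the question's placeholder bucket/key; the stored values may still look wrong, but the out-of-bounds cast no longer raises:

    import io

    import boto3
    import pyarrow.parquet as pq

    obj = boto3.client('s3').get_object(Bucket="blah", Key="blah1")
    table = pq.read_table(io.BytesIO(obj['Body'].read()))

    # timestamp_as_object=True keeps timestamps as datetime.datetime objects
    # instead of casting to nanosecond datetime64, so out-of-range values
    # don't trigger the ArrowInvalid error from the question.
    df = table.to_pandas(timestamp_as_object=True)
    print(df.head())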

Upvotes: 0
