Reputation: 1204
I'm trying to edit a set of parquet
files that were written in Spark
. I'm loading them into pandas
dataframes in Python, using Pyarrow
. The goal is to append new rows with some data and output a new set of parquet files.
The issue I'm having is with the date/time column, which after loading the parquet files in python shows up as dtype('<M8[ns]'
. After adding rows with new times, and outputting new parquet files, I'm using Amazon Athena to query the files. At this point, the values in this column show up in this format +50975-10-19 00:00:00.000
. The correct format should be 2019-01-05 00:00:00.000
Here is a snippet of the python code that shows the workflow to load, edit and output these parquet files:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
df1 = pd.read_parquet('example.snappy.parquet')
df1 = df1.append({'visitor_id' : '1234' , 'visitor_attribute_1' : True} , ignore_index=True)
#entering a new time stamp for above user
df1['visit_dates'].values[131] = np.datetime64('2019-01-03T00:02:11')
#output new parquet file with new row
table1 = pa.Table.from_pandas(df1)
pq.write_table(table1, 'example.snappy.parquet', compression='SNAPPY')
Any help with this would be really appreciated!
Upvotes: 2
Views: 2983
Reputation: 8816
The problem here is that Athena only understands the deprecated INT96 timestamps in the Arrow format and otherwise it will not look at the logical type of the column but simply see that it is reading a column with INT64 as the physical type. The logical type annotation of TIMESTAMP_NANOS or TIMESTAMP_MICROS is ignored.
There are two possibilities you could try with pyarrow.parquet
:
flavor="spark"
to pyarrow.parquet.write_table
to enable the Spark compatability mode and write the deprecated timestamp type: https://arrow.apache.org/docs/python/parquet.html#using-with-spark
coerce_timestamps="ms"
to pyarrow.parquet.write_table
: https://arrow.apache.org/docs/python/parquet.html#data-type-handlingI would recommend the latter option as it is more future proof.
Upvotes: 2