mlenthusiast
mlenthusiast

Reputation: 1204

Editing Parquet files with Python causes errors to datetime format

I'm trying to edit a set of parquet files that were written in Spark. I'm loading them into pandas dataframes in Python, using Pyarrow. The goal is to append new rows with some data and output a new set of parquet files.

The issue I'm having is with the date/time column, which after loading the parquet files in python shows up as dtype('<M8[ns]'. After adding rows with new times, and outputting new parquet files, I'm using Amazon Athena to query the files. At this point, the values in this column show up in this format +50975-10-19 00:00:00.000. The correct format should be 2019-01-05 00:00:00.000

Here is a snippet of the python code that shows the workflow to load, edit and output these parquet files:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

df1 = pd.read_parquet('example.snappy.parquet')

df1 = df1.append({'visitor_id' : '1234' , 'visitor_attribute_1' : True} , ignore_index=True)

#entering a new time stamp for above user
df1['visit_dates'].values[131] = np.datetime64('2019-01-03T00:02:11')

#output new parquet file with new row
table1 = pa.Table.from_pandas(df1)
pq.write_table(table1, 'example.snappy.parquet', compression='SNAPPY')

Any help with this would be really appreciated!

Upvotes: 2

Views: 2983

Answers (1)

Uwe L. Korn
Uwe L. Korn

Reputation: 8816

The problem here is that Athena only understands the deprecated INT96 timestamps in the Arrow format and otherwise it will not look at the logical type of the column but simply see that it is reading a column with INT64 as the physical type. The logical type annotation of TIMESTAMP_NANOS or TIMESTAMP_MICROS is ignored.

There are two possibilities you could try with pyarrow.parquet:

  1. You can add flavor="spark" to pyarrow.parquet.write_table to enable the Spark compatability mode and write the deprecated timestamp type: https://arrow.apache.org/docs/python/parquet.html#using-with-spark
    1. You can force timestamps to be written with millisecond precision by adding coerce_timestamps="ms" to pyarrow.parquet.write_table: https://arrow.apache.org/docs/python/parquet.html#data-type-handling

I would recommend the latter option as it is more future proof.

Upvotes: 2

Related Questions