Robert Kossendey

Reputation: 6998

Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0

When switching from Glue 2.0 to 3.0, which also means switching from Spark 2.4 to 3.1.1, my jobs started to fail when processing timestamps prior to 1900, with this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, 
as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar.
See more details in SPARK-31404.
You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. 
Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.

I tried everything to set the int96RebaseModeInRead config in Glue, and even contacted AWS Support, but it seems that Glue currently overwrites that flag and you cannot set it yourself.

If anyone knows a workaround, that would be great. Otherwise I will continue with Glue 2.0 and wait for the Glue dev team to fix this.
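
For reference, this is the kind of in-script attempt that Glue 3.0 appears to override (a minimal PySpark sketch; the config key comes from the error message above, the bucket path and everything else are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Attempted fix: rebase INT96 timestamps when reading legacy parquet.
    # In Glue 3.0 this setting seems to get overwritten by Glue itself.
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/legacy-timestamps/")  # illustrative path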

Upvotes: 25

Views: 51199

Answers (5)

ABHISHEK SINGH

Reputation: 1

In Scala Spark, the changes in the script are as follows:

import org.apache.spark.{SparkConf, SparkContext}

// Set the rebase modes on the SparkConf before the context is created.
val conf = new SparkConf()
  .set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
  .set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
  .set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
  .set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

val sc = SparkContext.getOrCreate(conf)

Upvotes: 0

avikm

Reputation: 802

From Spark version 3.2, the spark.sql.legacy.parquet.* configs are deprecated:

2023-04-03 21:27:13.362 thread=main, log_level=WARN , [o.a.s.s.internal.SQLConf], The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.

So you need to use the following Spark configs instead:

conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")

Upvotes: 16

Viktor

Reputation: 324

In some cases, the job can fail even with the required configs properly set. For example, it fails when reading a DataFrame and then calling the .rdd method. The issue is that when reading data via the Glue API with a DynamicFrame as output, Glue sometimes uses the RDD API internally, and that code fails because the required flags are not propagated into the execution stage.

In order to fix this issue, we additionally wrap this code with SQLExecution.withSQLConfPropagated, passing the active SparkSession:

import org.apache.spark.sql.execution.SQLExecution

// `spark` is the active SparkSession whose SQL confs should be carried
// over to the execution stage.
SQLExecution.withSQLConfPropagated(spark) {
  glueContext.getSource(...).getDynamicFrame
}

Then the required SQL configs propagate properly to the execution stage and are taken into account.

Upvotes: 1

Robert Kossendey

Reputation: 6998

I made it work by setting the --conf job parameter to spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED. (Glue only allows a single --conf job parameter, so the remaining key=value pairs are chained inside its value.)
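
For illustration, the same parameter could be supplied when creating the job programmatically (a hedged boto3 sketch; the job name, role, and script location are placeholders, only the --conf value mirrors the workaround above):

import boto3

glue = boto3.client("glue")

# Value of the single "--conf" job parameter; the remaining
# "--conf key=value" pairs are chained inside it.
rebase_conf = (
    "spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED"
    " --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED"
    " --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED"
    " --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED"
)

glue.create_job(
    Name="my-glue-job",        # placeholder
    Role="my-glue-job-role",   # placeholder
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/script.py", "PythonVersion": "3"},
    GlueVersion="3.0",
    DefaultArguments={"--conf": rebase_conf},
)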

This is a workaround though; the Glue dev team is working on a fix, although there is no ETA.

Also, this is still very buggy. You cannot call .show() on a DynamicFrame, for example; you need to call it on a DataFrame. All of my jobs that call data_frame.rdd.isEmpty() failed as well, don't ask me why.
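
If the data_frame.rdd.isEmpty() call is the blocker, one way around it (a sketch using only the plain DataFrame API, not a confirmed fix from the Glue team) is to check for emptiness without touching the RDD:

# Avoids data_frame.rdd.isEmpty(): fetch at most one row via the DataFrame API.
is_empty = len(data_frame.head(1)) == 0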

Update 24.11.2021: I reached out to the Glue dev team and they told me that this is the intended way of fixing it. There is a workaround that can be done inside the script though:

from pyspark import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Get the current SparkConf as set by Glue
conf = sc.getConf()
# Add the additional Spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart the Spark context so the new conf takes effect
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# Create the Glue context with the restarted sc
glueContext = GlueContext(sc)

Upvotes: 31

Mark K.

Reputation: 2099

The issue is addressed in the official Glue Developer Guide: see "Migrating from AWS Glue 2.0 to AWS Glue 3.0", last bullet item.

Upvotes: 7
