user1024962

Reputation: 35

TIMESTAMP column issue CDH5 vs CDH6 in parquet table

We recently upgraded our server from CDH 5 to CDH 6. When we insert data into TIMESTAMP columns of Parquet tables using Spark, the values now read back differently.

CDH 5:

HIVE:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 00:00:00 0'

CDH 6:

HIVE:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 04:00:00'

IMPALA:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Impala, the value is '2019-01-30 04:00:00'

Please let me know if there are any Spark properties we can use. My primary goal is for the Hive value in CDH 6 to match the one in CDH 5 and, if possible, for a select from Impala to return '2019-01-30 00:00:00' as well.

Upvotes: 0

Views: 365

Answers (1)

Chema

Reputation: 2838

To avoid data type issues between Spark and Hive, the convention Spark uses to write Parquet data is configurable.

This is controlled by the property spark.sql.parquet.writeLegacyFormat, which defaults to false. If set to true, Spark writes Parquet data using the same convention as Hive.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .appName("MyApp")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "200") // Change to a more reasonable default number of partitions for our data
    .config("spark.sql.parquet.writeLegacyFormat", true) // Write Parquet using the legacy, Hive-compatible convention
    .getOrCreate()
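
The property can also be set on a session that already exists, which is handy when you cannot change how the session is built. Below is a minimal sketch, assuming Spark 2.x or later is on the classpath. The object name LegacyParquetWrite and the output path /tmp/legacy_ts_demo are hypothetical and only there for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LegacyParquetWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("LegacyParquetWrite")
      .master("local[*]")
      .getOrCreate()

    // Set the property at runtime instead of in the builder
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

    import spark.implicits._

    // Build a one-row DataFrame with the date from the question cast to TIMESTAMP
    val df = Seq("2019-01-30")
      .toDF("d")
      .selectExpr("CAST(d AS TIMESTAMP) AS ts")

    // Hypothetical output path; point this at your table location instead
    val path = "/tmp/legacy_ts_demo"
    df.write.mode("overwrite").parquet(path)

    // Read the file back with Spark to see the value as Spark interprets it
    spark.read.parquet(path).show(false)

    spark.stop()
  }
}
```

Querying a table written this way from Hive and Impala, and comparing it against one written with the default setting, should show whether the property changes the value you get back on your cluster.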

Upvotes: 1
