Reputation: 35
We recently upgraded our server from CDH 5 to CDH 6. When inserting data into TIMESTAMP columns of Parquet tables using Spark, the stored values differ between the two versions.
CDH 5:
Hive:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 00:00:00 0'
CDH 6:
Hive:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Hive, the value is '2019-01-30 04:00:00'
Impala:
If we insert 2019-01-30 into a TIMESTAMP column of a Parquet table and select the data from Impala, the value is '2019-01-30 04:00:00'
Please let me know if there are any Spark properties we can use. My primary goal is for the Hive value in CDH 6 to match the one in CDH 5 and, if possible, for a select from Impala to return '2019-01-30 00:00:00' as well.
Upvotes: 0
Views: 365
Reputation: 2838
To avoid data type issues between Spark and Hive, the convention Spark uses to write Parquet data is configurable.
This is determined by the property spark.sql.parquet.writeLegacyFormat. The default value is false. If set to true, Spark will use the same convention as Hive for writing the Parquet data.
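import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("MyApp")
  .master("local[*]")
  // Change to a more reasonable default number of partitions for our data
  .config("spark.sql.shuffle.partitions", "200")
  // Write Parquet data using the legacy convention shared with Hive
  .config("spark.sql.parquet.writeLegacyFormat", true)
  .getOrCreate()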
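If the session already exists (for example in spark-shell or a long-running job), the same property can also be set at runtime before the write. The following is only a rough sketch of where the setting goes: the object name, the helper enableLegacyParquet, the one-row DataFrame, and the output path /tmp/ts_legacy_sketch are all made up for illustration, and I have not verified that this setting alone changes the timestamp values seen from Hive or Impala.

```scala
import org.apache.spark.sql.SparkSession

object WriteLegacyFormatSketch {

  // Hypothetical helper: switches an existing session to the legacy Parquet convention.
  def enableLegacyParquet(spark: SparkSession): Unit =
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("WriteLegacyFormatSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Applies to Parquet writes issued after this call.
    enableLegacyParquet(spark)

    // Illustrative one-row DataFrame: the date from the question cast to TIMESTAMP.
    val df = Seq("2019-01-30").toDF("d").selectExpr("CAST(d AS TIMESTAMP) AS ts")

    // Assumed scratch location; replace with the real table path.
    df.write.mode("overwrite").parquet("/tmp/ts_legacy_sketch")

    spark.stop()
  }
}
```

After the write, selecting the column from Hive and Impala on CDH 6 shows whether the value now matches what CDH 5 returned.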
Upvotes: 1