Reputation: 109
We write files from Spark and read them with Athena/Hive. We ran into a timestamp issue when using Hive.
scala> val someDF = Seq((8, "2018-06-06 11:42:43")).toDF("number", "word")
someDF: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> someDF.coalesce(1).write.mode("overwrite").option("delimiter", "\u0001").save("s3://test/")
Since no format is specified, this writes a Parquet file, and I created a table for it:
CREATE EXTERNAL TABLE `test5`(
`number` int,
`word` timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/'
A SELECT query failed with: HIVE_BAD_DATA: Field word's type BINARY in parquet is incompatible with type timestamp defined in table schema
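For reference, the schema of the written files can be checked by reading them back (assuming a Spark 2.x shell where the spark session is predefined):
scala> spark.read.parquet("s3://test/").printSchema()
// word is a string column in someDF, so Parquet stores it as BINARY (UTF8) rather than a timestamp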
The same thing works when testing with a plain CSV file.
scala> someDF.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite").option("delimiter", "\u0001").save("s3://test")
Table:
CREATE EXTERNAL TABLE `test7`(
`number` int,
`word` timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test/'
Can you please help with what is going wrong when we write it as a Parquet file?
Upvotes: 3
Views: 4207
Reputation: 3855
I think this is a well-known bug with Hive storing Parquet timestamps in a way that is incompatible with other tools. I faced a similar problem while using Impala to retrieve Hive data that I had written with Spark. I believe this was resolved in Spark 2.3.
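Independent of that bug, in the example above word is still a string column in the DataFrame, so one thing to try is casting it to a timestamp before writing, so the Parquet type matches the table definition (a minimal sketch, not verified against Athena; to_timestamp is available from Spark 2.2):
scala> import org.apache.spark.sql.functions.to_timestamp
scala> val tsDF = someDF.withColumn("word", to_timestamp($"word"))   // parses "yyyy-MM-dd HH:mm:ss" strings into timestamps
scala> tsDF.coalesce(1).write.mode("overwrite").parquet("s3://test/")
With the column written as an actual timestamp, the table's `word` timestamp definition should line up with the Parquet type.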
Upvotes: 1