user2201536

Reputation: 109

Athena/Hive timestamp in Parquet files written by Spark

We write files from Spark and read them with Athena/Hive. We hit an issue with timestamps when using Hive.

scala> val someDF = Seq((8, "2018-06-06 11:42:43")).toDF("number", "word")
someDF: org.apache.spark.sql.DataFrame = [number: int, word: string]

// no format is specified here, so Spark writes Parquet by default
// and silently ignores the "delimiter" option
scala> someDF.coalesce(1).write.mode("overwrite").option("delimiter", "\u0001").save("s3://test/")

This creates a Parquet file, and I created a table on top of it:

CREATE EXTERNAL TABLE `test5`(
  `number` int, 
  `word` timestamp)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://test/'

The SELECT query failed with: HIVE_BAD_DATA: Field word's type BINARY in parquet is incompatible with type timestamp defined in table schema
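One quick way to confirm what Spark actually wrote (a check, assuming the same s3://test/ path is readable) is to read the file back and print its schema; word comes back as a string, which Parquet stores as BINARY:

scala> spark.read.parquet("s3://test/").printSchema()

That string/BINARY field is what clashes with the timestamp column declared in the table.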

The same data works when written as a plain CSV file:

// com.databricks.spark.csv is the pre-Spark-2.0 external package; the built-in "csv" format also works
scala> someDF.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite").option("delimiter", "\u0001").save("s3://test")

Table:
CREATE EXTERNAL TABLE `test7`(
  `number` int, 
  `word` timestamp)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://test/'

Can you please help with what is going wrong when we write the data as a Parquet file?

Upvotes: 3

Views: 4207

Answers (1)

Michail N

Reputation: 3855

I think this is a well-known bug with how Hive stores Parquet timestamps, in a way that is incompatible with other tools. I faced a similar problem while using Impala to retrieve Hive data that I had written with Spark. I believe this was resolved in Spark 2.3.
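A minimal sketch of a workaround, assuming Spark 2.3+ (the spark.sql.parquet.outputTimestampType setting was added in that release) and assuming the column is meant to be a real timestamp rather than a string:

scala> import org.apache.spark.sql.functions.to_timestamp

// cast the string column to an actual timestamp before writing
scala> val withTs = someDF.withColumn("word", to_timestamp($"word"))

// INT96 is the representation Hive and Impala historically expect for Parquet timestamps
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")

scala> withTs.coalesce(1).write.mode("overwrite").parquet("s3://test/")

With a genuine timestamp column the Parquet field is no longer BINARY, so it should line up with the timestamp type declared in the table.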

Upvotes: 1
