test acc

Reputation: 591

Timestamp in Hive read from a CSV created by Spark coming as null

So I have input data coming like this:

HEADER_A|HEADER_B|HEADER_C|HEADER_D
2018-09-28 22:12:09.0461760|first|last|123
2018-09-28 22:14:11.0123550|first|second|124

I then read this into a DataFrame and save it as a temporary view named temp_view. Everything is read in as a string.

I will then query this temp_view:

val df = sqlContext.sql("""
    SELECT
    from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') as ETL_RunDate,
    HEADER_A,
    HEADER_B,
    HEADER_C,
    HEADER_D
    FROM temp_view
""")

And then write that DataFrame to a file. The file will then look like:

2018-10-23 05:32:49|2018-09-28T22:12:09.046Z|first|last|123
2018-10-23 05:32:49|2018-09-28T22:14:11.123Z|first|second|124

If I then use this data as the data for a hive table with schema:

CREATE EXTERNAL TABLE `testdb.testtable`(
  `runts` timestamp COMMENT '',
  `header_a` timestamp COMMENT '',
  `header_b` string COMMENT '',
  `header_c` string COMMENT '',
  `header_d` string COMMENT '')

And query this table -

select * from testdb.testtable;

header_a comes back as null, even though the underlying data is there.

Why is this happening and how can I fix this?

Upvotes: 2

Views: 759

Answers (1)

DemetriKots

Reputation: 1224

The issue is that the timestamp format in your output file, yyyy-MM-dd'T'HH:mm:ss.SSSX, is different from the default Hive timestamp format, yyyy-mm-dd hh:mm:ss[.f...].

You have two options:

  • Write out the file using the default format. You would have to update your write to something like df.write.format("csv").option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSS") (note it is timestampFormat, not dateFormat, that applies to timestamp columns, and HH for a 24-hour clock)
  • Update the SERDEPROPERTIES of the table to read the format – ALTER TABLE testtable SET SERDEPROPERTIES ("timestamp.formats"= "yyyy-MM-dd'T'HH:mm:ss.SSSX")
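The mismatch behind the NULLs can be reproduced without Hive or Spark: the value written to the file is in ISO-instant form with a 'T' separator and a 'Z' suffix, which the default Hive layout cannot parse, so the SerDe yields NULL for that column. A minimal plain-Scala sketch of the check (the object and method names are illustrative, and the fractional part is shortened to three digits to match the sample output):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object HiveTimestampCheck {
  // Approximation of Hive's default timestamp layout:
  // no 'T' separator, no zone suffix
  private val hiveLayout = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // True if the value parses under the default layout;
  // values that fail are the ones Hive turns into NULL
  def parses(value: String): Boolean =
    Try(LocalDateTime.parse(value, hiveLayout)).isSuccess
}
```

What Spark wrote fails the check (HiveTimestampCheck.parses("2018-09-28T22:12:09.046Z") is false), while the default layout passes (HiveTimestampCheck.parses("2018-09-28 22:12:09.046") is true), which is exactly why either rewriting the file or widening timestamp.formats fixes the NULLs.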

Upvotes: 1
