Spark - HiveContext | Wrong Timestamps (substracts 4 hours)

Question

I was trying to do some ETL using SparkSQL (HiveContext) and I noticed some inconsistencies with the timestamps.

Let's say we have a table stored as parquet, with two columns: timestamp, event. If I query this table using the Hive editor from Hue, everything is ok.

SELECT * FROM mytable ORDER BY timestamp

Doing exactly the same query using hiveContext.sql(query) will give the same result as the Hue one but with (timestamp - 4 hours).

Another example, let's say we have this table:

Timestamp                   | event
--------------------------------------------------
'year-month-day 00:00:00'   | "evt0"
'year-month-day 01:00:00'   | "evt1"
'year-month-day 02:00:00'   | "evt2"
'year-month-day 03:00:00'   | "evt3"
'year-month-day 04:00:00'   | "evt4"
'year-month-day 05:00:00'   | "evt5"
'year-month-day 06:00:00'   | "evt6"
'year-month-day 07:00:00'   | "evt7"
'year-month-day 08:00:00'   | "evt8"

We run the following query using Spark and HiveContext:

SELECT * FROM mytable 
WHERE timestamp BETWEEN 'year-month-day 00:00:00' AND 'year-month-day 08:00:00'
ORDER BY timestamp

Result:

Timestamp                   | event
--------------------------------------------------
'year-month-day 00:00:00'   | "evt4"
'year-month-day 01:00:00'   | "evt5"
'year-month-day 02:00:00'   | "evt6"
'year-month-day 03:00:00'   | "evt7"
'year-month-day 04:00:00'   | "evt8"

Ramzy · Accepted Answer

When Hive stores a timestamp value into Parquet format, it converts local time into UTC time, and when it reads data out, it converts back to local time. Looks like your local time here is ET, hence the 4 hours difference. With Hive and Parque combination, you might need to modify the time change based on your required timezone. I am not exactly sure, how Hue returns the same time.

Spark - HiveContext | Wrong Timestamps (substracts 4 hours)

Answers (1)

Related Questions