Rob
Rob

Reputation: 162

How can i convert a timestamp to gmt format in hive

I have a timestamp column in my table, i'm deriving a column named dt_skey out of the timestamp column. For clear explanation lets assume that timestamp column name as time_column. This is how the time_column looks like 2017-02-05 03:33:50, dt_skey column looks like this 20170205033350 which is nothing but removing the symbols in between.

My question here : time_column is in est timezone, i want to convert it to the gmt format while i'm deriving dt_skey out of it. The reason why i want to do this is timestamp will be converted to the gmt format when we query it through impala where as dt_skey will not be converted as it is a int datatype. I'm doing the ingestion through hive where the timestamp and dt_skey column will be in sync when we query through hive. For the reporting purpose and users we use impala, so i want to make changes to the dt_skey column so that if user looks through impala both the columns should be in sync.

Below is the sql i'm using to derive dt_skey column out of a timestamp column:

cast(substr(regexp_replace(cast(time_column as string), '-',''),1,8) as int)as dt_skey

the above query will convert this 2017-02-02 07:32:51 into this 20170202.

Please help me to offset the dt_skey to GMT format. I also welcome solutions through spark.

Upvotes: 0

Views: 4180

Answers (4)

Rob
Rob

Reputation: 162

Thanks for all the solutions provided

All the answers here have the partial solution, using the answer resources I tried the below syntax and it worked.

cast(substr(regexp_replace(to_utc_timestamp(timestamp_column, 'EST') ,'-',''),1,8) as int) as dt_skey

To explain the above syntax, this is how my timestamp column looks like (yyyy-MM-dd HH:mm:ss) "2017-02-16 12:20:21"

After running the above syntax my output will be like '20170216' which is 'yyyyMMdd' regexp_replace will do the regex to display only yyyyMMdd. to_utc_timestamp(timestamp_column, 'EST') will convert the timestamp column to UTC timezone.

Upvotes: 0

Samson Scharfrichter
Samson Scharfrichter

Reputation: 9067

Assuming you want a Hive query, that's how I convert a Hive TimeStamp column (using the current system timezone) into an Impala TimeStamp (using UTC which is the same as GMT except GMT is deprecated).

CREATE TEMPORARY MACRO to_impala_timestamp(ts TIMESTAMP)
  CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(ts) +CAST(CAST(PRINTF('%tz', ts) AS FLOAT)*36.0 AS INT)) AS TIMESTAMP)
;
--## WARNING - do not use MACROs if your Hive version is below V1.3 (Apache, Horton)
--## or below V1.1-CDH5.7.3, V1.1-CDH5.8.3, V1.1-CDH5.9.0 (Cloudera)
--## cf. "HIVE-11432 Hive macro give same result for different arguments"

The PRINTF('%tz', ts) will extract the timezone, taking care of Daylight Saving Time dynamically -- assuming that the timestamps you are processing relate to the system timezone used by your Hadoop cluster. If it's a different TZ then you have to adapt the macro accordingly.

You can test it with this query:

CREATE TABLE test_tz
STORED AS Parquet
AS
SELECT CAST(ts AS STRING) AS initial_ts_as_string
  , printf('%1$tz %1$tZ', ts) AS tzone_offset_and_code
  , ts AS ts_for_hive
  , to_impala_timestamp(ts) AS ts_for_impala
FROM ...

Our cluster uses Central European time, and that's how the result shows in Hive...

+--------------------------+--------------------+-----------------------------+-------------------------+
|  initial_ts_as_string    | tz_offset_and_code | ts_for_hive                 | ts_for_impala           |
+--------------------------+--------------------+-----------------------------+-------------------------+
| 2015-09-13 11:32:30.627  | +0200 CEST         | 2015-09-13 11:32:30.627     | 2015-09-13 13:32:30.0   |
| 2015-12-10 12:27:01.282  | +0100 CET          | 2015-12-10 12:27:01.282     | 2015-12-10 13:27:01.0   |
| 2016-05-17 15:49:06.386  | +0200 CEST         | 2016-05-17 15:49:06.386     | 2016-05-17 17:49:06.0   |

...then in Impala...

+-------------------------+--------------------+-------------------------------+---------------------+
|  initial_ts_as_string   | tz_offset_and_code | ts_for_hive                   | ts_for_impala       |
+-------------------------+--------------------+-------------------------------+---------------------+
| 2015-09-13 11:32:30.627 | +0200 CEST         | 2015-09-13 09:32:30.627000000 | 2015-09-13 11:32:30 |
| 2015-12-10 12:27:01.282 | +0100 CET          | 2015-12-10 11:27:01.282000000 | 2015-12-10 12:27:01 |
| 2016-05-17 15:49:06.386 | +0200 CEST         | 2016-05-17 13:49:06.386000000 | 2016-05-17 15:49:06 |

Note that the milliseconds are lost while running the conversion; they could be restored with an additional trick but usually it's beyond the point.


Side note: to format a TimeStamp (or Date or Float or whatever) into a String, the good old Java PRINTF() function is way more practical than using the default format plus REGEXP_***() functions...

Upvotes: 0

Bernd Buffen
Bernd Buffen

Reputation: 15057

you only must add 0 to your field like:

SELECT datetimefield+0;

SELECT CONVERT_TZ('2017-02-02 07:32:51','EST','GMT');

if CONVERT_TZ return NULL you can install the timezone tables like:

mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root -p mysql

sample

mysql> SELECT CONVERT_TZ('2017-02-02 07:32:51','EST','GMT');
+-----------------------------------------------+
| CONVERT_TZ('2017-02-02 07:32:51','EST','GMT') |
+-----------------------------------------------+
| 2017-02-02 12:32:51                           |
+-----------------------------------------------+
1 row in set (0,00 sec)

mysql>
mysql> SELECT DATE(TIMESTAMP('2017-02-02 07:32:51'))+0;
+------------------------------------------+
| DATE(TIMESTAMP('2017-02-02 07:32:51'))+0 |
+------------------------------------------+
|                                 20170202 |
+------------------------------------------+
1 row in set (0,00 sec)

mysql> select id, mydate, date(mydate), date(mydate)+0 from df;
+----+---------------------+--------------+----------------+
| id | mydate              | date(mydate) | date(mydate)+0 |
+----+---------------------+--------------+----------------+
|  1 | 2017-02-05 03:33:50 | 2017-02-05   |       20170205 |
+----+---------------------+--------------+----------------+
1 row in set (0,00 sec)

mysql>

mysql> SELECT TIMESTAMP('2017-02-05 03:33:50')+0;
+------------------------------------+
| TIMESTAMP('2017-02-05 03:33:50')+0 |
+------------------------------------+
|                     20170205033350 |
+------------------------------------+
1 row in set (0,00 sec)

mysql>
mysql> select id, mydate, mydate+0 from df;
+----+---------------------+----------------+
| id | mydate              | mydate+0       |
+----+---------------------+----------------+
|  1 | 2017-02-05 03:33:50 | 20170205033350 |
+----+---------------------+----------------+
1 row in set (0,00 sec)

mysql>

Upvotes: 0

Zhang Tong
Zhang Tong

Reputation: 4719

In Spark:

rdd = spark.sparkContext.parallelize([('2017-02-05 03:33:50',)])
df = spark.createDataFrame(rdd, ['EST'])
df = df.withColumn('GMT', f.to_utc_timestamp(df['EST'], 'EST'))
res = df.withColumn('YouWanna', f.date_format(df['GMT'], 'yyyyMMddHHmmss'))
res.show(truncate=False)

+-------------------+---------------------+--------------+
|EST                |GMT                  |YouWanna      |
+-------------------+---------------------+--------------+
|2017-02-05 03:33:50|2017-02-05 08:33:50.0|20170205083350|
+-------------------+---------------------+--------------+

Or in hive:

select date_format(to_utc_timestamp('2017-02-05 03:33:50','EST'), 'yyyyMMddHHmmss') from dual

Do you mean this ?

Upvotes: 1

Related Questions