brandata

Reputation: 81

I have two different strings that represent dates in two different Hive tables, and I want to use them in a join

So I have two external tables in Hive, in my Hadoop cluster.

One table has a (date STRING) column, with this format '2019-05-24 11:16:31.0'

and the other one has a (date STRING) column, with this format '23/May/2019:22:15:04'. Both are strings. I need to convert them to the same date format and use them to join these two tables.

How would you approach this problem, solving it entirely within Hive? Would it be possible? I'm quite the rookie in Hadoop, and I'm not fully aware of the possibilities of Hive.

PS: My Hive installation does not support the !hive --version command, so I'm not sure which version I'm working with. It's not my cluster and I'm not a root user.

Upvotes: 2

Views: 530

Answers (2)

leftjoin

Reputation: 38290

You need to convert both strings to the same format before joining.

Converting non-standard format '23/May/2019:22:15:04'

Use unix_timestamp(string date, string pattern) to convert a string in the given format to the number of seconds since 1970-01-01 00:00:00 (the Unix epoch). Then use from_unixtime() to convert that to the required format:

select from_unixtime(unix_timestamp('23/May/2019:22:15:04','dd/MMM/yyyy:HH:mm:ss'));

returns:

2019-05-23 22:15:04

If you want date only, specify date format 'yyyy-MM-dd' in the from_unixtime function:

select from_unixtime(unix_timestamp('23/May/2019:22:15:04','dd/MMM/yyyy:HH:mm:ss'),'yyyy-MM-dd');

Returns:

2019-05-23

The second table contains the more standard format '2019-05-24 11:16:31.0', so a simpler approach works.

You can use a simple substr, because the string already starts with the Hive date format 'yyyy-MM-dd':

select substr('2019-05-24 11:16:31.0',1,10);

Returns:

2019-05-24

Or if you want the same format as in the first example 'yyyy-MM-dd HH:mm:ss':

select substr('2019-05-24 11:16:31.0',1,19);

Returns:

2019-05-24 11:16:31

The date_format function (as of Hive 1.2.0) can also be used for the same purpose:

select date_format('2019-05-24 11:16:31.0','yyyy-MM-dd HH:mm:ss');

Returns:

2019-05-24 11:16:31

And the date portion only, using date_format (as of Hive 1.2.0):

select date_format('2019-05-24 11:16:31.0','yyyy-MM-dd');
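
Putting the two conversions together, a join on the date portion could look roughly like the sketch below. The table names table1 and table2 are placeholders (the question does not give them), and I'm assuming the column is literally named date in both tables, hence the backticks, since date is a reserved word in newer Hive versions. Swap the 'yyyy-MM-dd' pattern and the substr length if you need to match down to the second.

```sql
-- Sketch only: table1/table2 are hypothetical names.
-- table1.`date` looks like '2019-05-24 11:16:31.0'
-- table2.`date` looks like '23/May/2019:22:15:04'
select t1.*, t2.*
from table1 t1
join table2 t2
  on substr(t1.`date`, 1, 10)
   = from_unixtime(unix_timestamp(t2.`date`, 'dd/MMM/yyyy:HH:mm:ss'), 'yyyy-MM-dd');
```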

Upvotes: 3

Shawn.X

Reputation: 1353

You can use Hive's string functions and operators to bring the two different date formats into the same form, like below:

select regexp_replace(substring('2019-05-24 11:16:31.0',0,10),'-','') as date;
+-----------+
|   date    |
+-----------+
| 20190524  |
+-----------+

select concat(split(substring_index('23/May/2019:22:15:04',':',1),'/')[2],case when split(substring_index('23/May/2019:22:15:04',':',1),'/')[1]='May' then '05' end,split(substring_index('23/May/2019:22:15:04',':',1),'/')[0]) as date;
+-----------+
|   date    |
+-----------+
| 20190523  |
+-----------+
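
Note that the case expression above only handles 'May' and yields NULL for any other month. One way to cover all twelve months is a lookup with map(), sketched below. This assumes the month abbreviations in your data are the English three-letter ones; the alias dt is just an illustrative name. It is meant to produce the same yyyyMMdd string as the query above.

```sql
-- Sketch: look up the two-digit month number for any English month abbreviation.
-- The literal is the sample value from the question.
select concat(
    split(substring_index('23/May/2019:22:15:04', ':', 1), '/')[2],   -- year
    map('Jan','01','Feb','02','Mar','03','Apr','04','May','05','Jun','06',
        'Jul','07','Aug','08','Sep','09','Oct','10','Nov','11','Dec','12')
        [split(substring_index('23/May/2019:22:15:04', ':', 1), '/')[1]], -- month
    split(substring_index('23/May/2019:22:15:04', ':', 1), '/')[0]    -- day
) as dt;
```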

Then join the tables on the converted values. Below is a simple example to show the idea; you can refine the details.

select
    *
from
    table1 t1
join
    table2 t2 on regexp_replace(substring(t1.date,0,10),'-','') = concat(split(substring_index(t2.date,':',1),'/')[2],case when split(substring_index(t2.date,':',1),'/')[1]='May' then '05' end,split(substring_index(t2.date,':',1),'/')[0])

Does that make it clear?

Upvotes: 1
