Munna
Munna

Reputation: 13

regular expression in hive for a specific string

I have a column in hive table which is a address column and i want to split that into 2.

There are 2 scenarios to take care of.

Example:

Scenario 1:

Another example can be like below.

MICHAEL POSTON875 HYDERABAD

In this case separation should be based on a number which is part of a string value, if a string is having number in it then both should separate

Scenario 2:

Another example can be like below.

Hyderabad jhillsPO BOX 522002

In this case separation should be based on PO BOX

Both the data is in same column and i would like to update the data into target based on the string format..like a case statement not sure about the approach.

NOTE:- The string length can be varied as this is address column. Can some one please help me to provide a hive query and pyspark for the same?

Upvotes: 1

Views: 1302

Answers (1)

leftjoin
leftjoin

Reputation: 38290

Using CASE expression you can check which template does it match and using regexp_replace insert some delimiter, then split by the same delimiter.

Demo (Hive):

with mytable as (
select stack(4,
'ABC DEF123 AD',
'MICHAEL POSTON875 HYDERABAD',
'ABC DEFPO BOX 5232',
'Hyderabad jhillsPO BOX 522002'
) as str
) --Use your table instead of this

select columns[0] as col1, columns[1] as col2
from
(
select split(case when (str rlike 'PO BOX') then regexp_replace(str, 'PO BOX','|||PO BOX')
                  when (str rlike '[a-zA-Z ]+\\d+') then regexp_replace(str,'([a-zA-Z ]+)(\\d+.*)', '$1|||$2')
                 --add more cases and ELSE part
              end,'\\|{3}') columns
  from mytable
)s

Result:

col1              col2  
ABC DEF           123 AD
MICHAEL POSTON    875 HYDERABAD
ABC DEF           PO BOX 5232
Hyderabad jhills  PO BOX 522002

Upvotes: 1

Related Questions