Reputation: 13
I have a column in hive table which is a address column and i want to split that into 2.
There are 2 scenarios to take care of.
ABC DEF123 AD
ABC DEF
123 AD
Another example can be like below.
MICHAEL POSTON875 HYDERABAD
In this case separation should be based on a number which is part of a string value, if a string is having number in it then both should separate
ABC DEFPO BOX 5232
ABC DEF
PO BOX 5232
Another example can be like below.
Hyderabad jhillsPO BOX 522002
In this case separation should be based on PO BOX
Both the data is in same column and i would like to update the data into target based on the string format..like a case statement not sure about the approach.
NOTE:- The string length can be varied as this is address column. Can some one please help me to provide a hive query and pyspark for the same?
Upvotes: 1
Views: 1302
Reputation: 38290
Using CASE expression you can check which template does it match and using regexp_replace insert some delimiter, then split by the same delimiter.
Demo (Hive):
with mytable as (
select stack(4,
'ABC DEF123 AD',
'MICHAEL POSTON875 HYDERABAD',
'ABC DEFPO BOX 5232',
'Hyderabad jhillsPO BOX 522002'
) as str
) --Use your table instead of this
select columns[0] as col1, columns[1] as col2
from
(
select split(case when (str rlike 'PO BOX') then regexp_replace(str, 'PO BOX','|||PO BOX')
when (str rlike '[a-zA-Z ]+\\d+') then regexp_replace(str,'([a-zA-Z ]+)(\\d+.*)', '$1|||$2')
--add more cases and ELSE part
end,'\\|{3}') columns
from mytable
)s
Result:
col1 col2
ABC DEF 123 AD
MICHAEL POSTON 875 HYDERABAD
ABC DEF PO BOX 5232
Hyderabad jhills PO BOX 522002
Upvotes: 1