Reputation: 567
I have column where its values delimited by pipe operator in hive table, which is highly unstructured, and I would like to split this column that delimited by pipe and get corresponding columns which is more than many apparently. So, I need to first parse the string start with HP1
then locate string between 30-40th pipe. I tried multiple of doing this, but not able to parse it correctly. Can anyone suggest possible workaround to do this in hive sql?
sample data and my current attempt
Here I pasted top 3 rows of my original data.
SSH|^~\&|EnsembleMK9|ISC|DELTA HIE^1.2.3.411593.135778^DEL|TESTMD|202209190035||ACK^A08|1214451793^137099156|P|2.5.1MSA|AA|1214451793^137099156
SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484|||||||||||||||||||||||||202209190135
SSH|^~\&|PC^1.2.3.411593.135778^DEL|PC|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484||||||||||||||||||||PCCMM|||||202209190135
here is my first attempt:
based on this SO post, I tried following:
WITH x AS (
SELECT colm
, CharIndex('|', colm) As first_pipe
FROM mytbl
)
, y AS (
SELECT colm
, first_pipe
, CharIndex('|', colm, first_pipe + 1) As second_pipe
, SubString(colm, 0, first_pipe) As first_element
FROM x
)
, z AS (
SELECT colm
, first_pipe
, second_pipe
, first_element
, SubString(colm, first_pipe + 1, second_pipe - first_pipe - 1) As second_element
, SubString(colm, second_pipe + 1, LENGTH(colm) - second_pipe) As third_element
FROM y
)
SELECT *
FROM z
limit 10
but I have error because CharIndex
doesn't work for hive sql.
Here is my another attempt:
SELECT
colm,
INSTR(colm, 'HP1') AS HP1_position, -- Find the position of "HP1" in the string
SUBSTR(colm, INSTR(colm, 'HP1')) AS substring_after_HP1, -- Extract the substring after "HP1" until the end
SPLIT(SUBSTR(colm, INSTR(colm, 'HP1')), '\|') AS pipe_split, -- Split the substring by the pipe "|" delimiter
pipe_split[39] AS extracted_value -- Extract the 40th element from the split data
FROM
mytbl
this query also don't work, because pipe_split[39]
give me error.
Is there any workaround to pipe delimited column into new list of columns correctly in hive sql? can anyone suggest possible approach on that? Do I need to write dynamic sql instaed? if so, how? any ideas?
objective
I want to split pipe delimited column into new columns. How can I do this in hive sql?
Upvotes: 3
Views: 540
Reputation: 13
This query selects the column from the MyTable and then it splits the content of column (a string with | as a delimiter) into its constituent parts. It assigns these parts to new columns named first_element, second_element, and third_element. At the end, it limits the results to 10 rows.
SELECT column,
split(colm, '\\|')[0] AS first_element,
split(colm, '\\|')[1] AS second_element,
split(colm, '\\|')[2] AS third_element
FROM MyTable
LIMIT 10;
Upvotes: -2
Reputation: 672
You might want to try FUNCTION
, I have created string_split
wrapper to split data based on delimiter and return the values in separated column.
CREATE FUNCTION fn_Split50
(
@str varchar(max),
@delim char(1),
@columnCnt int = 125
)
RETURNS TABLE
AS
RETURN
(
SELECT *
FROM (SELECT
nn = (nn - 1) / @columnCnt + 1,
nnn = 'value' + cast(((nn - 1) % @columnCnt) + 1 as varchar(10)),
value
FROM (SELECT
nn = ROW_NUMBER() over (order by (select null)),
value
FROM string_split(@str, @delim) aa
) aa
where nn > 0
) bb
PIVOT
(
max(value)
FOR nnn IN (
value1, value2, value3, value4, value5, value6, value7, value8, value9, value10,
value11, value12, value13, value14, value15, value16, value17, value18, value19, value20,
value21, value22, value23, value24, value25, value26, value27, value28, value29, value30,
value31, value32, value33, value34, value35, value36, value37, value38, value39, value40,
value41, value42, value43, value44, value45, value46, value47, value48, value49, value50,
value51, value52, value53, value54, value55, value56, value57, value58, value59, value60,
value61, value62, value63, value64, value65, value66, value67, value68, value69, value70,
value71, value72, value73, value74, value75, value76, value77, value78, value79, value80,
value81, value82, value83, value84, value85, value86, value87, value88, value89, value90,
value91, value92, value93, value94, value95, value96, value97, value98, value99, value100,
value101, value102, value103, value104, value105, value106, value107, value108, value109, value110,
value111, value112, value113, value114, value115, value116, value117, value118, value119, value120,
value121, value122, value123, value124, value125
)
) AS PivotTable
);
And you can call it this way:
select * from fn_split50('SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM', '|', DEFAULT);
First parameter is the string value Second parameter is the delimiter Third parameter is how many columns you want to be pre-provided, if there are more values then the provided value will wrap it to the next row.
You may test it out at this SQLFiddle link.
Upvotes: 3