kim
kim

Reputation: 567

how to split pipe delimited column into new columns in hive sql?

I have column where its values delimited by pipe operator in hive table, which is highly unstructured, and I would like to split this column that delimited by pipe and get corresponding columns which is more than many apparently. So, I need to first parse the string start with HP1 then locate string between 30-40th pipe. I tried multiple of doing this, but not able to parse it correctly. Can anyone suggest possible workaround to do this in hive sql?

sample data and my current attempt

Here I pasted top 3 rows of my original data.

SSH|^~\&|EnsembleMK9|ISC|DELTA HIE^1.2.3.411593.135778^DEL|TESTMD|202209190035||ACK^A08|1214451793^137099156|P|2.5.1MSA|AA|1214451793^137099156
SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484|||||||||||||||||||||||||202209190135
SSH|^~\&|PC^1.2.3.411593.135778^DEL|PC|LEAG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|HIVEDBAKG^2.16.840.1.113883.3.2966.100.0.0.255.140^DEL|20220919053530.832||AKG^A08^AKG_A01|C1016708390|P|2.5.1|||NE||||||@SSH.3^NXG^2.16.840.1.113883.3.2966.1000.1005.152.4.264^[email protected]^PCCMM^2.16.840.1.113883.3.2966.100.0.0.255.124^DELEVN||20220919013529PID|1||5624118^^^PCCMM&2.16.840.1.113883.3.8932.101.2&DEL^MR||Anderson^Donald||19440711|M||2106-3^White|30 Pine Woods Road^^Hyde Park^NY^12538^USA||^PRS^PH^^1^845^2292777||^English|M||||||^Non-Hispanic||||||||N|||||||||PD1||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^^&&&&&&&&&PCP||||||||N|20220310ROL||AD|RCP|2022062001^^^^^^^^2.16.840.1.113883.4.6^^^^KOR^^^^^^^^HP|1|O|PKCL||||AA48^Diminico^Carlo^F^^^^^&2.16.840.1.113883.3.8932.101.4&DEL^^^^^^^^^^^^||||||||||||73828528~86654484||||||||||||||||||||PCCMM|||||202209190135

here is my first attempt:

based on this SO post, I tried following:

WITH x AS (
  SELECT colm
       , CharIndex('|', colm) As first_pipe
  FROM   mytbl
)
, y AS (
  SELECT colm
       , first_pipe
       , CharIndex('|', colm, first_pipe + 1) As second_pipe
       , SubString(colm, 0, first_pipe) As first_element
  FROM   x
)
, z AS (
  SELECT colm
       , first_pipe
       , second_pipe
       , first_element
       , SubString(colm, first_pipe  + 1, second_pipe - first_pipe - 1) As second_element
       , SubString(colm, second_pipe + 1, LENGTH(colm) - second_pipe) As third_element
  FROM   y
)
SELECT *
FROM z
limit 10

but I have error because CharIndex doesn't work for hive sql.

Here is my another attempt:

SELECT
    colm,
    INSTR(colm, 'HP1') AS HP1_position,   -- Find the position of "HP1" in the string
    SUBSTR(colm, INSTR(colm, 'HP1')) AS substring_after_HP1,    -- Extract the substring after "HP1" until the end
    SPLIT(SUBSTR(colm, INSTR(colm, 'HP1')), '\|') AS pipe_split,       -- Split the substring by the pipe "|" delimiter
    pipe_split[39] AS extracted_value       -- Extract the 40th element from the split data
FROM
    mytbl

this query also don't work, because pipe_split[39] give me error.

Is there any workaround to pipe delimited column into new list of columns correctly in hive sql? can anyone suggest possible approach on that? Do I need to write dynamic sql instaed? if so, how? any ideas?

objective

I want to split pipe delimited column into new columns. How can I do this in hive sql?

Upvotes: 3

Views: 540

Answers (2)

Learn Freak
Learn Freak

Reputation: 13

This query selects the column from the MyTable and then it splits the content of column (a string with | as a delimiter) into its constituent parts. It assigns these parts to new columns named first_element, second_element, and third_element. At the end, it limits the results to 10 rows.

SELECT column,
           split(colm, '\\|')[0] AS first_element,
           split(colm, '\\|')[1] AS second_element,
           split(colm, '\\|')[2] AS third_element
    FROM MyTable
    LIMIT 10;

Upvotes: -2

Raymond Natio
Raymond Natio

Reputation: 672

You might want to try FUNCTION, I have created string_split wrapper to split data based on delimiter and return the values in separated column.

CREATE FUNCTION fn_Split50
(   
    @str varchar(max),
    @delim char(1), 
    @columnCnt int = 125
)
RETURNS TABLE 
AS
RETURN 
(
    SELECT * 
    FROM (SELECT 
            nn = (nn - 1) / @columnCnt + 1, 
            nnn = 'value' + cast(((nn - 1) % @columnCnt) + 1 as varchar(10)), 
            value
        FROM (SELECT 
            nn = ROW_NUMBER() over (order by (select null)),
            value
            FROM string_split(@str, @delim) aa
            ) aa
        where nn > 0
    ) bb
    PIVOT  
    (  
    max(value) 
    FOR nnn IN (    
        value1, value2, value3, value4, value5, value6, value7, value8, value9, value10,
        value11, value12, value13, value14, value15, value16, value17, value18, value19, value20,
        value21, value22, value23, value24, value25, value26, value27, value28, value29, value30,
        value31, value32, value33, value34, value35, value36, value37, value38, value39, value40,
        value41, value42, value43, value44, value45, value46, value47, value48, value49, value50,
        value51, value52, value53, value54, value55, value56, value57, value58, value59, value60,
        value61, value62, value63, value64, value65, value66, value67, value68, value69, value70,
        value71, value72, value73, value74, value75, value76, value77, value78, value79, value80,
        value81, value82, value83, value84, value85, value86, value87, value88, value89, value90,
        value91, value92, value93, value94, value95, value96, value97, value98, value99, value100,
        value101, value102, value103, value104, value105, value106, value107, value108, value109, value110,
        value111, value112, value113, value114, value115, value116, value117, value118, value119, value120,
        value121, value122, value123, value124, value125
     )  
    ) AS PivotTable 
);

And you can call it this way:

select * from fn_split50('SSH|^~\&|DELTA HIE^1.2.3.411593.135778^DEL|PCCMM', '|', DEFAULT);

First parameter is the string value Second parameter is the delimiter Third parameter is how many columns you want to be pre-provided, if there are more values then the provided value will wrap it to the next row.

You may test it out at this SQLFiddle link.

Upvotes: 3

Related Questions