ClemFr

Reputation: 75

Hadoop/Hive query to split one column into several ones

I am using Hive with two tables that look (more or less) like this:

-TABLE1 defined as [(Variables : string),(Value1 : int),(Value2 : int)]

with field "Variables" looking like "x0,x1,x2,x3,...,xn"

-TABLE2 defined as [(Value1Sum : int),(Value2Sum : int),(X1 : string),(X4 : string),(X17 : string)]

I "convert" table1 to table2 with this query:

INSERT OVERWRITE TABLE table2
    SELECT sum(v1), sum(v2), x1, x4, x17
        FROM (SELECT
                Value1 as v1,
                Value2 as v2,
                split(Variables, ",")[1] as x1,
                split(Variables, ",")[4] as x4,
                split(Variables, ",")[17] as x17 
              FROM Table1) tmp
        GROUP BY tmp.x1, tmp.x4, tmp.x17

Does Hive call the split function 3 times?

Is there a way to make it more elegant?

Is there a way to make it more generic?

Best regards, CC

Upvotes: 3

Views: 8730

Answers (1)

Matthew Rathbone

Reputation: 8269

Yes, it will call split each time. You can make it slightly more elegant:
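
For instance, you can call split once per row in the subquery and index into the resulting array (a sketch against your schema, untested; whether the optimizer really evaluates split only once depends on your Hive version):

INSERT OVERWRITE TABLE table2
    SELECT sum(v1), sum(v2), vars[1], vars[4], vars[17]
        FROM (SELECT
                Value1 as v1,
                Value2 as v2,
                split(Variables, ",") as vars   -- split once, keep the whole array
              FROM Table1) tmp
        GROUP BY vars[1], vars[4], vars[17]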

Why not define Variables as an array column to start with? Then you can access elements directly:

select Variables[1] from table1

I'm assuming you're using an external table, so you can do that like so:

create external table table1(variables array<string>, a int, b int)
ROW FORMAT DELIMITED
    COLLECTION ITEMS TERMINATED BY ','   -- commas delimit the elements of the array column
LOCATION 'hdfs://somewhere';
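
With the array column in place, the aggregation no longer needs split at all; roughly (a sketch, assuming a and b play the roles of Value1 and Value2):

INSERT OVERWRITE TABLE table2
    SELECT sum(a), sum(b), variables[1], variables[4], variables[17]
        FROM table1
        GROUP BY variables[1], variables[4], variables[17]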

Upvotes: 3
