rohit
rohit

Reputation: 96

Hive query to transform an array of struct to array of string

My use case is as following:

I want to copy data from Table-A to Table-B and transform field1 from array of struct to array of string where the string is the val1 property of struct in table-A and ignore val2.

Table-A:
field1: array<struct<val1: str, val2: int>>
sample data:
[{val1: "abc", val2: 123}, {val1: "def", val2: 456}], [{val1: "xyz", val2: 789}]

Table-B:
field1: array<string>
sample data:
["abc", "def"], ["xyz"]

I am not able to figure out how I can select the field1 column with transformation on it through hive query language.

The things I was able to figure out was I could explode the array, perform a select of the val1 and then do collect_list but after trying multiple times could not get the correct syntax.

My query was like:

select collect_list(select col.val1 
  from explode(field1) as col) from table-A

I also want to strictly do it through hiveql and not through a udf in python.

Thank you.

Upvotes: 0

Views: 5715

Answers (2)

Rogier Peters
Rogier Peters

Reputation: 21

There is some magic in the handling of arrays that allow you to do this:

select t.myarray as original_array, t.myarray.val1 from mydata t

I.e. selecting struct field val1 from an array of structs will return an array of val1

From http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html

Upvotes: 0

leftjoin
leftjoin

Reputation: 38325

Explode original array using Lateral View + explode and collect array of struct.val1 using collect_set() or collect_list():

with mydata as (--This is your data example, use your table instead of this CTE
select stack (2,
array(named_struct("val1", "abc", "val2", 123), named_struct("val1", "def", "val2", 456)), 
array(named_struct("val1", "xyz", "val2", 789))
) as myarray
)

select t.myarray as original_array, collect_set(s.val1) as result_array
  from mydata t
       lateral view explode(myarray) e as s --struct
group by t.myarray 

Result:

original_array                                          result_array
[{"val1":"abc","val2":123},{"val1":"def","val2":456}]   ["abc","def"]
[{"val1":"xyz","val2":789}]                             ["xyz"]

Your structure can be also declared as map, not struct. In this case use s['val1'] instead of s.val1 to get map element.

Upvotes: 1

Related Questions