Reputation: 1868
The structure of a hive table (tbl_a
) is as follows:
name | ids
A | [1,7,13,25168,992]
B | [223, 594, 3322, 192928]
C | null
...
Another hive table (tbl_b
) have the corresponding mapping between id to new_id. This table is big so cannot be loaded into memory
id | new_id
1 | 'aiks'
2 | 'ficnw'
...
I intend to create a new hive table to have the same structure as tbl_a
, but convert the array of id to the array of new_id:
name | ids
A | ['aiks','fsijo','fsdix','sssxs','wie']
B | ['cx', 'dds', 'dfsexx', 'zz']
C | null
...
Could anyone give me some idea on how to implement this scenario in spark sql or in spark DataFrame? Thanks!
Upvotes: 1
Views: 80
Reputation: 40370
This is an expensive operation but you can make it using a coalesce, explode and a left outer join as followed :
tbl_a
.withColumn("ids", coalesce($"ids", array(lit(null).cast("int"))))
.select($"name", explode($"ids").alias("id"))
.join(tbl_b, Seq("id"), "leftouter")
.groupBy("name").agg(collect_list($"new_id").alias("ids"))
.show
Upvotes: 2