Do UDF (which another spark job is needed) to each element of array column in SparkSQL

Question

The structure of a hive table (tbl_a) is as follows:

name | ids
A    | [1,7,13,25168,992]
B    | [223, 594, 3322, 192928]
C    | null
...

Another hive table (tbl_b) have the corresponding mapping between id to new_id. This table is big so cannot be loaded into memory

id | new_id
1  | 'aiks'
2  | 'ficnw'
...

I intend to create a new hive table to have the same structure as tbl_a, but convert the array of id to the array of new_id:

name | ids
A    | ['aiks','fsijo','fsdix','sssxs','wie']
B    | ['cx', 'dds', 'dfsexx', 'zz']
C    | null
...

Could anyone give me some idea on how to implement this scenario in spark sql or in spark DataFrame? Thanks!

eliasah · Accepted Answer

This is an expensive operation but you can make it using a coalesce, explode and a left outer join as followed :

tbl_a
    .withColumn("ids", coalesce($"ids", array(lit(null).cast("int"))))
    .select($"name", explode($"ids").alias("id"))
    .join(tbl_b, Seq("id"), "leftouter")
    .groupBy("name").agg(collect_list($"new_id").alias("ids"))
    .show

Do UDF (which another spark job is needed) to each element of array column in SparkSQL

Answers (1)

Related Questions