user1352683
user1352683

Reputation: 397

Join on Array Intersection

Hello I'd like to join on array intersection. I've found an arrays_overlap function on spark -- yet I cannot seem to get it to work. I've also tried writing a custom UDF to no avail. There error message I receive is "requires attributes from more than one child". I have no idea what that means making this hard to debug. Am I missing something basic about this? Using Hive/pyspark

Sample queries:

select
a.id
from tbl a
JOIN tbl_b b
ON arrays_overlap(a.my_arr, b.my_arr) = TRUE

def _arrays_overlap(a,b):
    for item in a:
        if item in b:
            return True
    return False

spark.udf.register(
    "_arrays_overlap",
    _arrays_overlap,
    BooleanType()
)


select
a.id
from tbl a
JOIN tbl_b b
ON _arrays_overlap(a.my_arr, b.my_arr) = TRUE

What am I missing here?

Upvotes: 0

Views: 1537

Answers (1)

notNull
notNull

Reputation: 31490

From Spark-2.4:

We can use array_intersect function with size to join only the rows size !=0.

  • (or) arrays_overlap and joining on boolean TRUE

Example:

df=spark.createDataFrame([(1,[1,2,3]),(2,[4,5])],["id","my_arr"])
df1=spark.createDataFrame([(1,[2,3]),(2,[8,9])],["id","my_arr"])
df.createOrReplaceTempView("tbl")
df1.createOrReplaceTempView("tbl_b")

spark.sql("select a.id from tbl a join tbl_b b on arrays_overlap(a.my_arr,b.my_arr) = TRUE").show()

spark.sql("select a.id from tbl a join tbl_b b on SIZE(array_intersect(a.my_arr,b.my_arr)) != 0").show()

#+---+
#| id|
#+---+
#|  1|
#+---+

Upvotes: 2

Related Questions