Reputation: 397
Hello I'd like to join on array intersection. I've found an arrays_overlap function on spark -- yet I cannot seem to get it to work. I've also tried writing a custom UDF to no avail. There error message I receive is "requires attributes from more than one child". I have no idea what that means making this hard to debug. Am I missing something basic about this? Using Hive/pyspark
Sample queries:
select
a.id
from tbl a
JOIN tbl_b b
ON arrays_overlap(a.my_arr, b.my_arr) = TRUE
def _arrays_overlap(a,b):
for item in a:
if item in b:
return True
return False
spark.udf.register(
"_arrays_overlap",
_arrays_overlap,
BooleanType()
)
select
a.id
from tbl a
JOIN tbl_b b
ON _arrays_overlap(a.my_arr, b.my_arr) = TRUE
What am I missing here?
Upvotes: 0
Views: 1537
Reputation: 31490
From Spark-2.4
:
We can use array_intersect
function with size to join only the rows size !=0.
arrays_overlap
and joining on boolean TRUE
Example:
df=spark.createDataFrame([(1,[1,2,3]),(2,[4,5])],["id","my_arr"])
df1=spark.createDataFrame([(1,[2,3]),(2,[8,9])],["id","my_arr"])
df.createOrReplaceTempView("tbl")
df1.createOrReplaceTempView("tbl_b")
spark.sql("select a.id from tbl a join tbl_b b on arrays_overlap(a.my_arr,b.my_arr) = TRUE").show()
spark.sql("select a.id from tbl a join tbl_b b on SIZE(array_intersect(a.my_arr,b.my_arr)) != 0").show()
#+---+
#| id|
#+---+
#| 1|
#+---+
Upvotes: 2