Antonius

Reputation: 77

Check if array columns have overlapping elements

I have a dataframe with two array columns, looking as follows:

Arrayed_Column_1
[{"ID":222222,"No":2},{"ID":333333,"No":1}]
[{"ID":555555,"No":2},{"ID":333333,"No":1},{"ID":333333,"No":3}]
[{"ID":222222,"No":2},{"ID":555555,"No":1},{"ID":333333,"No":3}]
[{"ID":555555,"No":2},{"ID":333333,"No":1}]

Arrayed_Column_2
[{"ID":333333,"No":2},{"ID":666663,"No":1}]
[{"ID":333333,"No":2},{"ID":666666,"No":1},{"ID":333333,"No":3}]
[{"ID":222222,"No":2},{"ID":555555,"No":1},{"ID":333333,"No":3}]
[{"ID":555333,"No":2},{"ID":66666,"No":1}]

How can I filter on those rows in which a combination of an ID and No in column_1 is also present in column_2, without using the explode function?

I know the array_contains function, but it only checks for particular values.
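To make the goal concrete, here is a plain-Python sketch (not Spark) of the desired filter on the sample rows above, treating each struct as an (ID, No) pair: a row should be kept whenever its two arrays share at least one pair.

```python
# Plain-Python sketch of the desired filter: a row passes when its two
# arrays share at least one (ID, No) combination.
rows = [
    ([(222222, 2), (333333, 1)],               [(333333, 2), (666663, 1)]),
    ([(555555, 2), (333333, 1), (333333, 3)],  [(333333, 2), (666666, 1), (333333, 3)]),
    ([(222222, 2), (555555, 1), (333333, 3)],  [(222222, 2), (555555, 1), (333333, 3)]),
    ([(555555, 2), (333333, 1)],               [(555333, 2), (66666, 1)]),
]

# Keep the indices of rows whose pair sets intersect.
kept = [i for i, (a, b) in enumerate(rows) if set(a) & set(b)]
print(kept)  # [1, 2] — only the second and third rows share a pair
```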

Upvotes: 0

Views: 1626

Answers (2)

blackbishop

Reputation: 32670

You can also use exists + array_contains:

df1 = df.filter(
    "exists(Arrayed_Column_1, x -> array_contains(Arrayed_Column_2, x))"
)
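In plain Python terms (a sketch, not Spark), this predicate asks whether any element of the first array also appears, as a whole struct, in the second:

```python
# Sketch of the exists + array_contains predicate, using the second sample
# row: any element of col1 that also appears in col2 makes the row pass.
col1 = [{"ID": 555555, "No": 2}, {"ID": 333333, "No": 1}, {"ID": 333333, "No": 3}]
col2 = [{"ID": 333333, "No": 2}, {"ID": 666666, "No": 1}, {"ID": 333333, "No": 3}]

matches = any(x in col2 for x in col1)
print(matches)  # True: {"ID": 333333, "No": 3} is in both arrays
```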

Upvotes: 2

mck

Reputation: 42352

Try using arrays_overlap:

import pyspark.sql.functions as F

# Project each struct onto the fields being compared (ID and No), so that
# field order and any extra fields do not affect struct equality.
col1 = F.expr('transform(Arrayed_Column_1, x -> struct(x.ID as ID, x.No as No))')
col2 = F.expr('transform(Arrayed_Column_2, x -> struct(x.ID as ID, x.No as No))')

df2 = df.filter(F.arrays_overlap(col1, col2))

Another way is to check that array_intersect returns a non-empty array:

df2 = df.filter(F.size(F.array_intersect(col1, col2)) != 0)

Upvotes: 2
