Reputation: 77
I have a DataFrame with two array columns, which look as follows:
Arrayed_Column_1
[{"ID":222222,"No":2},{"ID":333333,"No":1}]
[{"ID":555555,"No":2},{"ID":333333,"No":1},{"ID":333333,"No":3}]
[{"ID":222222,"No":2},{"ID":555555,"No":1},{"ID":333333,"No":3}]
[{"ID":555555,"No":2},{"ID":333333,"No":1}]
Arrayed_Column_2
[{"ID":333333,"No":2},{"ID":666663,"No":1}]
[{"ID":333333,"No":2},{"ID":666666,"No":1},{"ID":333333,"No":3}]
[{"ID":222222,"No":2},{"ID":555555,"No":1},{"ID":333333,"No":3}]
[{"ID":555333,"No":2},{"ID":66666,"No":1}]
How can I filter for the rows in which a combination of ID and No from column_1 is also present in column_2, without using the explode function?
I know about the array_contains function, but it only checks for particular values.
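For reference, a minimal sketch that rebuilds the sample data; the schema (array&lt;struct&lt;ID: bigint, No: int&gt;&gt;) is an assumption read off the printed rows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction of the sample DataFrame shown above;
# each array element is a struct<ID: bigint, No: int> (schema assumed).
df = spark.createDataFrame(
    [
        ([(222222, 2), (333333, 1)],
         [(333333, 2), (666663, 1)]),
        ([(555555, 2), (333333, 1), (333333, 3)],
         [(333333, 2), (666666, 1), (333333, 3)]),
        ([(222222, 2), (555555, 1), (333333, 3)],
         [(222222, 2), (555555, 1), (333333, 3)]),
        ([(555555, 2), (333333, 1)],
         [(555333, 2), (66666, 1)]),
    ],
    "Arrayed_Column_1 array<struct<ID: bigint, No: int>>, "
    "Arrayed_Column_2 array<struct<ID: bigint, No: int>>",
)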
Upvotes: 0
Views: 1626
Reputation: 32670
You can also use exists + array_contains:
df1 = df.filter(
    # Keep rows where at least one (ID, No) struct of Arrayed_Column_1
    # also appears, as an exact struct match, in Arrayed_Column_2.
    "exists(Arrayed_Column_1, x -> array_contains(Arrayed_Column_2, x))"
)
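If you prefer the DataFrame API over an SQL expression string, the same predicate can be written with the native exists and array_contains functions; a sketch, assuming Spark 3.1+ (where pyspark.sql.functions.exists is available):
import pyspark.sql.functions as F

# Same check as above: true if any struct of Arrayed_Column_1 is
# contained (whole-struct equality on ID and No) in Arrayed_Column_2.
df1 = df.filter(
    F.exists("Arrayed_Column_1", lambda x: F.array_contains("Arrayed_Column_2", x))
)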
Upvotes: 2
Reputation: 42352
Try using arrays_overlap:
import pyspark.sql.functions as F

# Rebuild both arrays with an identical struct<ID, No> element type so
# arrays_overlap can compare the structs directly.
col1 = F.expr('transform(Arrayed_Column_1, x -> struct(x.ID as ID, x.No as No))')
col2 = F.expr('transform(Arrayed_Column_2, x -> struct(x.ID as ID, x.No as No))')
df2 = df.filter(F.arrays_overlap(col1, col2))
Another way is to check the size of array_intersect:
# Equivalent check: the intersection of the two arrays is non-empty.
df2 = df.filter(F.size(F.array_intersect(col1, col2)) != 0)
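Of the two, arrays_overlap can return as soon as it finds one common element, while array_intersect materializes the full intersection before its size is checked, so the first form is typically cheaper on long arrays. On the sample data, both filters keep rows 2 and 3: row 2 shares the struct {"ID":333333,"No":3}, and row 3's two arrays are identical.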
Upvotes: 2