Bella_18
Bella_18

Reputation: 632

How to compare two array of string columns in Pyspark

I want to compare two arrays and filter the data frame

condition_1 = AAA 
condition_2 = ["AAA","BBB","CCC"]

My spark data frame has a column with array of strings

df = df.withColumn("array_column", F.lit(["XXX","YYY","AAA"]))

# to filter a string condition_1 with the array column

    df = df.filter(
           F.col('array_column').isin(condition_1) &
           # second filter here
    )

But how can I filter condition_2 in in a similar way? since they are both arrays?

Code I tried:

    df = df.filter(
           F.col('array_column').isin(condition_1) &
          any(x in condition_2 for x in F.col('array_column'))
    )
 

But I get an error - Column is not iterable.

I also tried - bool(set(F.col('array_column')).intersection(condition_2))

But still have the same error. Can anyone help me with this?

Upvotes: 0

Views: 1024

Answers (1)

wwnde
wwnde

Reputation: 26676

Hope I got your question right. It wasnt as clear. Use pyspark's array functions

Data

condition_1 = 'AAA' 
condition_2 = ["AAA","BBB","CCC"]
df=spark.createDataFrame([('1A',    '3412asd','value-1',    ['XXX', 'YYY', 'AAA']),
('2B',  '2345tyu','value-2',    ['DDD', 'YFFFYY', 'GGG']),
('3C',  '9800bvd',  'value-3',  ['AAA']),
 ('3C', '9800bvd',  'value-1',  ['AAA', 'YYY', 'CCCC'])],
('ID',  'Company_Id',   'value' ,'array_column'))


df.show()
+---+----------+-------+------------------+
| ID|Company_Id|  value|      array_column|
+---+----------+-------+------------------+
| 1A|   3412asd|value-1|   [XXX, YYY, AAA]|
| 2B|   2345tyu|value-2|[DDD, YFFFYY, GGG]|
| 3C|   9800bvd|value-3|             [AAA]|
| 3C|   9800bvd|value-1|  [AAA, YYY, CCCC]|
+---+----------+-------+------------------+

Code

df.where((array_contains(col('array_column'), lit(condition_1)))&(size(array_intersect(col('array_column'),array([lit(x) for x in condition_2])))!=0)).show(truncate=False)

Outcome

+---+----------+-------+----------------+
|ID |Company_Id|value  |array_column    |
+---+----------+-------+----------------+
|1A |3412asd   |value-1|[XXX, YYY, AAA] |
|3C |9800bvd   |value-3|[AAA]           |
|3C |9800bvd   |value-1|[AAA, YYY, CCCC]|
+---+----------+-------+----------------+

How it works

condition_1 ; get a boolean selection of where column contains string

array_contains(col('array_column'), lit(condition_1))

condition_2 ; This happens in stages

Intersect column with the list

 array_intersect(col('array_column'),array([lit(x) for x in condition_2]))

get the size of the outcome of 1 above

size(array_intersect(col('array_column'),array([lit(x) for x in` condition_2])))

Check that the intersection contains at least one item

 size(array_intersect(col('array_column'),array([lit(x) for x in condition_2])))!=0

Finally, chain condition_1 and condition_2 using operant & and pass into the df.where() or df.filter() methods

Upvotes: 1

Related Questions