Leonard

Reputation: 165

Pyspark: Filtering rows on multiple columns

I have a dataframe and I want to filter rows based on the conditions below.

df = spark.createDataFrame(
    [(101, 1, 2, 12, 34), (102, 1, 1, 45, 12), (103, 1, 34, 10, 0),
     (104, None, 2, 91, 76), (105, 2, 67, 23, 54), (107, 34, 255, 234, 99),
     (109, None, 1, 23, 55)],
    ['ID', 'A', 'B', 'C', 'D'])
df.show()

+---+----+---+---+---+
| ID|   A|  B|  C|  D|
+---+----+---+---+---+
|101|   1|  2| 12| 34|
|102|   1|  1| 45| 12|
|103|   1| 34| 10|  0|
|104|null|  2| 91| 76|
|105|   2| 67| 23| 54|
|107|  34|255|234| 99|
|109|null|  1| 23| 55|
+---+----+---+---+---+

Filtering conditions:

  1. If column A = 1, select the row
  2. If column A is null, check column B; if column B = 1, select the row

Expected output

+---+----+---+---+---+
| ID|   A|  B|  C|  D|
+---+----+---+---+---+
|101|   1|  2| 12| 34|
|102|   1|  1| 45| 12|
|103|   1| 34| 10|  0|
|109|null|  1| 23| 55|
+---+----+---+---+---+

My code:

import pyspark.sql.functions as F
df = df.filter((F.col('A') == 1) & (F.col('A').isNull()) & (F.col('B') == 1))
df.show()

I am getting an empty dataframe. Can someone help me? Thanks.

Upvotes: 0

Views: 2651

Answers (1)

akuiper

Reputation: 215127

You need an | (OR) between conditions 1 and 2 instead of &. With &, the filter requires A == 1 and A IS NULL at the same time, which is never true, so you get an empty dataframe:

import pyspark.sql.functions as F
df.filter((F.col('A') == 1) | (F.col('A').isNull() & (F.col('B') == 1)))
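
As a quick check (a sketch assuming the same df and import from the question), applying the corrected filter and calling show() should return the four expected rows:

import pyspark.sql.functions as F

# keep rows where A == 1, or where A is null and B == 1
result = df.filter((F.col('A') == 1) | (F.col('A').isNull() & (F.col('B') == 1)))
result.show()

# +---+----+---+---+---+
# | ID|   A|  B|  C|  D|
# +---+----+---+---+---+
# |101|   1|  2| 12| 34|
# |102|   1|  1| 45| 12|
# |103|   1| 34| 10|  0|
# |109|null|  1| 23| 55|
# +---+----+---+---+---+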

Upvotes: 3
