vegetarianCoder

Reputation: 2958

Handling nulls and missing data in pyspark

I have a dataframe which looks like this:

Sample Data
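
For reference (in case the image doesn't load), here is a small snippet that builds a comparable dataframe; rows 6 and 9 aren't visible in the outputs below, so treat this as an approximation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Approximate reconstruction of the sample data: the cancellation column
# mixes real NULLs, the literal string 'null', and empty strings.
runner_orders = spark.createDataFrame(
    [
        (1, 1, "2020-01-01 18:15:34", "20km", "32 minutes", ""),
        (2, 1, "2020-01-01 19:10:54", "20km", "27 minutes", ""),
        (3, 1, "2020-01-03 00:12:37", "13.4km", "20 mins", None),
        (4, 2, "2020-01-04 13:53:03", "23.4", "40", None),
        (5, 3, "2020-01-08 21:10:57", "10", "15", None),
        (7, 2, "2020-01-08 21:30:45", "25km", "25mins", "null"),
        (8, 2, "2020-01-10 00:15:02", "23.4 km", "15 minute", "null"),
        (10, 1, "2020-01-11 18:50:20", "10km", "10minutes", "null"),
    ],
    ["order_id", "runner_id", "pickup_time", "distance", "duration", "cancellation"],
)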

When I try to filter the rows whose cancellation column is NULL, the string 'null', or missing (an empty string) using:

from pyspark.sql.functions import col

runner_orders\
    .filter((col("cancellation").isin('null', '')) & (col("cancellation").isNull()))\
    .show()

the above code gives me an empty dataframe like this:

+--------+---------+-----------+--------+--------+------------+
|order_id|runner_id|pickup_time|distance|duration|cancellation|
+--------+---------+-----------+--------+--------+------------+
+--------+---------+-----------+--------+--------+------------+

But when I apply each condition individually, like this:

runner_orders\
    .filter(col("cancellation").isin('null',''))\
    .show()

+--------+---------+-------------------+--------+----------+------------+
|order_id|runner_id|        pickup_time|distance|  duration|cancellation|
+--------+---------+-------------------+--------+----------+------------+
|       1|        1|2020-01-01 18:15:34|    20km|32 minutes|            |
|       2|        1|2020-01-01 19:10:54|    20km|27 minutes|            |
|       7|        2|2020-01-08 21:30:45|    25km|    25mins|        null|
|       8|        2|2020-01-10 00:15:02| 23.4 km| 15 minute|        null|
|      10|        1|2020-01-11 18:50:20|    10km| 10minutes|        null|
+--------+---------+-------------------+--------+----------+------------+

runner_orders\
    .filter(col("cancellation").isNull())\
    .show()

+--------+---------+-------------------+--------+--------+------------+
|order_id|runner_id|        pickup_time|distance|duration|cancellation|
+--------+---------+-------------------+--------+--------+------------+
|       3|        1|2020-01-03 00:12:37|  13.4km| 20 mins|        null|
|       4|        2|2020-01-04 13:53:03|    23.4|      40|        null|
|       5|        3|2020-01-08 21:10:57|      10|      15|        null|
+--------+---------+-------------------+--------+--------+------------+

Each filter on its own gives me the order_ids with NULL, 'null', or missing values. But when I put both conditions together, it does not work.

Is there any way I can filter all the order_ids where cancellation is NULL, 'null', or missing in PySpark?

(I know how to do it in Spark SQL, but I want to do it the PySpark way.)
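
For context, the Spark SQL route I'd otherwise take looks roughly like this (a sketch, assuming the dataframe is registered as a temp view):

runner_orders.createOrReplaceTempView("runner_orders")

spark.sql("""
    SELECT order_id
    FROM runner_orders
    WHERE cancellation IS NULL
       OR cancellation IN ('null', '')
""").show()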

Upvotes: 1

Views: 1059

Answers (1)

Dipanjan Mallick

Reputation: 1739

You are getting an empty dataframe because you've used &, which returns true only when both conditions are satisfied by the same row. A value can never be NULL and equal to a string at the same time, so no row matches.

Try using | in place of &, like below:

runner_orders\
    .filter((col("cancellation").isin('null','')) | (col("cancellation").isNull()))\
    .show()
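
As a side note, if the goal is instead to drop those rows and keep only the orders with a real cancellation reason, you can negate the combined condition with ~ (a sketch against the same dataframe):

# Keep only rows where cancellation holds an actual value:
# not NULL, not the literal string 'null', and not an empty string.
runner_orders\
    .filter(~(col("cancellation").isin('null', '') | col("cancellation").isNull()))\
    .show()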

Upvotes: 2
