Reputation: 2958
I have a dataframe which looks like this:
When I try to filter out the <'null'>, 'null' and missing value from the cancellation column using :
runner_orders\
.filter((col("cancellation").isin('null',''))&(col("cancellation").isNull()))\
.show()
the above code gives me empty dataframe like this:
+--------+---------+-----------+--------+--------+------------+
|order_id|runner_id|pickup_time|distance|duration|cancellation|
+--------+---------+-----------+--------+--------+------------+
+--------+---------+-----------+--------+--------+------------+
But, when I try using individual transformation like this:
runner_orders\
.filter(col("cancellation").isin('null',''))\
.show()
+--------+---------+-------------------+--------+----------+------------+
|order_id|runner_id| pickup_time|distance| duration|cancellation|
+--------+---------+-------------------+--------+----------+------------+
| 1| 1|2020-01-01 18:15:34| 20km|32 minutes| |
| 2| 1|2020-01-01 19:10:54| 20km|27 minutes| |
| 7| 2|2020-01-08 21:30:45| 25km| 25mins| null|
| 8| 2|2020-01-10 00:15:02| 23.4 km| 15 minute| null|
| 10| 1|2020-01-11 18:50:20| 10km| 10minutes| null|
+--------+---------+-------------------+--------+----------+------------+
runner_orders\
.filter(col("cancellation").isNull())\
.show()
+--------+---------+-------------------+--------+--------+------------+
|order_id|runner_id| pickup_time|distance|duration|cancellation|
+--------+---------+-------------------+--------+--------+------------+
| 3| 1|2020-01-03 00:12:37| 13.4km| 20 mins| null|
| 4| 2|2020-01-04 13:53:03| 23.4| 40| null|
| 5| 3|2020-01-08 21:10:57| 10| 15| null|
+--------+---------+-------------------+--------+--------+------------+
It gives me all the order_id with <'null'>,null and missing values. But when I put both condition together, it did not work.
Is there any way through which I can filter out all the order_id it where cancellation is ,'null' or missing in pyspark ?
(I know how to do it in sparksql but I want to do this in pyspark way)
Upvotes: 1
Views: 1059
Reputation: 1739
You are getting empty values because you've used &
, which will return true only if both the conditions are satisfied and is corresponding to same set of records.
Try using |
in place of &
like below -
runner_orders\
.filter((col("cancellation").isin('null','')) | (col("cancellation").isNull()))\
.show()
Upvotes: 2