Reputation: 1023
I am looking for a way to use filter on a filed in DataFrame which has null data. Below is my sample DataFrame with two fields: id and value. value field has a null value in it.
val testData = Array((1,"actualstring1"),(2,null),(3,"actualstring2"),(4,"testString1"))
val testDataDF = sc.parallelize(testData).toDF("id", "value")
I used the below code snippet to filter out the test strings assuming the output to have three records. To my surprise I've got only below two records:
testDataDF.filter(!col("value").contains("test")).show
which gives the below result:
+---+-------------+
| id| value|
+---+-------------+
| 1|actualstring1|
| 3|actualstring2|
+---+-------------+
here we see that the record with id=2
is ignored in this filteration process.
Im now stuck how to include the row for id=2
aswell in the output along with the two rows we are getting.
Appreciate any help
Upvotes: 0
Views: 811
Reputation: 1697
You can test col("value") != null
in the filter:
testDataDF.filter(col("value") != null && !col("value").contains("test")).show
Upvotes: 1
Reputation: 330083
You replace current condition with a one which defaults to FALSE
:
not(coalesce(col("value").contains("test"), lit(false))
where
lit(false)
is a boolean literal, and
coalesce(_, _)
returns the first NOT NULL
element, counting from the left, or NULL
if such element doesn't exist.
Upvotes: 1