SrinR

Reputation: 1023

Filter String data in Spark dataframe where data has null values

I am looking for a way to use filter on a field in a DataFrame which has null data. Below is my sample DataFrame with two fields, id and value; the value field has a null value in it.

val testData = Array((1,"actualstring1"),(2,null),(3,"actualstring2"),(4,"testString1"))
val testDataDF = sc.parallelize(testData).toDF("id", "value")

I used the code snippet below to filter out the test strings, expecting the output to have three records. To my surprise, I got only two:

testDataDF.filter(!col("value").contains("test")).show

which gives the following result:

+---+-------------+
| id|        value|
+---+-------------+
|  1|actualstring1|
|  3|actualstring2|
+---+-------------+

Here we see that the record with id=2 is dropped by this filter. I am now stuck on how to include the row for id=2 in the output along with the two rows we are already getting.
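
I can see that the predicate itself evaluates to null for id=2 (the has_test alias below is just for readability):

testDataDF.select(col("id"), col("value").contains("test").alias("has_test")).show

+---+--------+
| id|has_test|
+---+--------+
|  1|   false|
|  2|    null|
|  3|   false|
|  4|    true|
+---+--------+

So the predicate is null rather than false for the null row, and filter only keeps rows where the predicate is true.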

Appreciate any help

Upvotes: 0

Views: 811

Answers (2)

locoyou

Reputation: 1697

You can check for null explicitly in the filter, keeping rows where value is null or does not contain "test":

testDataDF.filter(col("value").isNull || !col("value").contains("test")).show
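
With the null check in place, the output keeps the row for id=2 as well:

+---+-------------+
| id|        value|
+---+-------------+
|  1|actualstring1|
|  2|         null|
|  3|actualstring2|
+---+-------------+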

Upvotes: 1

zero323

Reputation: 330083

You can replace the current condition with one that defaults to FALSE:

not(coalesce(col("value").contains("test"), lit(false)))

where

lit(false)

is a boolean literal, and

coalesce(_, _)

returns the first NOT NULL element, counting from the left, or NULL if no such element exists.
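
Applied to the example above, the full filter would be (note that not, coalesce, and lit all come from org.apache.spark.sql.functions):

testDataDF.filter(not(coalesce(col("value").contains("test"), lit(false)))).show

For id=2 the contains call returns NULL, coalesce turns that into false, and not(false) is true, so the row is kept; for id=4 the condition is true, so not(true) drops it.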

Upvotes: 1
