horatio1701d

Reputation: 9169

Filtering Dataset in Spark with String Search

I'm trying to do a simple string filter with the Dataset API using startsWith, but I can't get the statement below to work. I can use contains like this:

  ds.filter(_.colToFilter.toString.contains("0")).show(false)

But the startsWith version below just produces an empty dataset, even though I know the string is present in the values. Not sure what I'm missing here.

  ds.filter(_.colToFilter.toString.startsWith("0")).show(false)
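For reference, a minimal sketch of the kind of setup involved (the case class, column name, and values here are made up for illustration, assuming a SparkSession in scope as spark):

  // hypothetical record type; colToFilter stands in for the real column
  case class Record(colToFilter: String)

  import spark.implicits._
  val ds = Seq(Record("0123"), Record("4567")).toDS()

  // both of these should keep only the "0123" row
  ds.filter(_.colToFilter.contains("0")).show(false)
  ds.filter(_.colToFilter.startsWith("0")).show(false)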

Upvotes: 0

Views: 3125

Answers (2)

Sivaprasanna Sethuraman

Reputation: 4132

Try the following:

val d = ds.filter($"columnToFilter".contains("0"))

or

val d = ds.filter($"columnToFilter".startsWith("0"))

Example

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Assuming we have the dataset above, the output will be:

> val d = ds.filter($"name".contains("n"))

+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+

> val d = ds.filter($"name".startsWith("A"))

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
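For completeness, here's a minimal sketch of how a dataset like the example above could be constructed (assuming a SparkSession in scope as spark; the nullable age is modeled with Option):

case class Person(age: Option[Long], name: String)

import spark.implicits._
val ds = Seq(
  Person(None, "Michael"),
  Person(Some(30L), "Andy"),
  Person(Some(19L), "Justin")
).toDS()

ds.filter($"name".startsWith("A")).show()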

Upvotes: 1

Ramesh Maharjan

Reputation: 41987

You can use the built-in substring function as follows:

Scala

import org.apache.spark.sql.functions._
// substring's position argument is 1-based in Spark
df.filter(substring(col("column_name-to-be_used"), 1, 1) === "0")

Pyspark

from pyspark.sql import functions as f
# substring's position argument is 1-based in Spark
df.filter(f.substring(f.col("column_name-to-be_used"), 1, 1) == "0")

This way you can take a substring of as many characters as you need for the starts-with check, as in the sketch below.
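For example, a sketch that checks a three-character prefix (the column name and prefix here are placeholders):

import org.apache.spark.sql.functions._

// keep rows whose value starts with "001" (positions are 1-based)
df.filter(substring(col("column_name-to-be_used"), 1, 3) === "001")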

Upvotes: 1
