horatio1701d

Reputation: 9169

Filtering Dataset in Spark with String Search

I'm trying to do a simple string filter with the Dataset API using startsWith, but I can't get the statement below to work. I can use contains like this:

  ds.filter(_.colToFilter.toString.contains("0")).show(false)

But the startsWith version below just produces an empty dataset, even though I know the string is present in the values. Not sure what I'm missing here.

  ds.filter(_.colToFilter.toString.startsWith("0")).show(false)
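For reference, a minimal sketch of the kind of setup involved (the case class, column name, and values here are made up for illustration, assuming a SparkSession in scope as spark):

  // hypothetical record type; colToFilter stands in for the real column
  case class Record(colToFilter: String)

  import spark.implicits._
  val ds = Seq(Record("0123"), Record("4567")).toDS()

  // both of these should keep only the "0123" row
  ds.filter(_.colToFilter.contains("0")).show(false)
  ds.filter(_.colToFilter.startsWith("0")).show(false)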

Upvotes: 0

Views: 3125

Answers (2)

Sivaprasanna Sethuraman

Reputation: 4132

Try the following:

val d = ds.filter($"columnToFilter".contains("0"))

or

val d = ds.filter($"columnToFilter".startsWith("0"))

Example

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Assuming we have the dataset above, the output will be:

> val d = ds.filter($"name".contains("n"))

+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+

> val d = ds.filter($"name".startsWith("A"))

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
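For completeness, here's a minimal sketch of how a dataset like the example above could be constructed (assuming a SparkSession in scope as spark; the nullable age is modeled with Option):

case class Person(age: Option[Long], name: String)

import spark.implicits._
val ds = Seq(
  Person(None, "Michael"),
  Person(Some(30L), "Andy"),
  Person(Some(19L), "Justin")
).toDS()

ds.filter($"name".startsWith("A")).show()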

Upvotes: 1

Ramesh Maharjan

Reputation: 41987

You can use the built-in substring function as follows:

Scala

import org.apache.spark.sql.functions._
// substring's position argument is 1-based in Spark
df.filter(substring(col("column_name-to-be_used"), 1, 1) === "0")

Pyspark

from pyspark.sql import functions as f
# substring's position argument is 1-based in Spark
df.filter(f.substring(f.col("column_name-to-be_used"), 1, 1) == "0")

This way you can take a substring of as many characters as you need for the starts-with check, as in the sketch below.
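For example, a sketch that checks a three-character prefix (the column name and prefix here are placeholders):

import org.apache.spark.sql.functions._

// keep rows whose value starts with "001" (positions are 1-based)
df.filter(substring(col("column_name-to-be_used"), 1, 3) === "001")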

Upvotes: 1
