Reputation: 89
Input dataframe
import spark.implicits._
val ds = Seq(
  (1, "play Framwork"),
  (2, "Spark framework"),
  (3, "spring framework")
).toDF("id", "subject")
Given any regex, my function should remove the rows of the dataframe whose value matches that regex.
Suppose my regex is ^play.*; then my function should remove the first row and produce the following result:
val exp = Seq(
  (2, "Spark framework"),
  (3, "spring framework")
).toDF("id", "subject")
I was thinking of using a function like the one below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}

def clearValueUsingRegex(dataFrame: DataFrame, token: String, columnsToBeUpdated: List[String]): DataFrame = {
  Logger.debug(s"Inside clearValueUsingRegex : token :$token , columnsToBeUpdated : $columnsToBeUpdated")
  if (isValidRegex(token)) {
    // Replace every match of the regex with an empty string, one column at a time
    columnsToBeUpdated.foldLeft(dataFrame) { (dataset, columnName) =>
      dataset.withColumn(columnName, regexp_replace(col(columnName), token, ""))
    }
  } else {
    throw new NotValidRegularExpression(s"$token is not valid regex.")
  }
}
The problem with this function is that it only blanks out the matching cell value; it does not remove the whole row, which is the result I expect.
Upvotes: 2
Views: 3025
Reputation: 1771
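You can use the filter function together with rlike. Because filter keeps the rows for which the condition is true, negate the condition to remove the matching rows:
df.filter(!($"columnName" rlike "^play.*"))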
http://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/Dataset.html
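To drop matching rows across several columns, as the function in the question tries to do, the per-column rlike tests can be combined and negated inside a single filter. The following is only a minimal sketch: the object name RemoveRowsByRegex and the helper removeRowsMatching are hypothetical names chosen for illustration, and it assumes that a row should be removed when any of the listed columns matches and that a null cell counts as "no match".

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}

object RemoveRowsByRegex {

  // Hypothetical helper (not part of Spark): returns a DataFrame without the
  // rows in which at least one of `columns` matches the regex `token`.
  def removeRowsMatching(df: DataFrame, token: String, columns: List[String]): DataFrame = {
    // Fail fast with PatternSyntaxException if the regex is malformed.
    java.util.regex.Pattern.compile(token)

    // One rlike test per column. rlike yields null for a null cell, so
    // coalesce turns that into false (assumption: null means "no match").
    val anyMatch: Column = columns
      .map(c => coalesce(col(c).rlike(token), lit(false)))
      .foldLeft(lit(false))(_ || _)

    // Keep only the rows where no listed column matched.
    df.filter(!anyMatch)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("remove-rows-by-regex")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      (1, "play Framwork"),
      (2, "Spark framework"),
      (3, "spring framework")
    ).toDF("id", "subject")

    // With the sample data from the question, only ids 2 and 3 should remain.
    removeRowsMatching(ds, "^play.*", List("subject")).show()

    spark.stop()
  }
}
```

Note that rlike uses Java regular expressions and is case-sensitive, so "Spark framework" is not affected by ^play.* here.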
Upvotes: 2