HHH

Reputation: 6475

how to replace a string in Spark DataFrame using regexp

I have a DataFrame in Spark and I would like to replace the values of several columns based on a simple rule: if a value ends with "_P", replace it with "1", and if it ends with "_N", replace it with "-1". The same replacement needs to be applied to multiple columns, and I also need to do a cast at the end.

Upvotes: 1

Views: 3484

Answers (1)

alghimo

Reputation: 2899

You can do it with expressions like "when('column.endsWith("_P"), lit("1")).when...". The same could be achieved with regexp_replace. Here's an example using when:

val myDf = sc.parallelize(Array(
    ("foo_P", "bar_N", "123"),
    ("foo_N", "bar_Y", "123"),
    ("foo", "bar", "123"),
    ("foo_Y", "bar_XX", "123")
)).toDF("col1", "col2", "col3")

val colsToReplace = Seq("col1", "col2")

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, lit}

val castValues = (colName: String) => {
    val col = new Column(colName)

    when(col.endsWith("_P"), lit("1"))
    .when(col.endsWith("_N"), lit("-1"))
    .otherwise(col)
    .as(colName)
}

val selectExprs = myDf.columns.diff(colsToReplace).map(new Column(_)) ++ colsToReplace.map(castValues)

myDf.select(selectExprs:_*).show
/*
+----+-----+------+
|col3| col1|  col2|
+----+-----+------+
| 123|    1|    -1|
| 123|   -1| bar_Y|
| 123|  foo|   bar|
| 123|foo_Y|bar_XX|
+----+-----+------+
*/
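For completeness, here's a sketch of the regexp_replace variant mentioned above. It reuses the myDf and colsToReplace values from the snippet before it; the anchored pattern "^.*_P$" matches the whole value, so the entire string is replaced rather than just the suffix:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Sketch: same replacement, done with two nested regexp_replace calls.
val regexValues = (colName: String) => {
    regexp_replace(
        regexp_replace(col(colName), "^.*_P$", "1"),
        "^.*_N$", "-1"
    ).as(colName)
}

val regexExprs = myDf.columns.diff(colsToReplace).map(col) ++ colsToReplace.map(regexValues)
myDf.select(regexExprs: _*).show

For the casting you mentioned, you could append a .cast(...) inside regexValues, but note that values that match neither pattern (e.g. "foo") would become null when cast to a numeric type.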

EDIT

By the way, regarding your comment on what you tried: the "df.na" functions are meant to work on rows containing NULL values, so even if what you tried had worked, it would only apply to rows containing nulls. Apart from that, "replace" doesn't work with regular expressions, at least it didn't the last time I checked.
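To illustrate: na.replace only matches exact values, not patterns, so you would have to enumerate every literal string you want replaced (a sketch, reusing myDf from above):

// Sketch: na.replace takes a map of exact values, no regex support,
// so "foo_P" matches but a pattern like ".*_P" would not.
val replacedDf = myDf.na.replace("col1", Map("foo_P" -> "1", "foo_N" -> "-1"))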

Cheers

Upvotes: 3
