Reputation: 6475
I have a DataFrame in Spark and I would like to replace the values of several columns based on a simple rule: if a value ends with "_P", replace it with "1", and if it ends with "_N", replace it with "-1". There are multiple columns that need the same replacement, and I also need to cast the results at the end.
Upvotes: 1
Views: 3484
Reputation: 2899
You can do it through expressions like when($"column".endsWith("_P"), lit("1")).when(...). The same could be achieved with regexp_replace; a sketch of that route follows the example below. Here's an example using when:
import org.apache.spark.sql.functions.{col, lit, when}

val myDf = sc.parallelize(Array(
  ("foo_P", "bar_N", "123"),
  ("foo_N", "bar_Y", "123"),
  ("foo", "bar", "123"),
  ("foo_Y", "bar_XX", "123")
)).toDF("col1", "col2", "col3")

val colsToReplace = Seq("col1", "col2")

// Build a column expression that turns a trailing "_P" into "1",
// a trailing "_N" into "-1", and leaves anything else as-is.
val replaceValues = (colName: String) => {
  val c = col(colName)
  when(c.endsWith("_P"), lit("1"))
    .when(c.endsWith("_N"), lit("-1"))
    .otherwise(c)
    .as(colName)
}

// Keep the untouched columns as-is and apply the replacement to the rest.
val selectExprs = myDf.columns.diff(colsToReplace).map(col) ++ colsToReplace.map(replaceValues)

myDf.select(selectExprs: _*).show
/*
+----+-----+------+
|col3| col1|  col2|
+----+-----+------+
| 123|    1|    -1|
| 123|   -1| bar_Y|
| 123|  foo|   bar|
| 123|foo_Y|bar_XX|
+----+-----+------+
*/
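For the regexp_replace route mentioned above, here is a minimal sketch along the same lines. The anchored patterns replace the whole value whenever the suffix matches; the name viaRegex is mine, not from the original code:

import org.apache.spark.sql.functions.regexp_replace

val viaRegex = colsToReplace.foldLeft(myDf) { (df, name) =>
  // "^.*_P$" matches the entire value when it ends with "_P",
  // so the whole cell is replaced, not just the suffix.
  df.withColumn(name,
    regexp_replace(regexp_replace(df(name), "^.*_P$", "1"), "^.*_N$", "-1"))
}

The question also mentions a cast at the end. Assuming the replaced columns should become integers (replacedDf and castedDf are hypothetical names), one way to finish up:

import org.apache.spark.sql.types.IntegerType

val replacedDf = myDf.select(selectExprs: _*)
// Values that still aren't numeric, such as "bar_Y", become null after the cast.
val castedDf = colsToReplace.foldLeft(replacedDf) { (df, name) =>
  df.withColumn(name, df(name).cast(IntegerType))
}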
EDIT
By the way, regarding your comment on what you tried: the "df.na" functions are meant to work on rows containing NULL values, so even if what you tried had worked, it would only have affected rows containing nulls. Apart from that, "replace" doesn't work with regular expressions, at least it didn't the last time I checked.
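To illustrate the exact-match semantics, this is roughly what na.replace supports, a literal value-to-value Map with no pattern matching (the Map contents here are just a hypothetical example):

// Replaces only cells whose entire value equals a key of the Map.
myDf.na.replace("col1", Map("foo_P" -> "1", "foo_N" -> "-1"))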
Cheers
Upvotes: 3