modin3956
modin3956

Reputation: 149

If a number exists in a string, replace the string with null - Spark

I am new to Spark-Scala. I am trying to clean some data. I'm having an issue with cleaning up the FIRSTNAME and LASTNAME column. There are numbers present in the strings. How do I identify the numbers and replace the entire string with null.

Consider the following dataframe:

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
|    Steve|    10 C|
|     Mark|    9436|
|    Brian|    Lara|
+---------+--------+

How do I get this:

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
|    Steve|    null|
|     Mark|    null|
|    Brian|    Lara|
+---------+--------+

Any help would be greatly appreciated. Thank you very much!

EDIT:

scala> df2.withColumn("LASTNAME_TEMP", when(col("LASTNAME").contains("1"), null).otherwise(col("LASTNAME"))).show()
+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAME_TEMP|
+---------+--------+-------------+
|    Steve|    10 C|         null|
|     Mark|    9436|         9436|
|    Brian|    Lara|         Lara|
+---------+--------+-------------+

But the above code takes in only one string. I would prefer it to take a list of strings. Ex:

val numList = List("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")

I declared the above list and ran the following code:

scala> df2.filter(col("LASTNAME").isin(numList:_*)).show()

I got the following dataframe:

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
+---------+--------+

Upvotes: 0

Views: 1211

Answers (1)

Leo C
Leo C

Reputation: 22449

You can pattern match using regular expression with rlike:

val df = Seq(
  ("Steve", "10 C"),
  ("Mark", "9436"),
  ("Brian", "Lara")
).toDF(
  "FIRSTNAME", "LASTNAME"
)

// Keep original LASTNAME in new column only if it doesn't consist of any digit
val df2 = df.withColumn( "LASTNAMEFIXED", when( ! col("LASTNAME").rlike(".*[0-9]+.*"), col("LASTNAME") ) )

+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAMEFIXED|
+---------+--------+-------------+
|    Steve|    10 C|         null|
|     Mark|    9436|         null|
|    Brian|    Lara|         Lara|
+---------+--------+-------------+

Upvotes: 3

Related Questions