Reputation: 149
I am new to Spark-Scala. I am trying to clean some data. I'm having an issue with cleaning up the FIRSTNAME and LASTNAME column. There are numbers present in the strings. How do I identify the numbers and replace the entire string with null.
Consider the following dataframe:
+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
| Steve| 10 C|
| Mark| 9436|
| Brian| Lara|
+---------+--------+
How do I get this:
+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
| Steve| null|
| Mark| null|
| Brian| Lara|
+---------+--------+
Any help would be greatly appreciated. Thank you very much!
EDIT:
scala> df2.withColumn("LASTNAME_TEMP", when(col("LASTNAME").contains("1"), null).otherwise(col("LASTNAME"))).show()
+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAME_TEMP|
+---------+--------+-------------+
| Steve| 10 C| null|
| Mark| 9436| 9436|
| Brian| Lara| Lara|
+---------+--------+-------------+
But the above code takes in only one string. I would prefer it to take a list of strings. Ex:
val numList = List("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
I declared the above list and ran the following code:
scala> df2.filter(col("LASTNAME").isin(numList:_*)).show()
I got the following dataframe:
+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
+---------+--------+
Upvotes: 0
Views: 1211
Reputation: 22449
You can pattern match using regular expression with rlike
:
val df = Seq(
("Steve", "10 C"),
("Mark", "9436"),
("Brian", "Lara")
).toDF(
"FIRSTNAME", "LASTNAME"
)
// Keep original LASTNAME in new column only if it doesn't consist of any digit
val df2 = df.withColumn( "LASTNAMEFIXED", when( ! col("LASTNAME").rlike(".*[0-9]+.*"), col("LASTNAME") ) )
+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAMEFIXED|
+---------+--------+-------------+
| Steve| 10 C| null|
| Mark| 9436| null|
| Brian| Lara| Lara|
+---------+--------+-------------+
Upvotes: 3