Narayanan
Narayanan

Reputation: 29

Multiple regex replace together in scala

I get as input to a function in scala a dataframe that has a column named vin. The column has values in the below format

1. UJ123QR8467
2. 0UJ123QR846
3. /UJ123QR8467
4. -UJ123QR8467

and so on.

The requirement is to clean the column vin based on the following rules.

1. replace **"/_-** as ""
2. replace first 0 as ""
3. if the value is more than 10 characters then make the value as NULL.

I would like to know is there any simplified way to achieve the above. I can only think of doing multiple .withcolumn during regex replace every time.

Upvotes: 0

Views: 2471

Answers (1)

Leo C
Leo C

Reputation: 22439

I would combine all Regex related changes in a single transformation and the length condition in another, as shown below:

import org.apache.spark.sql.functions._

val df = Seq(
  "UJ123QR8467", "0UJ123QR846", "/UJ123QR8467",
  "-UJ123QR8467", "UJ0123QR84", "UJ123-QR_846"
).toDF("vin")

df.
  withColumn("vin2", regexp_replace($"vin", "^[0]|[/_-]", "")).
  withColumn("vin2", when(length($"vin2") <= 10, $"vin2")).
  show
// +------------+----------+
// |         vin|      vin2|
// +------------+----------+
// | UJ123QR8467|      null|
// | 00UJ123QR84|0UJ123QR84|
// |/UJ123QR8467|      null|
// |-UJ123QR8467|      null|
// |  UJ0123QR84|UJ0123QR84|
// |UJ123-QR_846|UJ123QR846|
// +------------+----------+

Note that I've slightly expanded the sample dataset to cover cases such as non-leading 0, [/_-].

Upvotes: 1

Related Questions