Reputation: 29
I get, as input to a function in Scala, a DataFrame with a column named vin. The column has values in the formats below:
1. UJ123QR8467
2. 0UJ123QR846
3. /UJ123QR8467
4. -UJ123QR8467
and so on.
The requirement is to clean the vin column based on the following rules:
1. replace the characters `/`, `_`, and `-` with `""`
2. replace a leading `0` with `""`
3. if the value is longer than 10 characters, set it to NULL.
I would like to know whether there is a simpler way to achieve the above. All I can think of is chaining multiple `.withColumn` calls, with a `regexp_replace` each time.
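For reference, rules 1 and 2 can be collapsed into a single regular expression, with rule 3 as a separate length check. A minimal plain-Scala sketch of the intended transformation (outside Spark, with a hypothetical helper name), just to pin down the semantics:

```scala
// Hypothetical helper illustrating the cleaning rules on plain strings:
// rules 1 and 2 collapse into one regex — a leading 0 (^0) or any of / _ -
// rule 3 maps over-long results to None (NULL in the DataFrame).
def cleanVin(vin: String): Option[String] = {
  val stripped = vin.replaceAll("^0|[/_-]", "")
  if (stripped.length > 10) None else Some(stripped)
}

println(cleanVin("0UJ123QR846"))  // Some(UJ123QR846)
println(cleanVin("/UJ123QR8467")) // None — 11 characters remain after stripping
```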
Upvotes: 0
Views: 2471
Reputation: 22439
I would combine all the regex-related changes into a single transformation and apply the length condition in another, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and $-notation (requires a SparkSession named spark in scope)

val df = Seq(
  "UJ123QR8467", "0UJ123QR846", "/UJ123QR8467",
  "-UJ123QR8467", "UJ0123QR84", "UJ123-QR_846"
).toDF("vin")

df.
  withColumn("vin2", regexp_replace($"vin", "^[0]|[/_-]", "")).
  withColumn("vin2", when(length($"vin2") <= 10, $"vin2")).
  show
// +------------+----------+
// | vin| vin2|
// +------------+----------+
// | UJ123QR8467| null|
// | 0UJ123QR846|UJ123QR846|
// |/UJ123QR8467| null|
// |-UJ123QR8467| null|
// | UJ0123QR84|UJ0123QR84|
// |UJ123-QR_846|UJ123QR846|
// +------------+----------+
Note that I've slightly expanded the sample dataset to cover additional cases, such as a non-leading 0 and the [/_-] characters appearing mid-string.
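As a side note, the `^[0]` alternation relies on the anchor matching only at the start of the string, so interior or repeated zeros survive. Since Spark's `regexp_replace` uses Java regex syntax, this can be sanity-checked outside Spark with plain `String.replaceAll`:

```scala
// Same pattern as in regexp_replace above, applied via Java's regex engine.
val pattern = "^[0]|[/_-]"
println("00UJ123QR84".replaceAll(pattern, ""))  // 0UJ123QR84 — only the first 0 is anchored
println("UJ0123QR84".replaceAll(pattern, ""))   // UJ0123QR84 — interior 0 untouched
println("UJ123-QR_846".replaceAll(pattern, "")) // UJ123QR846 — / _ - removed anywhere
```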
Upvotes: 1