Henri Thorpe

Reputation: 25

How can I capitalize specific words in a Spark column?

I am trying to capitalize some words in a column of my Spark DataFrame. The words are all in a list.

val wrds = List("usa", "gb")

val dF = List(
    (1, "z",3, "Bob lives in the usa"),
    (4, "t", 2, "gb is where Beth lives"),
    (5, "t", 2, "ogb")
  ).toDF("id", "name", "thing", "country")

I would like to have an output of 

val dF = List(
    (1, "z",3, "Bob lives in the USA"),
    (4, "t", 2, "GB is where Beth lives")
    (5, "t", 2, "ogb")
  ).toDF("id", "name", "thing", "country")

It seems I have to split the string in the column and then capitalize each part if it appears in the list. I am mostly struggling with row 3, where I do not want to capitalize "ogb" even though it contains "gb". Could anyone point me in the right direction?

Upvotes: 1

Views: 260

Answers (1)

memoryz

Reputation: 494

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $"..." and toDF syntax; assumes a SparkSession named spark (as in spark-shell)

val words = Array("usa","gb")

val df = List(
    (1, "z",3, "Bob lives in the usa"),
    (4, "t", 2, "gb is where Beth lives"),
    (5, "t", 2, "ogb")
  ).toDF("id", "name", "thing", "country")

val replaced = words.foldLeft(df) {
  case (adf, word) =>
    // \b word boundaries restrict the replacement to whole-word matches,
    // so the "gb" inside "ogb" is left alone while a standalone "gb" becomes "GB".
    adf.withColumn("country", regexp_replace($"country", "(\\b" + word + "\\b)", word.toUpperCase))
}

replaced.show

Output:

+---+----+-----+--------------------+
| id|name|thing|             country|
+---+----+-----+--------------------+
|  1|   z|    3|Bob lives in the USA|
|  4|   t|    2|GB is where Beth ...|
|  5|   t|    2|                 ogb|
+---+----+-----+--------------------+
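If you prefer a single pass over the column instead of one regexp_replace call per word, here is a minimal sketch (using the same words and df as above; the UDF name upperMatches is just chosen for illustration) that combines the words into one alternation and uppercases each whole-word match:

import org.apache.spark.sql.functions.udf

// Build one alternation such as \busa\b|\bgb\b so the column is scanned once.
val pattern = words.map(w => "\\b" + w + "\\b").mkString("|").r

// UDF that uppercases every whole-word match; "ogb" is still untouched
// because \b requires a word boundary on both sides of the match.
val upperMatches = udf { (s: String) =>
  pattern.replaceAllIn(s, m => m.matched.toUpperCase)
}

val replacedOnce = df.withColumn("country", upperMatches($"country"))
replacedOnce.show

This trades the per-word foldLeft for a single scan of each value; for a handful of words the difference is negligible, but it can help when the word list is long.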

Upvotes: 1
