Defcon

Reputation: 817

Converting Array of Strings to String with different delimiters in Spark Scala

I want to convert an array of strings in a dataframe to a single string, using a delimiter other than a comma and removing the array brackets. I want the "," replaced with ";#", since the elements may themselves contain "," (it is a free-form text field). I am using Spark 1.6.

Examples below:

Schema:

root
 |-- carLineName: array (nullable = true)
 |    |-- element: string (containsNull = true)

Input as Dataframe:

+--------------------+
|carLineName         |
+--------------------+
|[Avalon,CRV,Camry]  |
|[Model T, Model S]  |
|[Cayenne, Mustang]  |
|[Pilot, Jeep]       |
+--------------------+
Desired output:

+--------------------+
|carLineName         |
+--------------------+
|Avalon;#CRV;#Camry  |
|Model T;#Model S    |
|Cayenne;#Mustang    |
|Pilot;#Jeep         |
+--------------------+

Current code which produces the input above:

val newCarDf = carDf.select(col("carLineName").cast("String").as("carLineName"))

Upvotes: 0

Views: 3371

Answers (4)

SanBan

Reputation: 655

Since you are using 1.6, we can do a simple map from Row to WrappedArray.

Here is how it goes.

Input :

scala> val carLineDf = Seq( (Array("Avalon","CRV","Camry")),
     |                      (Array("Model T", "Model S")),
     |                      (Array("Cayenne", "Mustang")),
     |                      (Array("Pilot", "Jeep"))
     |                    ).toDF("carLineName")
carLineDf: org.apache.spark.sql.DataFrame = [carLineName: array<string>]

Schema:

scala> carLineDf.printSchema
root
 |-- carLineName: array (nullable = true)
 |    |-- element: string (containsNull = true)

Then we just use Row.getAs to get a WrappedArray of String instead of a Row object, and we can manipulate it with the usual Scala built-ins:

scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> carLineDf.map(row => row.getAs[WrappedArray[String]](0)).map(a => a.mkString(";#")).toDF("carLineNameAsString").show(false)
+-------------------+
|carLineNameAsString|
+-------------------+
|Avalon;#CRV;#Camry |
|Model T;#Model S   |
|Cayenne;#Mustang   |
|Pilot;#Jeep        |
+-------------------+

// An even easier alternative
carLineDf.map(row => row.getAs[WrappedArray[String]](0)).map(r => r.reduce(_ + ";#" + _)).show(false)

That's it. On 1.6 you might have to go through dataframe.rdd first, since DataFrame.map there returns an RDD; otherwise this should do.
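To make that 1.6 caveat concrete, here is a minimal sketch of the explicit .rdd route (assuming a SQLContext named sqlContext is in scope, as in the 1.6 shell; joinedDf is just an illustrative name):

import scala.collection.mutable.WrappedArray
import sqlContext.implicits._  // assumes a SQLContext named sqlContext, as in the 1.6 shell

val joinedDf = carLineDf.rdd
  .map(row => Tuple1(row.getAs[WrappedArray[String]](0).mkString(";#")))  // Tuple1 so toDF's Product implicit applies
  .toDF("carLineName")

joinedDf.show(false)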

Upvotes: 0

David Vrba

Reputation: 3344

You can use the native function array_join (available since Spark 2.4):

import org.apache.spark.sql.functions.array_join

val l = Seq(Seq("Avalon","CRV","Camry"), Seq("Model T", "Model S"), Seq("Cayenne", "Mustang"), Seq("Pilot", "Jeep"))
val df = l.toDF("carLineName")

df.withColumn("str", array_join($"carLineName", ";#")).show()

+--------------------+------------------+
|         carLineName|               str|
+--------------------+------------------+
|[Avalon, CRV, Camry]|Avalon;#CRV;#Camry|
|  [Model T, Model S]|  Model T;#Model S|
|  [Cayenne, Mustang]|  Cayenne;#Mustang|
|       [Pilot, Jeep]|       Pilot;#Jeep|
+--------------------+------------------+
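Since the question targets Spark 1.6, where array_join does not exist, one possible fallback is concat_ws, which (following Hive semantics) also accepts array<string> columns and predates array_join. A sketch, worth verifying on your exact version:

import org.apache.spark.sql.functions.concat_ws

// concat_ws joins the elements of an array<string> column with the separator
df.withColumn("str", concat_ws(";#", $"carLineName")).show()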

Upvotes: 4

firsni

Reputation: 916

You can create a user-defined function that concatenates the elements with the ";#" separator, as in the following example:

val df1  = Seq(
  ("1", Array("t1", "t2")),
  ("2", Array("t1", "t3", "t5"))
).toDF("id", "arr")

import org.apache.spark.sql.functions.{col, udf}
def formatString: Seq[String] => String = x => x.reduce(_ + ";#" + _)
def udfFormat = udf(formatString)
df1.withColumn("formatedColumn", udfFormat(col("arr"))).show()


+---+------------+--------------+
| id|         arr|formatedColumn|
+---+------------+--------------+
|  1|    [t1, t2]|        t1;#t2|
|  2|[t1, t3, t5]|    t1;#t3;#t5|
+---+------------+--------------+
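One caveat: reduce throws an UnsupportedOperationException on an empty array. A sketch of a safer variant built on mkString instead (safeFormat is just an illustrative name):

import org.apache.spark.sql.functions.{col, udf}

// mkString returns "" for an empty array, and the Option guards against null values
val safeFormat = udf((xs: Seq[String]) => Option(xs).map(_.mkString(";#")).getOrElse(""))
df1.withColumn("formatedColumn", safeFormat(col("arr"))).show()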

Upvotes: 3

hagarwal

Reputation: 1163

You could simply write a user-defined function (udf) which takes an Array of String as its input parameter. Inside the udf, any operation can be performed on the array.

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def toCustomString: UserDefinedFunction = udf((carLineName: Seq[String]) => {
  carLineName.mkString(";#")
})
val newCarDf = df.withColumn("carLineName", toCustomString(df.col("carLineName")))

This udf can be made more generic by passing the delimiter as a second parameter.

import org.apache.spark.sql.functions.lit
def toCustomStringWithDelimiter: UserDefinedFunction = udf((carLineName: Seq[String], delimiter: String) => {
  carLineName.mkString(delimiter)
})
val newCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(";#")))
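For example, the same udf can then be reused with any separator (pipeCarDf is just an illustrative name):

// reuse with a different delimiter, passed as a literal column
val pipeCarDf = df.withColumn("carLineName", toCustomStringWithDelimiter(df.col("carLineName"), lit(" | ")))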

Upvotes: 0
