user3407267

Reputation: 1634

How to convert a String column to Array[String] in Scala/Spark?

I have a DataFrame:

+--------------------------------------+------------------------------------------------------------+
|item                                  |item_codes                                                  |
+--------------------------------------+------------------------------------------------------------+
|loose fit long sleeve swim shirt women|["2237741011","1046622","1040660","7147440011","7141123011"]|
+--------------------------------------+------------------------------------------------------------+

The schema looks like this:

root
 |-- item: string (nullable = true)
 |-- item_codes: string (nullable = true)

How can I convert the item_codes column from String to Array[String] in Scala?

Upvotes: 0

Views: 321

Answers (2)

Paul

Reputation: 1174

You can use the split function after some preprocessing:

import org.apache.spark.sql.functions.{expr, split}

val col_names = Seq("item", "item_codes")

val data = Seq(("loose fit long sleeve swim shirt women", """["2237741011","1046622","1040660","7147440011","7141123011"]"""))

val df = spark.createDataFrame(data).toDF(col_names: _*)

// Chop off the first 2 and last 2 characters (the leading [" and trailing "]), then split at ","
df.withColumn("item_codes", split(expr("substring(item_codes, 3, length(item_codes)-4)"), """","""")).printSchema

If your format can change, a regular expression is more flexible, as Leo suggests: strip everything that is not a digit or a comma, then split at the commas.
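A minimal sketch of that regex variant, assuming the codes are purely numeric (any letters inside a code would be stripped too). The local SparkSession and the name `sample` are only there to make the snippet self-contained; in spark-shell, `spark` already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("item-codes-regex")
  .getOrCreate()
import spark.implicits._

val sample = Seq(
  ("loose fit long sleeve swim shirt women",
   """["2237741011","1046622","1040660","7147440011","7141123011"]""")
).toDF("item", "item_codes")

// Remove every character that is neither a digit nor a comma,
// then split the remaining string at the commas.
val parsed = sample.withColumn(
  "item_codes",
  split(regexp_replace(col("item_codes"), "[^0-9,]", ""), ",")
)

parsed.printSchema()
```

This tolerates extra whitespace or different quoting around the codes, which the fixed-offset substring approach does not.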

Upvotes: 0

Leo C

Reputation: 22439

You can remove quotes/square brackets using regexp_replace, followed by a split to generate the ArrayType column:

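import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._  // already in scope in spark-shell; needed elsewhere for toDF and $

val df = Seq(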
  ("abc", "[\"2237741011\",\"1046622\",\"1040660\",\"7147440011\",\"7141123011\"]")
).toDF("item", "item_codes")

df.
  withColumn("item_codes", split(regexp_replace($"item_codes", """\[?\"\]?""", ""), "\\,")).
  show(false)
// +----+------------------------------------------------------+
// |item|item_codes                                            |
// +----+------------------------------------------------------+
// |abc |[2237741011, 1046622, 1040660, 7147440011, 7141123011]|
// +----+------------------------------------------------------+
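
Since the sample string happens to be a valid JSON array, a different technique from the regex above may also work: parsing it with from_json. This is only a sketch, assuming Spark 2.4 or later (where from_json accepts an array of primitive types) and assuming every row holds well-formed JSON. The local session and the names `raw` and `parsedJson` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("item-codes-json")
  .getOrCreate()
import spark.implicits._

val raw = Seq(
  ("abc", """["2237741011","1046622","1040660","7147440011","7141123011"]""")
).toDF("item", "item_codes")

// Parse the JSON text directly into an array<string> column.
val parsedJson = raw.withColumn(
  "item_codes",
  from_json(col("item_codes"), ArrayType(StringType))
)

parsedJson.show(false)
```

Unlike the regex, this keeps commas or brackets that appear inside an individual code, since the JSON parser handles the quoting.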

Upvotes: 1
