Reputation: 1634
I have a data frame:
+--------------------------------------+------------------------------------------------------------+
|item |item_codes |
+--------------------------------------+------------------------------------------------------------+
|loose fit long sleeve swim shirt women|["2237741011","1046622","1040660","7147440011","7141123011"]|
+--------------------------------------+------------------------------------------------------------+
And the schema looks like this:
root
|-- item: string (nullable = true)
|-- item_codes: string (nullable = true)
How can I convert the item_codes column from string to Array[String] in Scala?
Upvotes: 0
Views: 321
Reputation: 1174
You can use the split method after doing some "preprocessing":
val col_names = Seq("item", "item_codes")
val data = Seq(("loose fit long sleeve swim shirt women", """["2237741011","1046622","1040660","7147440011","7141123011"]"""))
val df = spark.createDataFrame(data).toDF(col_names: _*)
// chop off the first 2 and last 2 characters and split at ","
df.withColumn("item_codes", split(expr("substring(item_codes, 3, length(item_codes)-4)"), """","""")).printSchema
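To see what that preprocessing does outside of Spark, here is a minimal plain-Scala sketch of the same substring + split logic (the object and method names are just for illustration):

```scala
object SplitDemo {
  // Drop the leading `["` and trailing `"]`, then split at the `","` separators.
  // This mirrors substring(item_codes, 3, length(item_codes)-4) followed by split.
  def parseCodes(raw: String): Array[String] =
    raw.substring(2, raw.length - 2).split("\",\"")

  def main(args: Array[String]): Unit = {
    val raw = """["2237741011","1046622","1040660","7147440011","7141123011"]"""
    println(parseCodes(raw).mkString(", "))
  }
}
```

Note that Spark's substring is 1-indexed (start at 3), while plain Scala's String.substring is 0-indexed (start at 2); both drop the same two leading and two trailing characters.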
If your format can vary, a more flexible approach is a regexp, as leo suggests: strip everything that is not a digit or a comma, then split at ,.
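That regexp idea can be sketched in plain Scala as well (again, the names here are just for illustration):

```scala
object RegexDemo {
  // Remove every character that is not a digit or a comma,
  // then split the remaining digit runs at the commas.
  def parseCodes(raw: String): Array[String] =
    raw.replaceAll("[^\\d,]", "").split(",")

  def main(args: Array[String]): Unit = {
    val raw = """["2237741011","1046622","1040660"]"""
    println(parseCodes(raw).mkString(", "))
  }
}
```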
Upvotes: 0
Reputation: 22439
You can remove the quotes/square brackets using regexp_replace, followed by a split to generate the ArrayType column:
val df = Seq(
("abc", "[\"2237741011\",\"1046622\",\"1040660\",\"7147440011\",\"7141123011\"]")
).toDF("item", "item_codes")
df.
withColumn("item_codes", split(regexp_replace($"item_codes", """\[?\"\]?""", ""), "\\,")).
show(false)
// +----+------------------------------------------------------+
// |item|item_codes |
// +----+------------------------------------------------------+
// |abc |[2237741011, 1046622, 1040660, 7147440011, 7141123011]|
// +----+------------------------------------------------------+
Upvotes: 1