Reputation: 6364
How to flatten Array of Strings into multiple rows of a dataframe in Spark 2.2.0?
Input Row ["foo", "bar"]
val inputDS = Seq("""["foo", "bar"]""").toDF
inputDS.printSchema()
root
|-- value: string (nullable = true)
Input Dataset inputDS
inputDS.show(false)
+--------------+
|value         |
+--------------+
|["foo", "bar"]|
+--------------+
Expected output dataset outputDS
+-----+
|value|
+-----+
|"foo"|
|"bar"|
+-----+
I tried the explode function like below, but it didn't quite work:
inputDS.select(explode(from_json(col("value"), ArrayType(StringType))))
and I get the following error
org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`value`)' due to data type mismatch: Input schema array<string> must be a struct or an array of structs
I also tried the following:
inputDS.select(explode(col("value")))
And I get the following error
org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`value`)' due to data type mismatch: input to function explode should be array or map type, not StringType
Upvotes: 3
Views: 8694
Reputation: 2221
You can achieve this simply using flatMap:
// Build a Dataset[String] from the elements, then split each one on commas
val input = sc.parallelize(Array("foo", "bar")).toDS()
val out = input.flatMap(x => x.split(","))
out.collect.foreach(println)
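For reference, a minimal sketch of the same flatMap idea applied to the question's inputDS, assuming the bracketed-string layout shown above (the bracket stripping and the outputDS name are my additions, not part of this answer):

import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val inputDS = Seq("""["foo", "bar"]""").toDS()
// Drop the surrounding brackets, then split on the comma separators
val outputDS = inputDS.flatMap(_.stripPrefix("[").stripSuffix("]").split(",\\s*"))
outputDS.show(false)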
Upvotes: -2
Reputation: 11
The issue above should be fixed in Spark 2.4.0 (https://jira.apache.org/jira/browse/SPARK-24391), so you can use from_json($"column_nm", ArrayType(StringType)) without any problems.
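For concreteness, a short sketch of how that looks on Spark 2.4.0+, reusing the question's inputDS (the value alias and the outputDS name are mine):

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

val outputDS = inputDS.select(
  explode(from_json(col("value"), ArrayType(StringType))).as("value"))
outputDS.show(false)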
Upvotes: 0
Reputation: 35249
The exception is thrown by from_json(col("value"), ArrayType(StringType)), not by explode; specifically:

Input schema array<string> must be a struct or an array of structs.
You can:
inputDS.selectExpr("split(substring(value, 2, length(value) - 2), ',\\s+') as value")

and explode the output.
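Put together, a minimal end-to-end sketch on the question's inputDS (I use a triple-quoted Scala string so the ',\\s+' pattern reaches Spark SQL intact; the value alias and the outputDS name are mine):

import org.apache.spark.sql.functions.{col, explode}

val outputDS = inputDS
  .selectExpr("""split(substring(value, 2, length(value) - 2), ',\\s+') as value""")
  .select(explode(col("value")).as("value"))
outputDS.show(false)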
Upvotes: 7