user1870400

Reputation: 6364

How to parse string to array in Spark?

How to flatten Array of Strings into multiple rows of a dataframe in Spark 2.2.0?

Input Row ["foo", "bar"]

val inputDS = Seq("""["foo", "bar"]""").toDF

inputDS.printSchema()

root
 |-- value: string (nullable = true)

Input Dataset inputDS

inputDS.show(false)

+--------------+
|value         |
+--------------+
|["foo", "bar"]|
+--------------+

Expected output dataset outputDS

+-----+
|value|
+-----+
|"foo"|
|"bar"|
+-----+

I tried the explode function as shown below, but it didn't quite work

inputDS.select(explode(from_json(col("value"), ArrayType(StringType))))

and I get the following error

org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`value`)' due to data type mismatch: Input schema string must be a struct or an array of structs

I also tried the following

inputDS.select(explode(col("value")))

and I get the following error

org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`value`)' due to data type mismatch: input to function explode should be array or map type, not StringType

Upvotes: 3

Views: 8694

Answers (3)

Vignesh I

Reputation: 2221

You can achieve this simply using flatMap.

val input = sc.parallelize(Array("foo", "bar")).toDS()
val out = input.flatMap(x => x.split(","))
out.collect().foreach(println)
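
Note that this snippet starts from a collection that is already split into separate strings, so it does not parse the question's single JSON-array string. A minimal sketch adapting the same flatMap idea to the question's inputDS (redefined here so it is self-contained) follows; the bracket stripping and the comma split are assumptions that only hold for flat, one-level arrays with no embedded commas:

import spark.implicits._

// Assumed input: one JSON-array string per row, e.g. ["foo", "bar"]
val inputDS = Seq("""["foo", "bar"]""").toDS()

// Strip the surrounding brackets, then split on commas plus optional whitespace
val outputDS = inputDS.flatMap(_.stripPrefix("[").stripSuffix("]").split(",\\s*"))

outputDS.show(false)
// +-----+
// |value|
// +-----+
// |"foo"|
// |"bar"|
// +-----+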

Upvotes: -2

Greg

Reputation: 11

The issue above was fixed in Spark 2.4.0 (https://jira.apache.org/jira/browse/SPARK-24391), so you can use from_json($"column_nm", ArrayType(StringType)) without any problems.
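
For example, on Spark 2.4.0+ the question's original one-liner works as written. A minimal sketch, reusing the question's inputDS; note that from_json actually parses the JSON, so the output values lose the literal quotes shown in the question's expected output:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Parse the JSON-array string into array<string>, then flatten it into rows
inputDS
  .select(explode(from_json(col("value"), ArrayType(StringType))).as("value"))
  .show(false)
// +-----+
// |value|
// +-----+
// |foo  |
// |bar  |
// +-----+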

Upvotes: 0

Alper t. Turker

Reputation: 35249

The exception is thrown by:

from_json(col("value"), ArrayType(StringType))

not by explode. Specifically:

Input schema array must be a struct or an array of structs.

You can:

inputDS.selectExpr(
  "split(substring(value, 2, length(value) - 2), ',\\s+') as value")

and explode the output.
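
Putting both steps together on Spark 2.2.0, a minimal sketch using the DataFrame API's split and explode instead of selectExpr (written this way, the regex needs no SQL-level escaping; the expr call is only there because functions.substring takes a fixed-length argument):

import org.apache.spark.sql.functions.{explode, expr, split}

// Drop the leading "[" and the trailing "]"...
val trimmed = expr("substring(value, 2, length(value) - 2)")

// ...then split on commas and flatten the resulting array into rows
val outputDS = inputDS.select(explode(split(trimmed, ",\\s+")).as("value"))

outputDS.show(false)
// +-----+
// |value|
// +-----+
// |"foo"|
// |"bar"|
// +-----+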

Upvotes: 7
