Reputation: 35
I am currently using Apache Spark 2.1.1 to process an XML file into a CSV. My goal is to flatten the XML, but the problem I am facing is unbounded occurrences of elements. Spark automatically infers these unbounded occurrences as an array. What I want to do now is explode an array column.
Sample Schema
|-- Instrument_XREF_Identifier: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- @bsid: string (nullable = true)
| | |-- @exch_code: string (nullable = true)
| | |-- @id_bb_sec_num: string (nullable = true)
| | |-- @market_sector: string (nullable = true)
I know I can explode the array with this method:
result = result.withColumn(p.name, explode(col(p.name)))
which will produce multiple rows, one per array value, each containing the struct. But the output I want is the array exploded into multiple columns instead of rows.
Here is my expected output according to the schema I mentioned above:
Let's say there are two struct values in the array.
bsid1  exch_code1  id_bb_sec_num1  market_sector1  bsid2  exch_code2  id_bb_sec_num2  market_sector2
123    3           1               13              234    12          212             221
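For contrast, the row-wise explode above gives me one row per struct instead (values illustrative, taken from the table above):

@bsid  @exch_code  @id_bb_sec_num  @market_sector
123    3           1               13
234    12          212             221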
Upvotes: 2
Views: 2620
Reputation: 27373
Suppose Instrument_XREF_Identifier is a column of type array<struct<..>>; then you have to do it in two steps:
result
  .withColumn("tmp", explode(col("Instrument_XREF_Identifier")))
  .select("tmp.*")
This will give you a column for each field of the struct.
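Here is a minimal, self-contained PySpark sketch of those two steps; the sample rows are invented to match the schema in the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Schema mirroring the question; the data below is made up for illustration.
element = StructType([
    StructField("@bsid", StringType()),
    StructField("@exch_code", StringType()),
    StructField("@id_bb_sec_num", StringType()),
    StructField("@market_sector", StringType()),
])
schema = StructType([StructField("Instrument_XREF_Identifier", ArrayType(element))])

result = spark.createDataFrame(
    [([("123", "3", "1", "13"), ("234", "12", "212", "221")],)], schema)

# Step 1: one row per array element; step 2: promote the struct fields to columns.
exploded = (result
    .withColumn("tmp", explode(col("Instrument_XREF_Identifier")))
    .select("tmp.*"))
exploded.show()
# +-----+----------+--------------+--------------+
# |@bsid|@exch_code|@id_bb_sec_num|@market_sector|
# +-----+----------+--------------+--------------+
# |  123|         3|             1|            13|
# |  234|        12|           212|           221|
# +-----+----------+--------------+--------------+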
There does not seem to be a way to do it in one select/withColumn statement; see Explode array of structs to columns in Spark.
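If you do want the single-row, wide layout from the question, one workaround along the lines of that linked post is to index into the array instead of exploding. This is a sketch, not part of the answer above, and it assumes the array length is known and fixed (here, two elements); it reuses the result DataFrame from the previous snippet:

from pyspark.sql.functions import col

fields = ["@bsid", "@exch_code", "@id_bb_sec_num", "@market_sector"]
n = 2  # assumed, fixed number of array elements

# Builds columns named bsid1, exch_code1, ..., market_sector2 by indexing
# each array element and pulling out its struct fields.
wide = result.select(
    [col("Instrument_XREF_Identifier").getItem(i).getField(f)
         .alias(f.lstrip("@") + str(i + 1))
     for i in range(n) for f in fields])
wide.show()

Rows whose array is shorter than n simply get nulls for the missing indices, since getItem returns null past the end of the array.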
Upvotes: 2