Spark Version 2.1
I am reading a JSON file into a Spark DataFrame with the following format:
{
  "field1": "value1",
  "field2": "value2",
  "elements": [
    { "id": "1", "name": "a" },
    { "id": "2", "name": "b" },
    { "id": "3", "name": "c" }
  ]
}
It contains an array of nested elements. Now I want to explode the elements array to get a flat JSON structure. I am using the code below:
var dfExploded = df
  .withColumn("id", explode(df.col("elements.id")))
  .withColumn("name", explode(df.col("elements.name")))
It seems to be returning a Cartesian product (e.g. I am getting 9 rows in the result, whereas I only want 3). Is there any way I can specify a pair of nested columns to the explode function?
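For reference, a sketch of the output I get: each explode multiplies the row count, so the two calls pair every id with every name (3 x 3 = 9 rows):
// sketch: every id ends up paired with every name
dfExploded.select("id", "name").show()
>> +---+----+
>> | id|name|
>> +---+----+
>> |  1|   a|
>> |  1|   b|
>> |  1|   c|
>> |  2|   a|
>> |  2|   b|
>> |  2|   c|
>> |  3|   a|
>> |  3|   b|
>> |  3|   c|
>> +---+----+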
If I understand correctly what you want, you can try this:
import org.apache.spark.sql.functions.explode
import spark.implicits._  // enables the $"..." column syntax

val exploded = df.withColumn("elements", explode($"elements"))
exploded.show()
>> +--------+------+------+
>> |elements|field1|field2|
>> +--------+------+------+
>> | [1,a]|value1|value2|
>> | [2,b]|value1|value2|
>> | [3,c]|value1|value2|
>> +--------+------+------+
This first step turns each element of the array into a separate record. You need to apply explode to the array itself (elements) rather than to the fields inside it; exploding elements.id and elements.name separately is exactly what gives you the Cartesian product.
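For reference, after this step each row holds a single elements struct. A sketch of the schema (assuming the JSON reader inferred id and name as strings):
exploded.printSchema()
>> root
>>  |-- elements: struct (nullable = true)
>>  |    |-- id: string (nullable = true)
>>  |    |-- name: string (nullable = true)
>>  |-- field1: string (nullable = true)
>>  |-- field2: string (nullable = true)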
Next you can flatten the struct:
val flattened = exploded.select("elements.*", "field1", "field2")
flattened.show()
>> +---+----+------+------+
>> | id|name|field1|field2|
>> +---+----+------+------+
>> | 1| a|value1|value2|
>> | 2| b|value1|value2|
>> | 3| c|value1|value2|
>> +---+----+------+------+
Selecting "elements.*" expands the struct's fields (id and name) into separate top-level columns.
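Putting it together, a minimal end-to-end sketch (the path and app name are placeholders; this assumes the input is line-delimited JSON, since Spark 2.1's JSON reader expects one record per line and the multiLine option only arrived in later versions):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("FlattenElements").getOrCreate()  // placeholder app name
import spark.implicits._

// One JSON record per line for Spark 2.1's reader.
val df = spark.read.json("/path/to/input.json")  // placeholder path

// Explode the array, then flatten the resulting struct.
val flattened = df
  .withColumn("elements", explode($"elements"))
  .select("elements.*", "field1", "field2")

flattened.show()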