Darshan Mehta

Reputation: 30849

Spark : Explode a pair of nested columns

Spark Version 2.1

I am reading a file into a Spark DataFrame with the following format:

{
    "field1": "value1",
    "field2": "value2",
    "elements": [{
        "id": "1",
        "name": "a"
    },
    {
        "id": "2",
        "name": "b"
    },
    {
        "id": "3",
        "name": "c"
    }]
}

It contains an array of nested elements. I now want to explode the elements array to get a flat JSON structure. I am using the code below:

var dfExploded = df
    .withColumn("id", 
      explode(df.col("elements.id")))
    .withColumn("name",
      explode(df.col("elements.name")));

It seems to be returning a Cartesian product (e.g. I am getting 9 elements in the result, whereas I only want 3). Is there any way I can specify a pair of nested columns to the explode function?

Upvotes: 1

Views: 5137

Answers (1)

Assaf Mendelson

Reputation: 13001

If I understand correctly, you can try this:

 val exploded = df.withColumn("elements", explode($"elements"))
 exploded.show()
 >> +--------+------+------+
 >> |elements|field1|field2|
 >> +--------+------+------+
 >> |   [1,a]|value1|value2|
 >> |   [2,b]|value1|value2|
 >> |   [3,c]|value1|value2|
 >> +--------+------+------+

This first step turns the array into separate records: you need to explode the array itself (elements) rather than the fields inside it. Exploding each nested field separately, as in the question, produces one explode per column and therefore the Cartesian product.

Next, you can do

 val flattened = exploded.select("elements.*", "field1", "field2")
 flattened.show()
 >> +---+----+------+------+
 >> | id|name|field1|field2|
 >> +---+----+------+------+
 >> |  1|   a|value1|value2|
 >> |  2|   b|value1|value2|
 >> |  3|   c|value1|value2|
 >> +---+----+------+------+

to flatten the elements: selecting "elements.*" expands the struct into top-level id and name columns.
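For completeness, the two steps can be chained into one self-contained program. This is a sketch assuming a local Spark session; the object name ExplodeExample and the inline copy of the question's JSON are my own additions, and the RDD[String] overload of read.json is used because it works on Spark 2.1:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object ExplodeExample {
  // Local session for the sketch; in the question's setting one already exists.
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("explode-example")
    .getOrCreate()
  import spark.implicits._

  // Inline copy of the JSON document from the question (one record per line).
  private val json = spark.sparkContext.parallelize(Seq(
    """{"field1": "value1", "field2": "value2", "elements": [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}, {"id": "3", "name": "c"}]}"""
  ))
  val df = spark.read.json(json)

  // Step 1: explode the array itself so each element becomes a row.
  // Step 2: expand the resulting struct into top-level columns.
  val flattened = df
    .withColumn("elements", explode($"elements"))
    .select("elements.*", "field1", "field2")

  def main(args: Array[String]): Unit = {
    flattened.show()
    spark.stop()
  }
}
```

Chaining the explode and the select keeps a single lineage, so there is no intermediate variable to manage; the output is the same three-row table shown above.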

Upvotes: 5
