I have a schema that looks like the following:
|-- contributors: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- type: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- id: string (nullable = true)
I would like to have a dataframe that has the columns key, name, and id. I have used the following code to get name and id, but how do I get the key column?
df.select(explode(col("contributors")))   // map -> columns key, value
  .select(explode(col("value")))          // array -> column col; key is dropped here
  .select(col("col.*"))
Update
I tried to apply the first solution to the following schema, but the compiler does not like it. I would like to get value._name and subgenres.element.value._name.
|-- mainGenre: struct (nullable = true)
| |-- value: struct (nullable = true)
| | |-- _name: string (nullable = true)
| |-- subgenres: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- value: struct (nullable = true)
| | | | |-- type: string (nullable = true)
| | | | |-- _name: string (nullable = true)
| | | |-- name: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
I tried to create a variable with value._name and then insert it into my second variable like this.
val col_mainGenre_name = df_r.select(col("mainGenre.*"))
  .select(col("value.*"))
  .select(col("_name"))
  .drop("readableName")
  .drop("description")

// does not compile: col_mainGenre_name is a DataFrame, not a Column
val df_exploded = df_r.select(col("mainGenre.*"))
  .select(col_mainGenre_name, col("value.*"))
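Based on the first solution, I would expect something along these lines to work instead (just a sketch, untested against my real data; the aliases mainGenre_name and subgenre_name are placeholders of mine):

import org.apache.spark.sql.functions.{col, explode}

val df_genres = df_r
  .select(
    col("mainGenre.value._name").as("mainGenre_name"),   // top-level genre name
    explode(col("mainGenre.subgenres")).as("subgenre"))  // one row per subgenre struct
  .select(col("mainGenre_name"), col("subgenre.value._name").as("subgenre_name"))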
You can add the key column in your second and third select: the select method of DataFrame accepts several columns as arguments. Note that exploding a map produces two columns named key and value, while exploding an array produces a single column named col. You should modify your code as follows:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
  .select(col("key"), explode(col("value")))
  .select(col("key"), col("col.*"))
With the following contributors input column:
+--------------------------------------------------------------------------------------------+
|contributors |
+--------------------------------------------------------------------------------------------+
|{key1 -> [{type11, name11, id11}, {type12, name12, id12}], key2 -> [{type21, name21, id21}]}|
|{key3 -> [{type31, name31, id31}, {type32, name32, id32}], key4 -> []} |
+--------------------------------------------------------------------------------------------+
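This input can be built as follows (a minimal sketch for reproduction; the SparkSession setup and the Contributor case class are assumptions, not part of the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// mirrors the element struct from the question's schema
case class Contributor(`type`: String, name: String, id: String)

val df = Seq(
  Map(
    "key1" -> Seq(Contributor("type11", "name11", "id11"),
                  Contributor("type12", "name12", "id12")),
    "key2" -> Seq(Contributor("type21", "name21", "id21"))),
  Map(
    "key3" -> Seq(Contributor("type31", "name31", "id31"),
                  Contributor("type32", "name32", "id32")),
    "key4" -> Seq.empty[Contributor])
).toDF("contributors")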
You get the following output:
+----+------+------+----+
|key |type |name |id |
+----+------+------+----+
|key1|type11|name11|id11|
|key1|type12|name12|id12|
|key2|type21|name21|id21|
|key3|type31|name31|id31|
|key3|type32|name32|id32|
+----+------+------+----+
If you want to keep only the name and id columns from value, you should also modify the last select to pick only the col.name and col.id columns:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
  .select(col("key"), explode(col("value")))
  .select(col("key"), col("col.name"), col("col.id"))
With the same contributors input column, you get your expected output:
+----+------+----+
|key |name |id |
+----+------+----+
|key1|name11|id11|
|key1|name12|id12|
|key2|name21|id21|
|key3|name31|id31|
|key3|name32|id32|
+----+------+----+
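As a side note, explode drops map entries whose array is empty or null, which is why key4 produces no row above. If you need such keys kept with null values, explode_outer can replace the inner explode (a sketch under the same assumptions):

import org.apache.spark.sql.functions.{col, explode, explode_outer}

df.select(explode(col("contributors")))
  .select(col("key"), explode_outer(col("value")))      // keeps key4 with a null struct
  .select(col("key"), col("col.name"), col("col.id"))   // null struct yields null name/id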