user3776800

Reputation: 57

Scala Spark: extract columns from a schema

I have a schema that looks like the following:

 |-- contributors: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: array (valueContainsNull = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)

I would like to have a dataframe that has the columns key, name and id.

I used the following code to get name and id, but how do I get the key column?

import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("contributors")))
  .select(explode(col("value")))
  .select(col("col.*"))

Update

I tried to apply the first solution to the following schema, but the compiler does not like it. I would like to get value._name and subgenres.element.value._name.

         |-- mainGenre: struct (nullable = true)
         |    |-- value: struct (nullable = true)
         |    |    |-- _name: string (nullable = true)
         |    |-- subgenres: array (nullable = true)
         |    |    |-- element: struct (containsNull = true)
         |    |    |    |-- value: struct (nullable = true)
         |    |    |    |    |-- type: string (nullable = true)
         |    |    |    |    |-- _name: string (nullable = true)
         |    |    |    |-- name: map (nullable = true)
         |    |    |    |    |-- key: string
         |    |    |    |    |-- value: string (valueContainsNull = true)

I tried to create a variable holding value._name and then insert it into my second variable, like this:

val col_mainGenre_name = df_r.select(col("mainGenre.*"))
                             .select(col("value.*"))
                             .select(col("_name"))
                             .drop("readableName")
                             .drop("description")

// does not compile: col_mainGenre_name is a DataFrame, not a Column
val df_exploded = df_r.select(col("mainGenre.*"))
                      .select(col_mainGenre_name, col("value.*"))
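The second select cannot compile because col_mainGenre_name is a DataFrame, not a Column, and select only accepts columns. A minimal sketch of one way to get both _name fields directly, assuming the schema above (the alias names are only illustrative):

import org.apache.spark.sql.functions.{col, explode_outer}

val names = df_r
  .select(
    col("mainGenre.value._name").as("mainGenre_name"),
    // explode_outer keeps rows whose subgenres array is empty or null
    explode_outer(col("mainGenre.subgenres")).as("subgenre")
  )
  .select(col("mainGenre_name"), col("subgenre.value._name").as("subgenre_name"))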

Upvotes: 1

Views: 406

Answers (1)

Vincent Doba

Reputation: 5078

You can add the key column in your second and third select: the select method of a dataframe accepts several columns as arguments.

You should modify your code as follows:

import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("contributors")))
  .select(col("key"), explode(col("value")))
  .select(col("key"), col("col.*"))

With the following contributors input column:

+--------------------------------------------------------------------------------------------+
|contributors                                                                                |
+--------------------------------------------------------------------------------------------+
|{key1 -> [{type11, name11, id11}, {type12, name12, id12}], key2 -> [{type21, name21, id21}]}|
|{key3 -> [{type31, name31, id31}, {type32, name32, id32}], key4 -> []}                      |
+--------------------------------------------------------------------------------------------+
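For reference, a minimal sketch that builds this sample input, assuming a spark-shell session where spark and its implicits are available (the Contributor case class is only for illustration):

import spark.implicits._

case class Contributor(`type`: String, name: String, id: String)

val df = Seq(
  Map(
    "key1" -> Seq(Contributor("type11", "name11", "id11"), Contributor("type12", "name12", "id12")),
    "key2" -> Seq(Contributor("type21", "name21", "id21"))
  ),
  Map(
    "key3" -> Seq(Contributor("type31", "name31", "id31"), Contributor("type32", "name32", "id32")),
    "key4" -> Seq.empty[Contributor]
  )
).toDF("contributors")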

You get the following output:

+----+------+------+----+
|key |type  |name  |id  |
+----+------+------+----+
|key1|type11|name11|id11|
|key1|type12|name12|id12|
|key2|type21|name21|id21|
|key3|type31|name31|id31|
|key3|type32|name32|id32|
+----+------+------+----+
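Note that key4 does not appear in the output: explode drops rows whose array is empty. If you need to keep those keys, a sketch using explode_outer instead, which yields a row with null struct fields for key4:

import org.apache.spark.sql.functions.{col, explode, explode_outer}

df.select(explode(col("contributors")))            // map -> key and value columns
  .select(col("key"), explode_outer(col("value"))) // keep keys whose array is empty
  .select(col("key"), col("col.*"))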

If you want to keep only the name and id columns from value, you should also modify the last select to pick only the col.name and col.id columns:

import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("contributors")))
  .select(col("key"), explode(col("value")))
  .select(col("key"), col("col.name"), col("col.id"))

With the same contributors input column, you get the expected output:

+----+------+----+
|key |name  |id  |
+----+------+----+
|key1|name11|id11|
|key1|name12|id12|
|key2|name21|id21|
|key3|name31|id31|
|key3|name32|id32|
+----+------+----+

Upvotes: 2
