Lenny D.

Reputation: 298

Unable to groupBy MapType column within Spark DataFrame

My current issue is the following:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'mapField' cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.;;

What I'm trying to achieve is basically to group entries within a DataFrame by a given set of columns, but this fails when grouping by MapType columns such as the one mentioned above.

  .groupBy(
    ...
    "mapField",
    ...
  )
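For reference, here is a minimal reproduction (toy column names assumed) that raises the same exception:

import spark.implicits._ // assuming a SparkSession named spark, as in spark-shell

val df = Seq(
  ("sample", Map("a" -> "1", "b" -> "2"))
).toDF("id", "mapField")

df.groupBy("mapField").count() // throws the AnalysisException above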

I've got a couple of ideas, but there must be a far easier solution to this problem than the ones I've thought of...

EDIT

Example input

   id    |  myMap
'sample' |  Map('a' -> 1, 'b' -> 2, 'c' -> 3)

Desired output

   id    |  a  |  b  |  c
'sample' |  1  |  2  |  3

Upvotes: 5

Views: 3764

Answers (1)

abiratsis

Reputation: 7316

As the error suggests, map<string,string> is not an orderable data type, so you need to represent the map with an orderable type. One such type is array, so we can use map_values and map_keys to extract the map's data into two separate fields, as shown below:

import org.apache.spark.sql.functions.{map_keys, map_values}
import spark.implicits._ // for toDF and the $"..." column syntax

val df = Seq(
    (Map("k1" -> "v1"), 12),
    (Map("k2" -> "v2"), 11),
    (null, 10)
).toDF("map", "id")

df.select(map_values($"map").as("map_values")).show

// +----------+
// |map_values|
// +----------+
// |      [v1]|
// |      [v2]|
// |      null|
// +----------+

df.select(map_keys($"map").as("map_keys")).show

// +--------+
// |map_keys|
// +--------+
// |    [k1]|
// |    [k2]|
// |    null|
// +--------+

Then materialize those fields and group on them directly:

df.withColumn("map_keys", map_keys($"map"))
  .groupBy("map_keys")
  .count()
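You can also pass the expression straight to groupBy without adding a column first (a small variation, not from the original answer):

df.groupBy(map_keys($"map")).count()
// Spark auto-names the grouping column after the expression (e.g. map_keys(map));
// here this yields one row per distinct key array ([k1], [k2], null), each with count 1.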

And a generic modular solution in order to use it multiple times:

import org.apache.spark.sql.DataFrame

// Replaces a map column with two orderable array columns:
// <mapField>_keys and <mapField>_values.
def unwrapMap(mapField: String): DataFrame => DataFrame = { df =>
  df.withColumn(s"${mapField}_keys", map_keys(df(mapField)))
    .withColumn(s"${mapField}_values", map_values(df(mapField)))
    .drop(df(mapField))
}

Usage, chaining straight into the aggregation (using the example df above):

df.transform(unwrapMap("map")).groupBy("map_keys").count()
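If you also want the exact wide layout from the question (one column per map key), here is a minimal sketch assuming the key set is known in advance; inputDf, keys, and the column names are taken from the question's example:

import org.apache.spark.sql.functions.col

val inputDf = Seq(("sample", Map("a" -> 1, "b" -> 2, "c" -> 3))).toDF("id", "myMap")

val keys = Seq("a", "b", "c") // assumption: keys known ahead of time
inputDf
  .select(col("id") +: keys.map(k => col("myMap").getItem(k).as(k)): _*)
  .show()
// +------+---+---+---+
// |    id|  a|  b|  c|
// +------+---+---+---+
// |sample|  1|  2|  3|
// +------+---+---+---+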

Upvotes: 1
