Subhajit Sen
Subhajit Sen

Reputation: 81

How to cast Array[Struct[String,String]] column type in Hive to Array[Map[String,String]]?

I've a column in a Hive table:

Column Name: Filters

Data Type:

 |-- filters: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to get the value from this column by it's corresponding name.

What I did so far:

val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'")

val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect()

Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_]))

Note: Casting to Seq because WrappedArray to Map conversion is not possible.

What to do next?

Upvotes: 1

Views: 5594

Answers (1)

Vitalii Kotliarenko
Vitalii Kotliarenko

Reputation: 2967

I want to get the value from this column by it's corresponding name.

If you want simple and reliable way to get all values by name, you may flatten your table using explode and filter:

case class Data(name: String, value: String)
case class Filters(filters: Array[Data])

val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c")))))
df.show()
+--------------+
|       filters|
+--------------+
|[[a,b], [a,c]]|
|       [[b,c]]|
+--------------+

df.withColumn("filter", explode($"filters"))
  .select($"filter.name" as "name", $"filter.value" as "value")
  .where($"name" === "a")
  .show()
+----+-----+
|name|value|
+----+-----+
|   a|    b|
|   a|    c|
+----+-----+

You can also collect your data any way you want:

val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value")
flatDf.rdd.map(r => Array(r(0), r(1))).collect()
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c))
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not  the best idea, if you have many values per key
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c)))

If you want to cast array[struct] to map[string, string] for future saving to some storage - it's different story, and this case is better solved by UDF. Anyway, you have to avoid collect() as long as it possible to keep your code scalable.

Upvotes: 1

Related Questions