user1124702

Reputation: 1135

Writing out spark dataframe as nested JSON doc

I have a Spark dataframe like this:

A      B       val_of_B    val1   val2  val3   val4
"c1"  "MCC"     "cd1"      1      2     1.1    1.05
"c1"  "MCC"     "cd2"      2      3     1.1    1.05
"c1"  "MCC"     "cd3"      3      4     1.1    1.05

val1 and val2 are obtained by grouping on A, B and val_of_B, whereas val3 and val4 are A-level information only (for example, the distinct (A, val3) pairs are just ("c1", 1.1)).

I would like to write this out as nested JSON. For each A, the JSON should look like:

{"val3": 1.1, "val4": 1.05, "MCC":[["cd1",1,2], ["cd2",2,3], ["cd3",3,4]]}

Is it possible to accomplish this with the existing tools in the Spark API? If not, can you provide guidelines?

Upvotes: 3

Views: 3535

Answers (1)

Ramesh Maharjan

Reputation: 41957

You should groupBy on column A and aggregate the necessary columns using the built-in first, collect_list and array functions:

import org.apache.spark.sql.functions._

// Prepend each val_of_B entry to its matching [val1, val2] pair.
val zipping = udf((arr1: Seq[String], arr2: Seq[Seq[String]]) =>
  arr1.indices.map(index => Array(arr1(index)) ++ arr2(index)))

// One row per A: first() picks the A-level values, collect_list gathers the per-row ones.
val jsonDF = df.groupBy("A")
  .agg(first(col("val3")).as("val3"), first(col("val4")).as("val4"), first(col("B")).as("B"),
    collect_list("val_of_B").as("val_of_B"), collect_list(array("val1", "val2")).as("list"))
  .select(col("val3"), col("val4"), col("B"), zipping(col("val_of_B"), col("list")).as("list"))
  .toJSON

which should give you

+-----------------------------------------------------------------------------------------------+
|value                                                                                          |
+-----------------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","B":"MCC","list":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+-----------------------------------------------------------------------------------------------+
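To make the zipping step easier to follow outside Spark, here is a minimal plain-Scala sketch of the same idea. The names `ZipSketch` and `zipRows` are hypothetical and not part of the answer's code; note that `zip` silently truncates to the shorter input, whereas the index-based version above would throw if the lists differed in length.

```scala
object ZipSketch {
  // Prepend each key to its matching list of values,
  // e.g. "cd1" and Seq("1", "2") become Seq("cd1", "1", "2").
  def zipRows(keys: Seq[String], values: Seq[Seq[String]]): Seq[Seq[String]] =
    keys.zip(values).map { case (key, vals) => key +: vals }
}
```

Called with the collected val_of_B values and the collected [val1, val2] pairs for one group, this yields the nested list shown in the output above.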

The next step is to replace the key "list" with the value of B, using a udf function:

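// Relies on the exact field order produced above (val3, val4, B, list);
// the naive comma split breaks if val3, val4 or B contains a comma.
val exchangeName = udf((json: String) => {
  val splitted = json.split(",")
  val name = splitted(2).split(":")(1).trim   // value of B, still quoted, e.g. "MCC"
  val value = splitted(3).split(":")(1).trim  // start of the list, e.g. [["cd1"
  splitted(0).trim + "," + splitted(1).trim + "," + name + ":" + value + "," +
    (4 until splitted.size).map(splitted(_)).mkString(",")
})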

jsonDF.select(exchangeName(col("value")).as("json"))
  .show(false)

which should give you your desired output

+------------------------------------------------------------------------------------+
|json                                                                                |
+------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","MCC":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+------------------------------------------------------------------------------------+
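Since the comma-splitting approach depends on the position of each field, a regex-based rename may be a little less fragile. The following is only a sketch: `RenameKey` and `rename` are hypothetical names, and it assumes the JSON string contains exactly the `"B":"...","list":` sequence shown in the intermediate output, with no double quotes inside the value of B.

```scala
object RenameKey {
  // Matches "B":"<name>","list": and captures <name>.
  private val pattern = "\"B\":\"([^\"]*)\",\"list\":".r

  // Replace the B field and the "list" key with a single "<name>": key.
  // Strings that do not match the pattern are returned unchanged.
  def rename(json: String): String =
    pattern.replaceFirstIn(json, "\"$1\":")
}
```

In Spark this could presumably be wrapped as `udf(RenameKey.rename _)` and applied to the value column in the same way as exchangeName.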

Upvotes: 5
