Reputation: 1135
I have a spark dataframe as:
A B val_of_B val1 val2 val3 val4
"c1" "MCC" "cd1" 1 2 1.1 1.05
"c1" "MCC" "cd2" 2 3 1.1 1.05
"c1" "MCC" "cd3" 3 4 1.1 1.05
val1 and val2 are obtained with group by of A, B and val_of_B where as val3, val4 is A level information only (for example, distinct of A, val3 is only "c1",1.1)
I would like to write this out as nested JSON, which should look like:
For each A, JSON format should look like
{"val3": 1.1, "val4": 1.05, "MCC":[["cd1",1,2], ["cd2",2,3], ["cd3",3,4]]}
Is it possible to accomplish this with existing tools under spark api? If not, can you provide guidelines?
Upvotes: 3
Views: 3535
Reputation: 41957
You should groupBy
on column A and aggregate
necessary columns using first
and collect_list
and array
inbuilt functions
import org.apache.spark.sql.functions._
def zipping = udf((arr1: Seq[String], arr2: Seq[Seq[String]])=> arr1.indices.map(index => Array(arr1(index))++arr2(index)))
val jsonDF = df.groupBy("A")
.agg(first(col("val3")).as("val3"), first(col("val4")).as("val4"), first(col("B")).as("B"), collect_list("val_of_B").as("val_of_B"), collect_list(array("val1", "val2")).as("list"))
.select(col("val3"), col("val4"), col("B"), zipping(col("val_of_B"), col("list")).as("list"))
.toJSON
which should give you
+-----------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","B":"MCC","list":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+-----------------------------------------------------------------------------------------------+
Next is to exchange the list
name to value of B
using a udf
function as
def exchangeName = udf((json: String)=> {
val splitted = json.split(",")
val name = splitted(2).split(":")(1).trim
val value = splitted(3).split(":")(1).trim
splitted(0).trim+","+splitted(1).trim+","+name+":"+value+","+(4 until splitted.size).map(splitted(_)).mkString(",")
})
jsonDF.select(exchangeName(col("value")).as("json"))
.show(false)
which should give you your desired output
+------------------------------------------------------------------------------------+
|json |
+------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","MCC":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+------------------------------------------------------------------------------------+
Upvotes: 5