Reputation: 49
How to parse below Input Json into key and value columns. Any help is appreciated.
Input:
{
"name" : "srini",
"value": {
"1" : "val1",
"2" : "val2",
"3" : "val3"
}
}
Output DataFrame Column:
name key value
-----------------------------
srini 1 val1
srini 2 val2
srini 3 val3
//++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++Input DataFrame :
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json_file |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"file_path":"AAA/BBB.CCC.zip","file_name":"AAA_20200202122754.json","received_time":"2020-03-31","obj_cls":"Monitor","obj_cls_inst":"Monitor","relation_tree":"Source~>HD_Info~>Monitor","s_tag":"ABC1234","Monitor":{"Index":"0","Vendor_Data":"58F5Y","Monitor_Type":"Lenovo Monitor","HnfoID":"650FEC74"}}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
How to convert this above json file in a DataFrame like below :
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|file_path |file_name |received_time |obj_cls |obj_cls_inst |relation_tree |s_tag |attribute_name |attribute_value |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |Index |0 |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |Vendor_Data |58F5Y |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |Monitor_Type |Lenovo Monitor |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |HnfoID |650FEC74 |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
//**********************************************
val rawData = sparkSession.sql("select 1").withColumn("obj_cls", lit("First")).withColumn("s_tag", lit("S_12345")).withColumn("jsonString", lit("""{"id":""1,"First":{"Info":"ABCD123","Res":"5.2"}}"""))
Upvotes: 1
Views: 1620
Reputation: 7928
Once you have your json loaded into a DF as follows:
+-----+------------------+
| name| value|
+-----+------------------+
|srini|[val1, val2, val3]|
+-----+------------------+
First you select the whole values items:
df.select($"name", $"value.*")
This will give yo this:
+-----+----+----+----+
| name| 1| 2| 3|
+-----+----+----+----+
|srini|val1|val2|val3|
+-----+----+----+----+
Then you need to pivot the columns to become rows, for this I usually define a helper function kv:
def kv (columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
Then you create an array fo the desired columns:
val pivotCols = Array("1", "2", "3")
And finally apply the function to the previous DF:
df.select($"name", $"value.*")
.withColumn("kv", kv(pivotCols))
.select($"name", $"kv.k" as "key", $"kv.v" as "value")
Result:
+-----+---+-----+
| name|key|value|
+-----+---+-----+
|srini| 1| val1|
|srini| 2| val2|
|srini| 3| val3|
+-----+---+-----+
EDIT
If you don't wanna mannually specify the columns to pivot, you can use an intermediate df as follows:
val dfIntermediate = df.select($"name", $"value.*")
dfIntermediate.withColumn("kv", kv(dfIntermediate.columns.tail))
.select($"name", $"kv.k" as "key", $"kv.v" as "value")
And you will obtain the very same result:
+-----+---+-----+
| name|key|value|
+-----+---+-----+
|srini| 1| val1|
|srini| 2| val2|
|srini| 3| val3|
+-----+---+-----+
EDIT2
With the new example is the same, you just need to change which columns you read/pivot
val pivotColumns = Array("HnfoId", "Index", "Monitor_Type", "Vendor_Data")
df.select("file_path", "file_name", "received_time", "obj_cls", "obj_cls_inst", "relation_tree", "s_Tag", "Monitor.*").withColumn("kv", kv(pivotColumns)).select($"file_path", $"file_name", $"received_time", $"obj_cls", $"obj_cls_inst", $"relation_tree", $"s_Tag", $"kv.k" as "attribute_name", $"kv.v" as "attribute_value").show
+---------------+--------------------+-------------+-------+------------+--------------------+-------+--------------+---------------+
| file_path| file_name|received_time|obj_cls|obj_cls_inst| relation_tree| s_Tag|attribute_name|attribute_value|
+---------------+--------------------+-------------+-------+------------+--------------------+-------+--------------+---------------+
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| HnfoId| 650FEC74|
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| Index| 0|
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| Monitor_Type| Lenovo Monitor|
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| Vendor_Data| 58F5Y|
+---------------+--------------------+-------------+-------+------------+--------------------+-------+--------------+---------------+
Upvotes: 1