Reputation: 355
I'm loading the JSON string below into a dataframe column.
{
"title": {
"titleid": "222",
"titlename": "ABCD"
},
"customer": {
"customerDetail": {
"customerid": 878378743,
"customerstatus": "ACTIVE",
"customersystems": {
"customersystem1": "SYS01",
"customersystem2": null
},
"sysid": null
},
"persons": [{
"personid": "123",
"personname": "IIISKDJKJSD"
},
{
"personid": "456",
"personname": "IUDFIDIKJK"
}]
}
}
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Infer the schema from a sample file, then use it to parse the string column
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
// df holds the raw JSON strings in a column named "value"
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)
//Schema:
StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))
//Output:
+------------------------------+---------------------------------------+
|customerDetail |persons |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+
My question: is there a way to split the key/value pairs into separate dataframe columns, as below, while keeping the array columns as-is? I need exactly one record per JSON string.
Example for the customer column:
customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }
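For context, one way this target shape could presumably be reached from the `newDF` produced above is to select the scalar leaves explicitly and serialize the array back to a string with `to_json`; this is a sketch only, assuming a running SparkSession and the `parsed_value` column from the earlier snippet:

import org.apache.spark.sql.functions.{col, to_json}

// Sketch: flatten the scalar leaves, keep persons as one JSON-string column.
// Assumes the newDF produced by the from_json snippet above.
val flat = newDF.select(
  col("parsed_value.customer.customerDetail.customerid"),
  col("parsed_value.customer.customerDetail.customerstatus"),
  col("parsed_value.customer.customerDetail.customersystems.customersystem1"),
  col("parsed_value.customer.customerDetail.customersystems.customersystem2"),
  col("parsed_value.customer.customerDetail.sysid"),
  to_json(col("parsed_value.customer.persons")).as("persons"))
flat.show(false)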
Upvotes: 1
Views: 1521
Reputation: 2518
Edited post:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

val df = spark.read.json("your/path/data.json")

// Recursively collect the dotted path of every leaf field.
// Arrays (like persons) are not StructTypes, so they fall through
// to the default case and are kept as single columns.
def collectFields(field: String, sc: DataType): Seq[String] = {
  sc match {
    case sf: StructType => sf.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _ => Seq(field)
  }
}

val fields = collectFields("", df.schema).map(_.tail) // drop the leading "."
df.select(fields.map(col): _*).show(false)
Output:
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE |SYS01 |null |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222 |ABCD |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
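Since the recursion keeps only each leaf's own name, two structs containing a field with the same name would yield duplicate columns. The path-collection logic itself can be exercised without a SparkSession; here is a minimal Spark-free sketch where a toy ADT stands in for Spark's `DataType`/`StructType` (the `Toy*` names are illustrative, not Spark API):

sealed trait ToyType
case class ToyStruct(fields: Seq[(String, ToyType)]) extends ToyType
case object ToyLeaf extends ToyType

// Same recursion as the answer: descend into structs, extending the
// dotted path; anything else (arrays, primitives) becomes one column.
def collectFields(prefix: String, t: ToyType): Seq[String] = t match {
  case ToyStruct(fs) => fs.flatMap { case (n, ft) => collectFields(prefix + "." + n, ft) }
  case ToyLeaf       => Seq(prefix)
}

// Toy version of the question's schema
val schema = ToyStruct(Seq(
  "customer" -> ToyStruct(Seq(
    "customerDetail" -> ToyStruct(Seq("customerid" -> ToyLeaf, "sysid" -> ToyLeaf)),
    "persons" -> ToyLeaf)), // ArrayType: treated as a leaf, kept as-is
  "title" -> ToyStruct(Seq("titleid" -> ToyLeaf))))

val fields = collectFields("", schema).map(_.tail) // drop the leading "."
// fields == Seq("customer.customerDetail.customerid",
//               "customer.customerDetail.sysid",
//               "customer.persons", "title.titleid")

Selecting `col(path)` for each collected path (as in the answer above) then flattens every struct while leaving arrays untouched.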
Upvotes: 3
Reputation: 140
You can try this with the help of RDDs: define the column names in an empty RDD, then read the JSON, convert it to a DataFrame with .toDF(), and iterate it into the empty RDD.
Upvotes: 0