Leibnitz

Reputation: 355

Spark: Splitting JSON strings into separate dataframe columns

I'm loading the JSON string below into a dataframe column.

{
    "title": {
        "titleid": "222",
        "titlename": "ABCD"
    },
    "customer": {
        "customerDetail": {
            "customerid": 878378743,
            "customerstatus": "ACTIVE",
            "customersystems": {
                "customersystem1": "SYS01",
                "customersystem2": null
            },
            "sysid": null
        },
        "persons": [{
            "personid": "123",
            "personname": "IIISKDJKJSD"
        },
        {
            "personid": "456",
            "personname": "IUDFIDIKJK"
        }]
    }
}
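
For context (the post never shows how df is built), the snippet below assumes a dataframe df holding the raw JSON string in a single string column named value. A minimal sketch of that assumed setup:

import spark.implicits._

// Hypothetical setup: one row, with the whole JSON document as a string in "value"
val jsonString = scala.io.Source.fromFile("./src/main/resources/json/customer.txt").mkString
val df = Seq(jsonString).toDF("value")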

import org.apache.spark.sql.functions.from_json

// Infer the schema from the sample file, then use it to parse the string column
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)

//Schema:

StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))

//Output:

+------------------------------+---------------------------------------+
|customerDetail                |persons                                |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+

My question: is there a way to split the key/value pairs into separate dataframe columns as shown below, while keeping the array columns as-is, since I need exactly one record per JSON string:

Example for customer column:

customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }

Upvotes: 1

Views: 1521

Answers (2)

baitmbarek

Reputation: 2518

Edited post:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

val df = spark.read.json("your/path/data.json")

// Recursively collect the full dotted path of every leaf field in the schema;
// non-struct fields (including arrays, such as persons) are kept whole
def collectFields(field: String, sc: DataType): Seq[String] = {
  sc match {
    case sf: StructType => sf.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _              => Seq(field)
  }
}

// Drop the leading dot produced by the empty root prefix
val fields = collectFields("", df.schema).map(_.tail)

df.select(fields.map(col): _*).show(false)

Output:

+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons                              |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE        |SYS01          |null           |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222    |ABCD     |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
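
If the flattened columns should carry their full dotted paths as names, as the desired output in the question shows, a small variation aliases each selected column (a sketch reusing fields from the snippet above; note that column names containing dots must be wrapped in backticks when referenced later):

// Keep the full nested path as the column name,
// e.g. "customer.customerDetail.customerid"
val withPaths = fields.map(f => col(f).as(f))
df.select(withPaths: _*).show(false)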

Upvotes: 3

Awaldeep Singh

Reputation: 140

You can try this with the help of RDDs: define the column names in an empty RDD, then read the JSON, convert it to a DataFrame with .toDF(), and iterate it into the empty RDD.
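
One way to read this suggestion, sketched with a plain select plus .toDF() rather than a literal empty RDD (the column names here are illustrative assumptions, and js is the dataframe from the question):

import spark.implicits._

// Pick the nested fields explicitly, then rename all resulting columns at once
val flat = js.select(
    $"customer.customerDetail.customerid",
    $"customer.customerDetail.customerstatus",
    $"customer.customerDetail.customersystems.customersystem1",
    $"customer.customerDetail.customersystems.customersystem2",
    $"customer.customerDetail.sysid",
    $"customer.persons")
  .toDF("customerid", "customerstatus", "customersystem1",
        "customersystem2", "sysid", "persons")
flat.show(false)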

Upvotes: 0
