Reputation: 31
I am new to Spark. All I want to do is read nested JSONs and group them based on a certain condition. For example, if the JSON contains details of a person, such as his city and zipcode, I would want to group the people who belong to the same city and zipcode.
I have progressed as far as reading the JSONs into a Dataset, but I don't know how to group them.
My nested JSON format is:
{
  "entity": {
    "name": "SJ",
    "id": 31
  },
  "hierarchy": {
    "state": "TN",
    "city": "CBE"
  },
  "data": {}
}
This is the code I have written to read the nested JSON from a file:
public void groupJsonString(SparkSession spark) {
    Dataset<Row> studentRecordDS = spark.read()
            .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
            .json("/home/shiney/Documents/NGA/sparkJsonFiles/*.json");

    StructType st = studentRecordDS.schema();
    List<StructType> nestedList = new ArrayList<>();
    for (StructField field : st.fields()) {
        nestedList.add((StructType) field.dataType());
    }
}
Upvotes: 0
Views: 285
Reputation: 74669
TL;DR Use spark.read.json (as you did), then "flatten" the nested columns in select.
(I use Scala and leave converting it to Java as a home exercise for you. :))
Let's use your sample.
$ cat ../datasets/sample.json
{
  "entity": {
    "name": "SJ",
    "id": 31
  },
  "hierarchy": {
    "state": "TN",
    "city": "CBE"
  },
  "data": {}
}
The code could be as follows (again, it's Scala):
val entities = spark
  .read
  .option("multiLine", true)
  .json("../datasets/sample.json")
scala> entities.printSchema
root
|-- entity: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
|-- hierarchy: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
Let's flatten the entity and hierarchy top-level columns.
scala> entities.select("entity.*", "hierarchy.*").show
+---+----+----+-----+
| id|name|city|state|
+---+----+----+-----+
| 31| SJ| CBE| TN|
+---+----+----+-----+
Aggregation should be a no-brainer now.
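For instance, grouping people by the same city and state (the sample has no zipcode, so state stands in for it) could be sketched as follows; the column and variable names simply follow the example above:

```scala
import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // enables the $"column" syntax

// Pull the nested fields up, then group by location and
// collect the names of the people in each group.
val grouped = entities
  .select($"entity.name", $"hierarchy.city", $"hierarchy.state")
  .groupBy($"city", $"state")
  .agg(collect_list($"name").as("people"))

grouped.show()
// With the single sample record this yields one group: (CBE, TN) -> [SJ]
```

Any other aggregate from org.apache.spark.sql.functions (count, collect_set, etc.) can be swapped in for collect_list depending on what you need per group.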
Upvotes: 2