Reputation: 31
I am new to Spark. All I want to do is read nested JSONs and group them based on a certain condition. For example, if the JSON contains details of a person, such as his city and zipcode, I would want to group the people who belong to the same city and zipcode.
I have progressed as far as reading the JSONs into a Dataset, but I don't know how to group them.
My nested JSON format is:
{
  "entity": {
    "name": "SJ",
    "id": 31
  },
  "hierarchy": {
    "state": "TN",
    "city": "CBE"
  },
  "data": {}
}
This is the code I have written to read the nested JSON from a file:
public void groupJsonString(SparkSession spark) {
    Dataset<Row> studentRecordDS = spark.read()
            .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
            .json("/home/shiney/Documents/NGA/sparkJsonFiles/*.json");

    StructType st = studentRecordDS.schema();
    List<StructType> nestedList = new ArrayList<>();
    for (StructField field : st.fields()) {
        nestedList.add((StructType) field.dataType());
    }
}
Upvotes: 0
Views: 285
Reputation: 74669
TL;DR Use spark.read.json (as you did), then "flatten" the nested columns in select.
(I use Scala and leave converting it to Java as a home exercise for you. :))
Let's use your sample.
$ cat ../datasets/sample.json
{
  "entity": {
    "name": "SJ",
    "id": 31
  },
  "hierarchy": {
    "state": "TN",
    "city": "CBE"
  },
  "data": {}
}
The code could be as follows (again, it's Scala):
val entities = spark
  .read
  .option("multiLine", true)
  .json("../datasets/sample.json")
scala> entities.printSchema
root
|-- entity: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
|-- hierarchy: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
Let's flatten the entity and hierarchy top-level columns.
scala> entities.select("entity.*", "hierarchy.*").show
+---+----+----+-----+
| id|name|city|state|
+---+----+----+-----+
| 31| SJ| CBE| TN|
+---+----+----+-----+
Aggregation should be a no-brainer now.
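For instance, grouping people by the same city and state (the sample has no zipcode, so state stands in for it) could be sketched as follows; the column and variable names simply follow the example above:

```scala
import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // enables the $"column" syntax

// Pull the nested fields up, then group by location and
// collect the names of the people in each group.
val grouped = entities
  .select($"entity.name", $"hierarchy.city", $"hierarchy.state")
  .groupBy($"city", $"state")
  .agg(collect_list($"name").as("people"))

grouped.show()
// With the single sample record this yields one group: (CBE, TN) -> [SJ]
```

Any other aggregate from org.apache.spark.sql.functions (count, collect_set, etc.) can be swapped in for collect_list depending on what you need per group.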
Upvotes: 2