Reputation: 87
Here is my nested JSON file.
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": { "lat": 38.00, "long": 97.00 }
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": { "lat": 47.41, "long": -122.00 }
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": { "lat": 33.61, "long": -111.89 }
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": { "lat": 35.93, "long": -85.46 }
    }
  }
}
How can I read this JSON file into a DataFrame with Spark Scala? There is no array object in the JSON file, so I can't use explode. Can anyone help?
Upvotes: 2
Views: 11150
Reputation: 1765
import org.apache.spark.sql.functions.{array, explode}
import spark.implicits._

// Read the multi-line JSON, pack the sensor structs into an array column, and explode it into rows.
val df = spark.read.option("multiline", true).json("myfile.json")
df.select($"dc_id", explode(array("source.*")))
  .select($"dc_id", $"col.c02_level", $"col.description", $"col.geo.lat", $"col.geo.long", $"col.id", $"col.ip", $"col.temp")
  .show(false)
Output:
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc_id |c02_level|description |lat |long |id |ip |temp|
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc-101|1475 |Sensor attached to the container ceilings |38.0 |97.0 |10 |68.28.91.22 |35 |
|dc-101|1346 |Sensor attached to the factory ceilings |33.61|-111.89|8 |208.109.163.218|40 |
|dc-101|1370 |Sensor ipad attached to carbon cylinders |47.41|-122.0 |13 |67.185.72.1 |34 |
|dc-101|1574 |Sensor embedded in exhaust pipes in the ceilings|35.93|-85.46 |5 |204.116.105.67 |40 |
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
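For context, array("source.*") packs the four sensor structs into a single array column, and explode then emits one row per struct under the default column name col, which is why the second select can reach into col.c02_level, col.geo.lat, and so on. If you want to see the structure Spark infers before flattening, here is a minimal check (a sketch, assuming the same file path as above):

// Inspect the inferred schema: dc_id as a string plus a `source` struct
// holding one nested struct per sensor.
val raw = spark.read.option("multiline", true).json("myfile.json")
raw.printSchema()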
Upvotes: 0
Reputation: 267
import org.apache.spark.sql.functions.{array, col, explode}

// Explode the sensor structs into one row each, then lift the nested fields into top-level columns.
val df = spark.read.option("multiline", true).json("data/test.json")
df
  .select(col("dc_id"), explode(array("source.*")) as "level1")
  .withColumn("id", col("level1.id"))
  .withColumn("ip", col("level1.ip"))
  .withColumn("temp", col("level1.temp"))
  .withColumn("description", col("level1.description"))
  .withColumn("c02_level", col("level1.c02_level"))
  .withColumn("lat", col("level1.geo.lat"))
  .withColumn("long", col("level1.geo.long"))
  .drop("level1")
  .show(false)
Sample Output:
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc_id |id |ip |temp|description |c02_level|lat |long |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc-101|10 |68.28.91.22 |35 |Sensor attached to the container ceilings |1475 |38.0 |97.0 |
|dc-101|8 |208.109.163.218|40 |Sensor attached to the factory ceilings |1346 |33.61|-111.89|
|dc-101|13 |67.185.72.1 |34 |Sensor ipad attached to carbon cylinders |1370 |47.41|-122.0 |
|dc-101|5 |204.116.105.67 |40 |Sensor embedded in exhaust pipes in the ceilings|1574 |35.93|-85.46 |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
Instead of selecting each column by hand, you could write a generic function that pulls out all of the individual columns; a rough sketch of that idea follows after the note below.
Note: Tested with Spark 2.3
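A possible shape for that generic approach (just a sketch, not part of the answer above; it reads the field names from the exploded struct's schema instead of hard-coding them, and only descends one level into nested structs such as geo):

import org.apache.spark.sql.functions.{array, col, explode}
import org.apache.spark.sql.types.StructType

// Explode the sensor structs as before, then derive the flat column list from the schema.
val exploded = df.select(col("dc_id"), explode(array("source.*")) as "level1")
val level1Type = exploded.schema("level1").dataType.asInstanceOf[StructType]
val flatCols = level1Type.fields.toSeq.flatMap { f =>
  f.dataType match {
    // Nested struct (e.g. geo): promote each of its fields to a top-level column.
    case s: StructType => s.fieldNames.toSeq.map(n => col(s"level1.${f.name}.$n").as(n))
    // Plain field: keep it under its own name.
    case _             => Seq(col(s"level1.${f.name}").as(f.name))
  }
}
exploded.select(col("dc_id") +: flatCols: _*).show(false)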
Upvotes: 3
Reputation: 1345
Assuming the JSON string has been read into a variable called jsonString:
import org.apache.spark.sql._
import org.apache.spark.sql.functions.{array, explode}
import spark.implicits._

// Build a DataFrame directly from the JSON string, then pull out a nested field.
val df = spark.read.json(Seq(jsonString).toDS)
val df1 = df.withColumn("lat", explode(array("source.sensor-igauge.geo.lat")))
You can follow the same steps for other structures as well, such as map or array structures.
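For instance, several nested fields of the same sensor can be pulled out in one select (a small sketch building on the df above; the column names come straight from the JSON):

import org.apache.spark.sql.functions.col

df.select(
  col("dc_id"),
  col("source.sensor-igauge.id").as("id"),
  col("source.sensor-igauge.temp").as("temp"),
  col("source.sensor-igauge.c02_level").as("c02_level"),
  col("source.sensor-igauge.geo.lat").as("lat"),
  col("source.sensor-igauge.geo.long").as("long")
).show(false)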
Upvotes: 1