Reputation: 5390
I have the following JSON:
{
"value":[
{"C1":"val1","C2":"val2"},
{"C1":"val1","C2":"val2"},
{"C1":"val1","C2":"val2"}
]
}
that I am trying to read like this:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/Projects.json")
.show(10)
But it does not show my records properly in the DataFrame. How do I get around that "value" nesting so that each array element becomes its own row?
The result I am trying to get is:
C1 | C2
-------------------
VAL1 | VAL2
VAL1 | VAL2
...etc
Upvotes: 1
Views: 1369
Reputation: 42392
Using inline will do the job:
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/Projects.json")
val df2 = df.selectExpr("inline(value)")
df2.show
+----+----+
| C1| C2|
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
Upvotes: 0
Reputation: 18495
Looking at the schema of the DataFrame (jsonDf) returned by spark.read:
jsonDf.printSchema()
root
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- C1: string (nullable = true)
| | |-- C2: string (nullable = true)
you could use the SQL function explode and then select the two elements C1 and C2 as shown below:
val df = jsonDf
  .withColumn("parsedJson", explode(col("value")))
  .withColumn("C1", col("parsedJson.C1"))
  .withColumn("C2", col("parsedJson.C2"))
  .select(col("C1"), col("C2"))
df.show(false)
This leads to the required outcome:
+----+----+
|C1 |C2 |
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
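The same result can also be reached more compactly by exploding and star-expanding the struct in a single pair of selects. This is a sketch, assuming the same jsonDf as above:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Explode the array into one struct per row, then expand the struct's
// fields (C1, C2) into top-level columns with "*".
val df = jsonDf
  .select(explode(col("value")).as("v"))
  .select("v.*")
df.show(false)
```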
Upvotes: 1
Reputation: 5390
I finally managed to find a solution to my problem using the following function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenDataframe(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  val fieldNames = fields.map(_.name)
  for (i <- fields.indices) {
    val field = fields(i)
    val fieldName = field.name
    field.dataType match {
      // Arrays: explode each element into its own row, then recurse
      case _: ArrayType =>
        val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
        val fieldNamesAndExplode =
          fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName) as $fieldName")
        val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
        return flattenDataframe(explodedDf)
      // Structs: promote children to top-level columns (dots become underscores),
      // then recurse
      case structType: StructType =>
        val childFieldNames = structType.fieldNames.map(childName => s"$fieldName.$childName")
        val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
        val renamedCols = newFieldNames.map(x => col(x).as(x.replace(".", "_")))
        val flattenedDf = df.select(renamedCols: _*)
        return flattenDataframe(flattenedDf)
      case _ =>
    }
  }
  df
}
Source https://medium.com/@saikrishna_55717/flattening-nested-data-json-xml-using-apache-spark-75fa4c8ea2a7
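A minimal usage sketch, assuming jsonDf is the DataFrame read from /Projects.json; note that because the function prefixes child columns with the parent struct's name, the flattened columns come back as value_C1 and value_C2 rather than C1 and C2:

```scala
// Hypothetical usage of flattenDataframe on the question's JSON.
val flat = flattenDataframe(jsonDf)
flat.show(false)
// The "value" array is exploded row-by-row, then the struct children are
// promoted to top-level columns named value_C1 and value_C2.
```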
Upvotes: 0