Reputation: 115
I have a nested JSON where I need to convert into flattened DataFrame without defining or exploding any column names in it.
val df = sqlCtx.read.option("multiLine",true).json("test.json")
So this is how my data looks like :
[
{
"symbol": “TEST3",
"timestamp": "2019-05-07 16:00:00",
"priceData": {
"open": "1177.2600",
"high": "1179.5500",
"low": "1176.6700",
"close": "1179.5500",
"volume": "49478"
}
},
{
"symbol": “TEST4",
"timestamp": "2019-05-07 16:00:00",
"priceData": {
"open": "189.5660",
"high": "189.9100",
"low": "189.5100",
"close": "189.9100",
"volume": "267986"
}
}
]
Upvotes: 0
Views: 653
Reputation: 7316
Here is one way using the DataFrameFlattener
class implemented by Databricks:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructType}
implicit class DataFrameFlattener(df: DataFrame) {
def flattenSchema: DataFrame = {
df.select(flatten(Nil, df.schema): _*)
}
protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
}
}
df.flattenSchema.show
And the output:
+---------------+--------------+-------------+--------------+----------------+------+-------------------+
|priceData.close|priceData.high|priceData.low|priceData.open|priceData.volume|symbol| timestamp|
+---------------+--------------+-------------+--------------+----------------+------+-------------------+
| 1179.5500| 1179.5500| 1176.6700| 1177.2600| 49478| TEST3|2019-05-07 16:00:00|
| 189.9100| 189.9100| 189.5100| 189.5660| 267986| TEST4|2019-05-07 16:00:00|
+---------------+--------------+-------------+--------------+----------------+------+-------------------+
Or you can just execute a normal select:
df.select(
"priceData.close",
"priceData.high",
"priceData.low",
"priceData.open",
"priceData.volume",
"symbol",
"timestamp").show
Output:
+---------+---------+---------+---------+------+------+-------------------+
| close| high| low| open|volume|symbol| timestamp|
+---------+---------+---------+---------+------+------+-------------------+
|1179.5500|1179.5500|1176.6700|1177.2600| 49478| TEST3|2019-05-07 16:00:00|
| 189.9100| 189.9100| 189.5100| 189.5660|267986| TEST4|2019-05-07 16:00:00|
+---------+---------+---------+---------+------+------+-------------------+
Upvotes: 1