Reputation: 41
I am working on parsing json to spark dataframe in scala. I have a nested json file of 50 different records of different household items. On JSON I am trying to parse the equipment tag is as below:
"equipment":[{"tv":[""]}]
Due to this item name (ex: tv
in this case) is becoming column name than values.
Ideally this tag should be like,
"equipment":["tv"]
Is there a way parse this type of JSON tags/ contents?
Due to this the dataframe schema is being shown as:
|-- equipment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ac: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- tv: array (nullable = true)
| | | |-- element: string (containsNull = true)
Where you can see that (above) ac
& tv
are becoming column headers. Instead of that i need them to shown as values. The dataframe should look like:
+----------+
|equipment |
+----------+
|tv |
|ac |
+----------+
Upvotes: 0
Views: 90
Reputation: 41987
A simple explode
function should have done the trick for you but looking at your schema
, two explode
functions would do the trick as
val newdf = dataframe.withColumn("equipment", explode($"equipment"))
newdf.withColumn("equipment", explode(array($"equipment.*"))).show(false)
With these steps you should have the desired result as in the question.
Edited
From your comments it seems that you are trying to explode the fieldNames not the values. So the following code should work then
val newdf = dataframe.withColumn("equipment", explode($"equipment"))
sc.parallelize(newdf.select("equipment.*").schema.fieldNames.toSeq).toDF("equipment").show(false)
Here's the complete code I am testing with
val data = Seq("""{"equipment":[{"tv":[""],"ac":[""]}]}""")
val dataframe = sqlContext.read.json(sc.parallelize(data))
dataframe.printSchema()
val newdf = dataframe.withColumn("equipment", explode($"equipment"))
sc.parallelize(newdf.select("equipment.*").schema.fieldNames.toSeq).toDF("equipment").show(false)
the printed schema matches with yours
root
|-- equipment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ac: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- tv: array (nullable = true)
| | | |-- element: string (containsNull = true)
And the result I get matches with your expected result
+---------+
|equipment|
+---------+
|ac |
|tv |
+---------+
Upvotes: 0