Anji
Anji

Reputation: 41

Need help Parsing strange JSON with scala

I am working on parsing json to spark dataframe in scala. I have a nested json file of 50 different records of different household items. On JSON I am trying to parse the equipment tag is as below:

"equipment":[{"tv":[""]}]

Due to this item name (ex: tv in this case) is becoming column name than values.

Ideally this tag should be like,

"equipment":["tv"]

Is there a way parse this type of JSON tags/ contents?

Due to this the dataframe schema is being shown as:

 |-- equipment: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ac: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- tv: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

Where you can see that (above) ac & tv are becoming column headers. Instead of that i need them to shown as values. The dataframe should look like:

+----------+
|equipment |
+----------+
|tv        |
|ac        |
+----------+

Upvotes: 0

Views: 90

Answers (1)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41987

A simple explode function should have done the trick for you but looking at your schema, two explode functions would do the trick as

val newdf = dataframe.withColumn("equipment", explode($"equipment"))
newdf.withColumn("equipment", explode(array($"equipment.*"))).show(false)

With these steps you should have the desired result as in the question.

Edited

From your comments it seems that you are trying to explode the fieldNames not the values. So the following code should work then

val newdf = dataframe.withColumn("equipment", explode($"equipment"))
sc.parallelize(newdf.select("equipment.*").schema.fieldNames.toSeq).toDF("equipment").show(false)

Here's the complete code I am testing with

val data = Seq("""{"equipment":[{"tv":[""],"ac":[""]}]}""")
val dataframe = sqlContext.read.json(sc.parallelize(data))
dataframe.printSchema()
val newdf = dataframe.withColumn("equipment", explode($"equipment"))
sc.parallelize(newdf.select("equipment.*").schema.fieldNames.toSeq).toDF("equipment").show(false)

the printed schema matches with yours

root
 |-- equipment: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ac: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- tv: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

And the result I get matches with your expected result

+---------+
|equipment|
+---------+
|ac       |
|tv       |
+---------+

Upvotes: 0

Related Questions