Reputation: 782
I'm very new to Spark and I'm trying to parse a JSON file containing data to be aggregated, but I can't manage to navigate its content. I searched for other solutions but I wasn't able to find anything that worked in my case.
This is the schema of the DataFrame of the imported JSON:
root
|-- UrbanDataset: struct (nullable = true)
| |-- context: struct (nullable = true)
| | |-- coordinates: struct (nullable = true)
| | | |-- format: string (nullable = true)
| | | |-- height: long (nullable = true)
| | | |-- latitude: double (nullable = true)
| | | |-- longitude: double (nullable = true)
| | |-- language: string (nullable = true)
| | |-- producer: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| | |-- timestamp: string (nullable = true)
| |-- specification: struct (nullable = true)
| | |-- id: struct (nullable = true)
| | | |-- schemeID: string (nullable = true)
| | | |-- value: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- propertyDefinition: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- codeList: string (nullable = true)
| | | | | |-- dataType: string (nullable = true)
| | | | | |-- propertyDescription: string (nullable = true)
| | | | | |-- propertyName: string (nullable = true)
| | | | | |-- subProperties: struct (nullable = true)
| | | | | | |-- propertyName: array (nullable = true)
| | | | | | | |-- element: string (containsNull = true)
| | | | | |-- unitOfMeasure: string (nullable = true)
| | |-- uri: string (nullable = true)
| | |-- version: string (nullable = true)
| |-- values: struct (nullable = true)
| | |-- line: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- coordinates: struct (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- height: double (nullable = true)
| | | | | |-- latitude: double (nullable = true)
| | | | | |-- longitude: double (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- period: struct (nullable = true)
| | | | | |-- end_ts: string (nullable = true)
| | | | | |-- start_ts: string (nullable = true)
| | | | |-- property: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- name: string (nullable = true)
| | | | | | |-- val: string (nullable = true)
A subset of the whole json is attached here
My goal is to retrieve the values struct from this schema and to manipulate/aggregate all the val fields located at line.element.property.element.val.
I also tried to explode it to get every field in a column, "CSV style", but I got the error:
pyspark.sql.utils.AnalysisException: u"cannot resolve
'array(UrbanDataset.context, UrbanDataset.specification, UrbanDataset.values)'
due to data type mismatch: input to function array should all be the same type
import pyspark
import pyspark.sql.functions as psf
df = spark.read.format('json').load('data1.json')
df.select(psf.explode(psf.array("UrbanDataset.*"))).show()
Thanks
Upvotes: 0
Views: 5277
Reputation: 15258
You cannot access nested arrays directly; you need to explode them first.
explode creates one row for each element in the array.
from pyspark.sql import functions as F
df.withColumn("line", F.explode("UrbanDataset.values.line"))
Upvotes: 2