Reputation: 417
I am new to Apache Spark, so forgive me if this is a noob question. I am trying to define a particular schema before reading in the dataset in order to speed up processing. There are a few data types that I am not sure how to define (ArrayType and StructType).
Here is a screenshot of the schema I am working with:
Here is what I have so far:
from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("attribution", ArrayType(), True),        # element type still missing
    StructField("averagingPeriod", StructType(), True),   # nested fields still missing
    StructField("city", StringType(), True),
    StructField("coordinates", StructType(), True),       # nested fields still missing
    StructField("country", StringType(), True),
    StructField("date", StructType(), True),              # nested fields still missing
    StructField("location", StringType(), True),
    StructField("mobile", BooleanType(), True),
    StructField("parameter", StringType(), True),
    StructField("sourceName", StringType(), True),
    StructField("sourceType", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("value", DoubleType(), True)
])
My question is: how do I account for the nested name and url fields under the attribution column, the unit and value fields under the averagingPeriod column, and so on?
For reference, here is the dataset I am using: https://registry.opendata.aws/openaq/.
Upvotes: 0
Views: 1250
Reputation: 42402
Here's an example of ArrayType and StructType; it should be straightforward to extend this to the other columns.
from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType())
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType())
    ]), True),
    # ... etc.
])
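Following the same pattern, you can fill in the remaining struct columns and pass the schema to the reader. Here is a minimal sketch; the nested fields for coordinates and date (latitude/longitude doubles, utc/local strings) and the S3 path are assumptions, so verify them against a small sample with spark.read.json(path).printSchema() first.

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

# Assumed nested layouts -- check these against your actual data.
coordinatesType = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType())
])
dateType = StructType([
    StructField("utc", StringType()),    # assumption: timestamps stored as strings
    StructField("local", StringType())
])

fullSchema = StructType([
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType())
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType())
    ]), True),
    StructField("city", StringType(), True),
    StructField("coordinates", coordinatesType, True),
    StructField("country", StringType(), True),
    StructField("date", dateType, True),
    StructField("location", StringType(), True),
    StructField("mobile", BooleanType(), True),
    StructField("parameter", StringType(), True),
    StructField("sourceName", StringType(), True),
    StructField("sourceType", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("value", DoubleType(), True)
])

# Supplying the schema up front skips Spark's schema-inference pass over the data.
df = spark.read.schema(fullSchema).json("s3a://openaq-fetches/realtime/")  # hypothetical path
df.printSchema()

Once loaded, nested values are reachable with dot notation, e.g. df.select("coordinates.latitude", "averagingPeriod.unit").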
Upvotes: 1