Aspire

Reputation: 417

Spark: How to define a nested schema?

I am new to Apache Spark, so forgive me if this is a noob question. I am trying to define a particular schema before reading in the dataset in order to speed up processing. There are a few data types that I am not sure how to define (ArrayType and StructType).

Here is a screenshot of the schema I am working with:

[Screenshot of the schema; image not preserved]

Here is what I have so far:

jsonSchema = StructType([StructField("attribution", ArrayType(), True), 
                         StructField("averagingPeriod", StructType(), True),
                         StructField("city", StringType(), True),
                         StructField("coordinates", StructType(), True),
                         StructField("country", StringType(), True),
                         StructField("date", StructType(), True),
                         StructField("location", StringType(), True),
                         StructField("mobile", BooleanType(), True),
                         StructField("parameter", StringType(), True),
                         StructField("sourceName", StringType(), True),
                         StructField("sourceType", StringType(), True),
                         StructField("unit", StringType(), True),
                         StructField("value", DoubleType(), True)
                        ])

My question is: how do I declare the nested fields, such as name and url under the attribution column, or unit and value under the averagingPeriod column?

For reference, here is the dataset I am using: https://registry.opendata.aws/openaq/.

Upvotes: 0

Views: 1250

Answers (1)

mck

Reputation: 42402

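Here's an example of an array type and a struct type. ArrayType takes the element type as its argument, and StructType takes a list of StructField objects, so nested fields are declared by nesting these constructors. The other columns follow the same pattern.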

from pyspark.sql.types import ArrayType, DoubleType, StringType, StructField, StructType

jsonSchema = StructType([
    # attribution is an array of structs; averagingPeriod is a single struct
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType()),
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType()),
    ]), True),
    # ... etc.
])

Upvotes: 1
