Reputation: 417
I am new to Apache Spark, so forgive me if this is a noob question. I am trying to define a particular schema before reading in the dataset in order to speed up processing. There are a few data types that I am not sure how to define (ArrayType and StructType).
Here is a screenshot of the schema I am working with:
Here is what I have so far:
from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("attribution", ArrayType(), True),        # element type still missing
    StructField("averagingPeriod", StructType(), True),   # nested fields still missing
    StructField("city", StringType(), True),
    StructField("coordinates", StructType(), True),       # nested fields still missing
    StructField("country", StringType(), True),
    StructField("date", StructType(), True),              # nested fields still missing
    StructField("location", StringType(), True),
    StructField("mobile", BooleanType(), True),
    StructField("parameter", StringType(), True),
    StructField("sourceName", StringType(), True),
    StructField("sourceType", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("value", DoubleType(), True)
])
My question is: how do I account for the nested name and url fields under the attribution column, the unit and value fields under the averagingPeriod column, and so on?
For reference, here is the dataset I am using: https://registry.opendata.aws/openaq/.
Upvotes: 0
Views: 1250
Reputation: 42402
Here's an example of ArrayType and StructType; it should be straightforward to extend this to the other columns.
from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType())
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType())
    ]), True),
    # ... etc.
])
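Following the same pattern, you can fill in the remaining struct columns and pass the schema to the reader. Here is a minimal sketch; the nested fields for coordinates and date (latitude/longitude doubles, utc/local strings) and the S3 path are assumptions, so verify them against a small sample with spark.read.json(path).printSchema() first.

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

# Assumed nested layouts -- check these against your actual data.
coordinatesType = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType())
])
dateType = StructType([
    StructField("utc", StringType()),    # assumption: timestamps stored as strings
    StructField("local", StringType())
])

fullSchema = StructType([
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType())
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType())
    ]), True),
    StructField("city", StringType(), True),
    StructField("coordinates", coordinatesType, True),
    StructField("country", StringType(), True),
    StructField("date", dateType, True),
    StructField("location", StringType(), True),
    StructField("mobile", BooleanType(), True),
    StructField("parameter", StringType(), True),
    StructField("sourceName", StringType(), True),
    StructField("sourceType", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("value", DoubleType(), True)
])

# Supplying the schema up front skips Spark's schema-inference pass over the data.
df = spark.read.schema(fullSchema).json("s3a://openaq-fetches/realtime/")  # hypothetical path
df.printSchema()

Once loaded, nested values are reachable with dot notation, e.g. df.select("coordinates.latitude", "averagingPeriod.unit").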
Upvotes: 1