Reputation: 2611
I am trying to load some json file to pyspark with only specific columns like below
df = spark.read.json("sample/json/", schema=schema)
So I started writing a input read schema for below main schema
|-- test_name: string (nullable = true)
|-- test_file: string (nullable = true)
|-- test_id: string (nullable = true)
|-- test_type: string (nullable = true)
|-- test_url: string (nullable = true)
|-- test_ids: array (nullable = true)
| |-- element: string (containsNull = true)
|-- value: struct (nullable = true)
| |-- ct: long (nullable = true)
| |-- dimmingSetting: long (nullable = true)
| |-- hue: double (nullable = true)
| |-- modeId: string (nullable = true)
I tried to write for the direct string type but I am not able to write for array and struct type
schema = StructType([
StructField('test_name', StringType()),
StructField('test_file', StringType()),
StructField('test_id', StringType()),
StructField('test_type', StringType()),
StructField('test_url', StringType()),
])
How to extend this schema for
|-- test_ids: array (nullable = true)
|-- value: struct (nullable = true)
Upvotes: 2
Views: 4518
Reputation: 41987
the extended version should be
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, LongType, DoubleType
schema = StructType([
StructField('test_name', StringType(), True),
StructField('test_file', StringType(), True),
StructField('test_id', StringType(), True),
StructField('test_type', StringType(), True),
StructField('test_url', StringType(), True),
StructField('test_ids', ArrayType(StringType(), True), True),
StructField('value', StructType([
StructField('ct', LongType(), True),
StructField('dimmingSetting', LongType(), True),
StructField('hue', DoubleType(), True),
StructField('modeId', StringType(), True)
])
)
])
I hope the answer is helpful
Upvotes: 3