Reputation: 2067
I would like to specify a schema for Spark DataFrames in Python. After loading the data once, I can print the schema and see something like:
df = spark.read.json(datapath)
df.schema
StructType(List(StructField(fldname,StringType,true)))
Having obtained this Python object, df.schema, by reading the data, I can reuse it for further reads. However, I expect to wait less if I don't have to read the data first just to get the schema. I'd like to persist the schema, even if that just means typing it into my script. For typing it in, I've tried:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([ StructField('fldname', StringType, True)])
but I get the message
AssertionError: dataType should be DataType
This is Spark 2.0.2.
Upvotes: 1
Views: 536
Reputation: 23119
When creating the schema, you left out the parentheses after StringType:
schema = StructType([ StructField('fldname', StringType(), True)])
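In Python, StringType is a class, and StructField expects an instance of DataType, so you need to write StringType() rather than refer to it as a singleton object the way Scala does.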
Hope this solved the issue.
Upvotes: 2