MrCartoonology

Reputation: 2067

How do I use Python Spark API to specify a dataframe schema by hand?

I would like to specify a schema for Spark DataFrames in Python. After I load the data once, I can print the schema and see something like:

df = spark.read.json(datapath)
df.schema

StructType(List(StructField(fldname,StringType,true)))

Having created this Python object, df.schema, by reading the data, I can now use it to read more. However, I think I will wait less if I don't have to read the data first just to get the schema. I'd like to persist the schema, even if that just means typing it into my script. For typing it in, I've tried:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([ StructField('fldname', StringType, True)])

but I get this error:

AssertionError: dataType should be DataType

This is Spark 2.0.2.

Upvotes: 1

Views: 536

Answers (1)

koiralo

Reputation: 23119

When creating the schema, you left out the parentheses () after StringType:

schema = StructType([ StructField('fldname', StringType(), True)])

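In Python, StructField expects an instance of a DataType, so you need to construct it as StringType() rather than passing the bare class StringType. (In Scala, StringType is a singleton object, which is why it is written without parentheses there.)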

Hope this solves the issue.

Upvotes: 2
