Shern

Reputation: 831

Dataframe schema is different from manually defined schema (textbook example)

I am just following the example in Spark: The Definitive Guide (chapter 5):

from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
    StructField('DEST_COUNTRY_NAME', StringType(), True),
    StructField('ORIGIN_COUNTRY_NAME', StringType(), True),
    StructField('count', LongType(), False, metadata={'hello': 'world'})])

df = spark.read.format('json').schema(myManualSchema).load('/data/flight-data/json/2015-summary.json')

But when I print the schema, it shows that count is still nullable. Why is that? I am using PySpark (Spark 2.4.5) in the Zeppelin 0.8.1 Docker image.

print(myManualSchema)
print(df.schema)

>>> StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,false)))
>>> StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

Upvotes: 0

Views: 232

Answers (1)

Ged

Reputation: 18023

It's simply the way Spark works when reading from a file source.

It's a built-in safety valve that keeps things from failing at run time: Spark cannot guarantee at read time that a file contains no nulls in a given column, so it marks every field as nullable regardless of the schema you supply.

If you search, you will find other Q&A describing the same behaviour, I am sure.

If you build a DataFrame from your own in-memory data instead, e.g. val df = Seq(...).toDF(...), this may not occur, because Spark can verify the rows against the nullability you declare (see the sketch below).
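
The PySpark equivalent would be something along these lines (a minimal sketch; the rows and the two-column schema here are made up for illustration):

from pyspark.sql.types import StructField, StructType, StringType, LongType

manualSchema = StructType([
    StructField('DEST_COUNTRY_NAME', StringType(), True),
    StructField('count', LongType(), False)])

# hypothetical in-memory rows, just for illustration
localDf = spark.createDataFrame(
    [('United States', 15), ('Ireland', 344)], schema=manualSchema)

print(localDf.schema)
# count stays non-nullable here, because Spark can check the
# in-memory rows against the schema instead of trusting a file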

This has nothing to do with PySpark; it is a generic Spark aspect. Textbooks can also be wrong, or the APIs are subject to change.
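
If you really need the non-nullable flag on a DataFrame read from a file, one common workaround (a sketch, not a dedicated API for this) is to reapply your manual schema on top of the already-loaded rows:

# re-wrap the loaded rows with the manual schema; nullability is kept as declared
strictDf = spark.createDataFrame(df.rdd, schema=myManualSchema)
print(strictDf.schema)
# caution: if the file really does contain nulls in 'count',
# actions on strictDf may fail at run time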

Upvotes: 1
