Reputation: 831
I am just following the example in Spark: The Definitive Guide (chapter 5):
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
    StructField('DEST_COUNTRY_NAME', StringType(), True),
    StructField('ORIGIN_COUNTRY_NAME', StringType(), True),
    StructField('count', LongType(), False, metadata={'hello': 'world'})
])
df = spark.read.format('json').schema(myManualSchema) \
    .load('/data/flight-data/json/2015-summary.json')
But when I print the schema, it shows that count is still nullable. Any reason why? I am using PySpark (Spark 2.4.5) in a Zeppelin 0.8.1 Docker container.
print(myManualSchema)
print(df.schema)
>>> StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,false)))
>>> StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))
Upvotes: 0
Views: 232
Reputation: 18023
This is simply the way Spark works when reading from a file source: it forces every field of the supplied schema to nullable = true. It is a built-in safety valve that keeps jobs from failing at run time, since Spark cannot guarantee that an external file contains no nulls. If you search, you will find other Q&As describing the same behaviour.
If you instead build the DataFrame from your own in-memory data, e.g. val df = Seq(...).toDF in Scala or spark.createDataFrame in PySpark, this may not occur and the nullability you declare can be preserved.
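For example, a minimal PySpark sketch (assuming a running SparkSession named spark, as in the question) where the declared nullability survives:

from pyspark.sql.types import StructField, StructType, LongType

schema = StructType([
    StructField('count', LongType(), False)  # declared non-nullable
])

# In-memory data: Spark keeps the nullability exactly as declared
df_mem = spark.createDataFrame([(1,), (2,)], schema)
print(df_mem.schema)
# StructType(List(StructField(count,LongType,false)))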
This has nothing to do with PySpark; it is a generic Spark aspect. Text books can also be wrong, or the APIs are subject to change.
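If you really need the schema to report nullable = false after loading the file, one common workaround (a sketch, not something the book prescribes) is to re-apply the manual schema to the already-loaded rows:

# Re-apply the manual schema on top of the loaded data; here Spark
# trusts the caller, so the declared nullable = false is kept, and
# a row with a null count will fail schema verification instead.
df_strict = spark.createDataFrame(df.rdd, myManualSchema)
print(df_strict.schema)

Note this only changes what the schema reports; it does not make the underlying file any safer.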
Upvotes: 1