Reputation: 6085
I have a CSV file, test.csv:
col
1
2
3
4
When I read it using Spark, it gets the schema of data correct:
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
df.printSchema
root
|-- col: integer (nullable = true)
But when I override the schema of the CSV file and set inferSchema to false, the SparkSession picks up the custom schema only partially.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "false")
  .schema(StructType(List(StructField("custom", StringType, false))))
  .csv("test.csv")
df.printSchema
root
|-- custom: string (nullable = true)
I mean, only the column name (custom) and the DataType (StringType) are being picked up, but the nullable part is being ignored: it still comes out as nullable = true, which is incorrect.
I am not able to understand this behavior. Any help is appreciated!
Upvotes: 3
Views: 1494
Reputation: 30300
Consider this excerpt from the documentation about Parquet (a popular "Big Data" storage format):
"Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons."
CSV is handled the same way for the same reason.
As for what "compatibility reasons" means, Nathan Marz argues in his book Big Data that an ideal storage format is both strongly typed for integrity and flexible for evolution. In other words, it should be easy to add and remove fields without your analytics blowing up. Parquet is both typed and flexible; CSV is just flexible. Spark honors that flexibility by making columns nullable no matter what you do. You can debate whether you like that approach.
A SQL table has a rigorously defined schema that is hard to change; so much so that Scott Ambler wrote a big book on how to refactor them. Parquet and CSV are much less rigorous. They are both suited to the paradigms for which they were built, and Spark takes the liberal approach typically associated with "Big Data" storage formats.
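To see the documented behavior in action, here is a minimal sketch (the /tmp path is just a placeholder and an existing spark session is assumed): a column that starts out non-nullable comes back nullable after a Parquet round trip, which mirrors what you are seeing with CSV.

// spark.range produces a non-nullable "id" column
val ids = spark.range(3).toDF("id")
ids.printSchema()
// root
//  |-- id: long (nullable = false)

// Round-trip through Parquet (example path)
ids.write.mode("overwrite").parquet("/tmp/ids_parquet")
spark.read.parquet("/tmp/ids_parquet").printSchema()
// root
//  |-- id: long (nullable = true)   <- converted to nullable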
Upvotes: 2
Reputation: 1724
I believe the inferSchema option is common to, and applies across, all the columns in a DataFrame. But if we want to change the nullable property of a specific column, we could handle it with something like:
setNullableStateOfColumn(df, "col", false)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

def setNullableStateOfColumn(df: DataFrame, cn: String, nullable: Boolean): DataFrame = {
  // get the existing schema
  val schema = df.schema
  // rebuild the StructField named `cn` with the desired nullable flag
  val newSchema = StructType(schema.map {
    case StructField(c, t, _, m) if c.equals(cn) => StructField(c, t, nullable, m)
    case y: StructField => y
  })
  // recreate the DataFrame with the new schema
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
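A hypothetical usage against the DataFrame from the question (the column name "custom" comes from the custom schema above):

val dfNotNull = setNullableStateOfColumn(df, "custom", nullable = false)
dfNotNull.printSchema
// root
//  |-- custom: string (nullable = false)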
There is a similar thread about setting the nullable property of a column:
Change nullable property of column in spark dataframe
Upvotes: 1