Reputation: 2165
Reading a CSV file with a defined schema, I am able to load the files and do the processing, which works fine using the code below. The schema is defined explicitly so that each column's datatype is enforced and precision is recorded accurately.
from pyspark.sql.types import StructType, StructField, StringType

source_schema = StructType([
    StructField("COL1", StringType(), True),
    StructField("COL2", StringType(), True),
    StructField("COL3", StringType(), True),
    StructField("COL4", StringType(), True),
    StructField("COL5", StringType(), True)])
df_raw_file = in_spark.read \
.format("csv") \
.option("delimiter", delimiter) \
.option("header", "false") \
.option("inferSchema", "true") \
.option("columnNameOfCorruptRecord", "BAD_RECORD") \
.schema(source_schema) \
.load(file)
Now, the CSV file we receive is going to omit a few columns from next year onwards; let's say COL4 will no longer be part of the file. But we should still be able to process both file layouts, since we reprocess old files when required. How can such a requirement be handled? I could probably read a sample of the CSV, get the columns using df.columns, and compare them against two predefined schemas (roughly like the sketch below). It would be helpful to get a lead on any other alternatives.
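For reference, this is a minimal sketch of the schema-comparison idea I have in mind; the names old_schema and new_schema are hypothetical second and third schema definitions, not part of the existing job:

# Hypothetical sketch: read the file without a schema just to count the
# delimited columns, then pick whichever predefined schema matches.
sample_cols = in_spark.read \
    .format("csv") \
    .option("delimiter", delimiter) \
    .option("header", "false") \
    .load(file) \
    .columns

source_schema = new_schema if len(sample_cols) == len(new_schema.fields) else old_schema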
Upvotes: 1
Views: 826
Reputation: 293
If you set the mode option to PERMISSIVE, it should handle your situation (see https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv#pyspark.sql.DataFrameReader.csv):
When it meets a record having fewer tokens than the length of the schema, sets null to extra fields.
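So, assuming the same source_schema, in_spark session, and variables from your question, a sketch of the read might look like this; files that no longer contain COL4 would simply have that column filled with null:

# PERMISSIVE mode pads records that have fewer tokens than the schema with nulls,
# so the old 5-column files and the new 4-column files load into the same schema.
df_raw_file = in_spark.read \
    .format("csv") \
    .option("delimiter", delimiter) \
    .option("header", "false") \
    .option("mode", "PERMISSIVE") \
    .schema(source_schema) \
    .load(file)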
Upvotes: 1