Reputation: 1412
I am trying to read Avro files written by PySpark that have different schemas: the precision of a decimal column differs between them. Below is the folder structure of the Avro folders written by PySpark:
/mywork/avro_data/day1/part-*
/mywork/avro_data/day2/part-*
Below are their schemas:
day1 = spark.read.format('avro').load('/mywork/avro_data/day1')
day1.printSchema()
root
|-- price: decimal(5,2) (nullable = true)
day2 = spark.read.format('avro').load('/mywork/avro_data/day2')
day2.printSchema()
root
|-- price: decimal(20,2) (nullable = true)
When reading the whole dataframe (for both days)
>>> df = spark.read.format('avro').load('/mywork/avro_data/')
it gives the below error:
java.lang.IllegalArgumentException: unscaled value too large for precision spark
Why doesn't pyspark implicitly consider the higher (backward-compatible) schema?
Upvotes: 0
Views: 1007
Reputation: 6323
Spark uses the first sample record to infer the schema. I think that in your case that sample record is of decimal(5, 2),
which causes this exception.
Regarding your question:
Why doesn't pyspark implicitly consider the higher schema?
To achieve this, Spark would need to read the whole data twice: first to infer the schema and second for processing.
Imagine, if you go this way even df.limit(1)
would first read the whole file to infer the schema and only then read the 1st record.
There is an avroSchema
option to supply the schema explicitly, as below:
val p = spark
.read
.format("avro")
.option("avroSchema", schema.toString)
.load(directory)
p.show(false)
but then each Avro file inside .load(directory)
should match that schema, which is not the case here.
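For reference, the PySpark equivalent of that option would look roughly like this. This is only a sketch: avro_schema_json is assumed to hold the Avro schema text of the wider day2 files (for example dumped from one of its part files with avro-tools getschema), and the /mywork/day2_schema.avsc path is hypothetical.
# Sketch: supply an explicit Avro reader schema instead of letting Spark infer one.
# The schema file path is hypothetical; it is assumed to contain the Avro schema
# of the day2 files, e.g. extracted with `java -jar avro-tools.jar getschema part-...`.
with open('/mywork/day2_schema.avsc') as f:
    avro_schema_json = f.read()

p = (spark.read
     .format('avro')
     .option('avroSchema', avro_schema_json)
     .load('/mywork/avro_data/'))
p.show(truncate=False)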
Alternative
Read both dataframes separately and then union them, as sketched below.
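A minimal PySpark sketch of that approach, assuming the paths from the question and using the wider decimal(20,2) as the common type:
from pyspark.sql.functions import col

# Read each day's folder separately so each keeps its own schema.
day1 = spark.read.format('avro').load('/mywork/avro_data/day1')
day2 = spark.read.format('avro').load('/mywork/avro_data/day2')

# Widen the narrower decimal(5,2) column to decimal(20,2) so the schemas line up.
day1 = day1.withColumn('price', col('price').cast('decimal(20,2)'))

# Union by column name and verify the resulting schema.
df = day1.unionByName(day2)
df.printSchema()   # price: decimal(20,2)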
Upvotes: 1