Reputation: 4776
I have many parquet files inside an S3 folder. Each one has 'A', 'B', 'C' columns. The 'A' and 'B' columns have string data type, but the 'C' column is Float
in some files and Double
in others. I want to merge these parquet files into one larger file. I am doing this merge inside AWS Glue using PySpark.
When I try to write the DataFrame to S3 using
output_s3_path = "s3://new_path/"
df.write.mode("overwrite").parquet(output_s3_path)
I am getting the following error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableDouble at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setDouble(SpecificInternalRow.scala:284)
I tried the following approaches:
spark.read.option("mergeSchema", "true")
to merge the schemas, e.g.
s3_df = spark.read.option("mergeSchema", "true").parquet("s3://path/")
spark.read.option("overwriteSchema", "true")
to overwrite the schema with the new data type
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
to disable the vectorized Parquet reader
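Putting those options together, the full read/write sequence I tried looks roughly like this (the bucket paths are placeholders):
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
s3_df = spark.read.option("mergeSchema", "true").parquet("s3://path/")
output_s3_path = "s3://new_path/"
s3_df.write.mode("overwrite").parquet(output_s3_path)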
But none of them worked. Setting the schema explicitly would not be a good solution for me, as I expect to merge many different sets of parquet files with the same job.
How can I solve this issue?
Upvotes: 1
Views: 632
Reputation: 112
Try casting the 'C' column to double in PySpark:
df = df.withColumn("C", F.col("C").cast("double"))
A full Glue job would look like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
# Initialize Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# S3 paths
input_path = "s3://your-input-bucket/path-to-parquet-files/"
output_path = "s3://your-output-bucket/path-to-merged-parquet-file/"
# Read the parquet files into a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [input_path]},
    format="parquet"
)
# Convert DynamicFrame to DataFrame
df = dyf.toDF()
# Cast the 'C' column to Double to ensure consistency
df = df.withColumn("C", F.col("C").cast("double"))
# Convert DataFrame back to DynamicFrame
dyf_cleaned = DynamicFrame.fromDF(df, glueContext, "dyf_cleaned")
# Write the merged data back to S3
glueContext.write_dynamic_frame.from_options(
    frame=dyf_cleaned,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet"
)
# Commit the job
job.commit()
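If you want to reuse the same job for different sets of parquet files without hard-coding column names, one option is to upcast every float column to double after converting to a DataFrame. This is only a sketch and assumes each mixed-type column resolves to a plain float type once read:
from pyspark.sql.types import FloatType
# Upcast every float column to double so the merged schema is consistent
for field in df.schema.fields:
    if isinstance(field.dataType, FloatType):
        df = df.withColumn(field.name, F.col(field.name).cast("double"))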
Upvotes: 0