Reputation: 83
I have a parquet file that is read by spark as an external table.
One of the columns is defined as int both in the parquet schema and in the spark table.
Recently, I discovered that int is too small for my needs, so I changed the column type to long in the new parquet files. I also changed the type in the spark table to bigint.
However, when I try to read an old parquet file (with int) with spark as an external table (with bigint), I get the following error:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
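For context, a rough sketch of the setup (the table name and path here are made up for illustration):
# hypothetical reproduction: the files on disk hold an int32 column,
# but the external table now declares the column as BIGINT
spark.range(10).selectExpr("CAST(id AS INT) AS id").write.parquet("/data/old")
spark.sql("CREATE TABLE events (id BIGINT) USING PARQUET LOCATION '/data/old'")
spark.sql("SELECT * FROM events").show()  # fails with the exception above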
One possible solution is altering the column type in the old parquet files to long, which I asked about here: How can I change parquet column type from int to long?, but it is very expensive since I have a lot of data.
Another possible solution is to read each parquet file according to its schema into a different spark table and create a union view of the old and new tables, which is very ugly.
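For illustration, that union-view workaround would look roughly like this (table, view, and path names are made up):
# one table per file schema, plus a view that casts the old int column up to bigint
spark.sql("CREATE TABLE events_old (id INT) USING PARQUET LOCATION '/data/old'")
spark.sql("CREATE TABLE events_new (id BIGINT) USING PARQUET LOCATION '/data/new'")
spark.sql("""
    CREATE OR REPLACE VIEW events_all AS
    SELECT CAST(id AS BIGINT) AS id FROM events_old
    UNION ALL
    SELECT id FROM events_new
""")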
Is there another way to read from parquet an int column as long in spark?
Upvotes: 3
Views: 8529
Reputation: 2674
This mostly happens when the columns in the .parquet files are double or float.
One-liner answer: set
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
TL;DR
When we read data using spark, especially parquet data:
data = spark.read.parquet(source_path)
Spark tries to optimize and read the data in a vectorized format from the .parquet files. Even if we do explicit data type casting,
new_data = data.withColumn(col_name, col(col_name).cast(TimestampType()))
spark will use the native data types in parquet (whatever the original data type was in the .parquet files).
This causes issues while writing the data, due to the mismatch between the data and the column type.
To resolve this issue, disable the vectorized reader.
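Putting it together, a minimal sketch (reusing source_path and col_name from above, and casting to long for the int-to-long case in the question):
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# disable the vectorized reader before reading the parquet files
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

data = spark.read.parquet(source_path)
new_data = data.withColumn(col_name, col(col_name).cast(LongType()))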
To know more about the vectorized reader, refer below:
The vectorized Parquet reader enables native record-level filtering using push-down filters, improving memory locality, and cache utilization. If you disable the vectorized Parquet reader, there may be a minor performance impact. You should only disable it, if you have decimal type columns in your source data.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
Also answered here: https://stackoverflow.com/a/77818423/3819751
Upvotes: 0
Reputation: 945
Using pyspark, couldn't you just do
df = spark.read.parquet('path to parquet files')
then just cast the column type in the dataframe:
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

new_df = (df
    .withColumn('col_name', col('col_name').cast(LongType()))
)
and then just save the new dataframe to the same location with overwrite mode.
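A rough sketch of that last step, reusing the placeholder path from above:
# note: overwriting the same path that was just read may require caching the
# dataframe (or writing to a temporary location first), depending on the Spark version
(new_df
    .write
    .mode('overwrite')
    .parquet('path to parquet files'))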
Upvotes: 2