Dror B.

Reputation: 83

Spark: read from parquet an int column as long

I have a parquet file that is read by spark as an external table.

One of the columns is defined as int both in the parquet schema and in the spark table.

Recently, I've discovered int is too small for my needs, so I changed the column type to long in new parquet files. I also changed the type in the spark table to bigint.

However, when I try to read an old parquet file (with int) through spark as an external table (with bigint), I get the following error:

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

One possible solution is altering the column type in the old parquet files to long, which I asked about here: How can I change parquet column type from int to long? However, it is very expensive since I have a lot of data.

Another possible solution is to read each parquet file according to its own schema into a separate spark table and create a union view of the old and new tables (roughly as sketched below), which is very ugly.
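For illustration, a rough sketch of that union approach in PySpark; the paths and the column name long_col are hypothetical placeholders:

from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# Old files store the column as int, new files store it as long
old_df = spark.read.parquet('/data/table/old/')
new_df = spark.read.parquet('/data/table/new/')

# Cast the old int column up to long so both sides match, then union them
combined = (old_df
            .withColumn('long_col', col('long_col').cast(LongType()))
            .unionByName(new_df))
combined.createOrReplaceTempView('combined_view')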

Is there another way to read from parquet an int column as long in spark?

Upvotes: 3

Views: 8529

Answers (2)

ketankk

Reputation: 2674

This mostly happens when columns in the .parquet files are stored as double or float.

One-liner answer: set

spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")

TL;DR

When we read data using Spark, especially Parquet data:

data = spark.read.parquet(source_path)

Spark tries to optimize and read the data in a vectorized format from the .parquet files. Even if we do explicit data type casting,

new_data = data.withColumn(col_name, col(col_name).cast(TimestampType()))

Spark will use the native Parquet data types (whatever original data type was there in the .parquet files).

This causes issues while writing data due to a mismatch between the data and the column type.

To resolve this issue, disable the vectorized reader.
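Putting that together for the question's int-to-long case, a minimal sketch (source_path and col_name are placeholders):

from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# Fall back to the non-vectorized Parquet reader before reading
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

data = spark.read.parquet(source_path)
new_data = data.withColumn(col_name, col(col_name).cast(LongType()))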

To learn about the vectorized reader, see the excerpt below:

The vectorized Parquet reader enables native record-level filtering using push-down filters, improving memory locality and cache utilization. If you disable the vectorized Parquet reader, there may be a minor performance impact. You should only disable it if you have decimal type columns in your source data.

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Also answered here: https://stackoverflow.com/a/77818423/3819751

Upvotes: 0

iambdot

Reputation: 945

Using PySpark, couldn't you just do

df = spark.read.parquet('path to parquet files')

then just cast the column type in the dataframe

from pyspark.sql.functions import col
from pyspark.sql.types import LongType

new_df = (df
          .withColumn('col_name', col('col_name').cast(LongType()))
         )

and then just save the new dataframe to the same location with overwrite mode.
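As a sketch, that save step (reusing the placeholder path from above) would look something like:

# Note: because Spark reads lazily, overwriting the same path that was read from
# can fail; you may need to write to a temporary location first and then swap it in.
new_df.write.mode('overwrite').parquet('path to parquet files')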

Upvotes: 2
