Reputation: 609
I have a large Parquet file (~5 GB) that I want to load in Spark. The following command executes without any error:
df = spark.read.parquet("path/to/file.parquet")
But when I try to run any operation on it, like .show() or .repartition(n), I run into the following error:
java.lang.IllegalArgumentException: Illegal Capacity: -102
Any ideas on how I can fix this?
Upvotes: 2
Views: 2683
Reputation: 1269
It's an integer overflow bug in the underlying Parquet reader; see https://issues.apache.org/jira/browse/PARQUET-1633. The read itself succeeds because Spark is lazy: the overflow only surfaces once an action like .show() forces the actual scan.
Upgrade PySpark to 3.2.1 (for example, pip install --upgrade pyspark==3.2.1 if you installed via pip). It bundles parquet-hadoop-1.12.2, the jar that contains the actual fix.
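A quick way to confirm the upgrade took effect, as a minimal sketch. It assumes the standard layout where the bundled jars live in a jars/ directory next to the install (true for both pip-installed PySpark and a Spark distribution pointed to by SPARK_HOME); adjust the path if your setup differs:
import glob
import os
import pyspark
# PySpark version should be 3.2.1 or later
print(pyspark.__version__)
# Locate the jars directory: SPARK_HOME/jars for a distribution,
# or <pyspark package dir>/jars for a pip install
jar_dir = os.path.join(os.environ.get("SPARK_HOME", os.path.dirname(pyspark.__file__)), "jars")
# The bundled parquet-hadoop jar should be 1.12.2 or later
print(glob.glob(os.path.join(jar_dir, "parquet-hadoop-*.jar")))
If the printed jar is still older than parquet-hadoop-1.12.2, you are likely picking up a different Spark installation than the one you upgraded.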
Upvotes: 3