Reputation: 21
I received the error "Row group size is too large (2111110000)" while writing a DataFrame into Parquet files.
I am assuming this is due to trying to load a table from SQL Server that has very large rows (with a maximum row size of 2.5 GB).
Because of this, I am unable to load the data.
Source table details:
- Number of rows: 1000
- Total size: 200 GB
- Issue: one particular column contains a huge JSON string, which causes this error.
Questions:
1. How can I load this data into Parquet?
2. Can Parquet handle such a huge column with a value of 1.5 GB?
3. Does Parquet try to store one column in a single row group, or will it split the data across multiple row groups if the column size is huge?
I have tried to write the DataFrame to a Parquet table in Spark, but I receive an error related to Parquet internals.
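Roughly, the load looks like the sketch below; the server, database, table name, and credentials are placeholders, not my real values:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlServerToParquet").getOrCreate()

# Read the source table from SQL Server over JDBC
# (URL, table name, and credentials below are placeholders)
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<database>") \
    .option("dbtable", "dbo.<table>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()

# Writing to Parquet is where the "Row group size is too large" error occurs
df.write.mode("overwrite").parquet("/path/to/output.parquet")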
Upvotes: 0
Views: 530
Reputation: 3250
You can write the DataFrame to Parquet format with a specified row group size.
Approach 1:
Increasing spark.executor.memory can help with errors caused by large row group sizes. When you are working with large datasets that require more memory, setting a higher memory allocation for the executors can resolve memory-related errors.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("ExampleApp") \
.config("spark.executor.memory", "8g") \
.getOrCreate()
Results:
Executor memory: 8g
Approach 2:
I have tried the approach below as a workaround.
As you mentioned, one column holds a huge JSON string, and that value ends up in a single row group. String columns usually compress well, and Parquet applies dictionary encoding to them, but when the column size within a row group is what causes the problem, reducing the row group size can help.
You can try the following:
df = spark.read.format("delta").load("/FileStore/tables/dilip02")
df.write.format("parquet").mode("overwrite").option("rowGroupSize", 1000000).save("/file_new.parquet")
Results:
Number of rows: 3
root
|-- id: long (nullable = true)
|-- json_column: string (nullable = true)
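If the per-write option is not picked up in your environment, the same row group size can also be set session-wide through the Hadoop configuration. This is a minimal sketch of that variant, reusing the example path from above; the 1000000-byte value is only illustrative:
from pyspark.sql import SparkSession

# spark.hadoop.* config entries are copied into the Hadoop configuration,
# so parquet.block.size (the row group size in bytes) applies to all Parquet writes
spark = SparkSession.builder \
    .appName("ExampleApp") \
    .config("spark.hadoop.parquet.block.size", 1000000) \
    .getOrCreate()

df = spark.read.format("delta").load("/FileStore/tables/dilip02")
df.write.format("parquet").mode("overwrite").save("/file_new.parquet")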
Reference: When Parquet Columns Get Too Big
You can also refer to the solution provided by @Tagar (SO link).
Upvotes: 0