Matthew Thomas

Reputation: 861

How to determine if checksums are present in parquet file?

I recently came across an error when reading a parquet table in pyspark:

Caused by: org.apache.parquet.io.ParquetDecodingException: could not verify page integrity, CRC checksum verification failed

This table was transferred over FTP. The error was resolved when I deleted the table at the destination and transferred it again from the source. Therefore, at least one of the underlying parquet files must have been corrupted in transit during the first attempt.

I am concerned that the generic job used to transfer this table is not robust to data corruption in transit, and I want to inspect other tables that were transferred by the same job. However, I suspect that the underlying parquet files of most of these tables do not contain checksums; otherwise, chances are my team would have run into this error before.
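In the meantime, the most direct safeguard I can think of for the transfer job itself is comparing whole-file digests on both sides, independent of any parquet-level checksums. A minimal sketch (the helper names and directory layout are mine, not part of any transfer tool):

import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_transfer(source_dir, dest_dir):
    """Return the parquet files whose digests differ between two directories."""
    mismatches = []
    for src in Path(source_dir).rglob("*.parquet"):
        dst = Path(dest_dir) / src.relative_to(source_dir)
        if not dst.is_file() or file_sha256(src) != file_sha256(dst):
            mismatches.append(str(src.relative_to(source_dir)))
    return mismatches

Of course this only helps when a job can see both sides of the transfer; it tells me nothing about tables that have already been moved.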

Sources I have come across lead me to believe that checksums in parquet page headers are optional and not enabled by default. If a checksum is present, it is stored in the page header as an integer; otherwise the field is simply absent.

Is there any way in python to read the CRC of parquet pages directly, or at least to determine indirectly whether one is present? The Stack Overflow question and answer below seem to suggest it cannot be done in pyarrow, unfortunately.

How do I get page level data of a parquet file with pyarrow?
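One indirect route I am aware of: recent pyarrow releases (13.0 or later, if I remember correctly) added a page_checksum_verification flag for reads and a write_page_checksum flag for writes. Neither exposes the raw CRC values, but forcing verification on should at least raise on a corrupt page when checksums are present. A rough sketch under that assumption:

import pyarrow.parquet as pq

def read_with_crc_verification(path):
    """Read a parquet file with page-level CRC verification forced on.

    Assumes pyarrow >= 13.0. A CRC mismatch should raise; pages that
    simply carry no checksum are read without complaint, so a clean
    read does not prove that checksums exist.
    """
    return pq.read_table(path, page_checksum_verification=True)

# For comparison, writing with page checksums enabled:
# pq.write_table(table, "out.parquet", write_page_checksum=True)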


Update 2025-01-30

I have not found a solution in python after further digging. The python packages parquet-tools and parquet-cli are not granular enough, but the Java version of parquet-tools is. The easiest way I found to run the Java version is through a dockerised build: https://github.com/rm3l/docker-parquet-tools

$ docker container run -v ./local/path/to/parquet:/container/path/to/parquet --rm -t rm3l/parquet-tools:latest dump -n /container/path/to/parquet/test.parquet
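To sweep an entire directory of transferred files, I wrap that invocation in a small script and pull out the CRC field from each page line of the dump (the image and mount layout follow the command above; the CRC:... token format matches the parquet-tools output shown in the tests below):

import re
import subprocess
from pathlib import Path

IMAGE = "rm3l/parquet-tools:latest"  # same dockerised parquet-tools as above

def page_crc_tokens(local_dir, container_dir="/data"):
    """Dump each parquet file in local_dir via parquet-tools and collect
    the CRC:... token from every page line of the output."""
    mount = f"{Path(local_dir).resolve()}:{container_dir}"
    results = {}
    for f in sorted(Path(local_dir).glob("*.parquet")):
        proc = subprocess.run(
            # no -t flag: a TTY is unnecessary when capturing output
            ["docker", "container", "run", "--rm", "-v", mount,
             IMAGE, "dump", "-n", f"{container_dir}/{f.name}"],
            capture_output=True, text=True, check=True)
        results[f.name] = re.findall(r"CRC:(?:\[[^\]]*\]|\S+)", proc.stdout)
    return results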

I ran the two experiments below to determine if I can reliably detect corrupted parquet files.

Test A:

data = [("John", 28), ("Anna", 23), ("Mike", 35), ("Sara", 30), ("David", 40)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.repartition(1).write.option("parquet.page.write-checksum.enabled", "true").parquet(...)

Dumping the file with parquet-tools then produced output that includes the line below:

page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] CRC:[PAGE CORRUPT] SZ:47 VC:5

Test B:

data = [("John", 28), ("Anna", 23), ("Mike", 35), ("Sara", 30), ("David", 40)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.repartition(1).write.option("parquet.page.write-checksum.enabled", "false").parquet(...)

Dumping the file with parquet-tools then produced output that includes the line below:

page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] CRC:[none] SZ:47 VC:5

The result of Test A is very unexpected: the write should not have produced a corrupted file, and yet parquet-tools reports CRC:[PAGE CORRUPT] rather than the integer checksum value I expected.
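As a cross-check on whether the Test A file is genuinely corrupt or parquet-tools is misreporting, re-reading it with verification forced on is the cheapest test I can think of, e.g. with the pyarrow flag mentioned above (again assuming pyarrow >= 13.0):

import pyarrow.parquet as pq

# A genuine CRC mismatch should raise here; a clean read would point
# the finger at parquet-tools' dump output instead.
table = pq.read_table("local/path/to/parquet/test.parquet",
                      page_checksum_verification=True)
print(table.num_rows)  # expect 5 for the toy DataFrame above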

Upvotes: 1

Views: 95

Answers (0)
