ranger_sim_g

Reputation: 175

How do I find the parquet.writer.version from an existing parquet file?

I've started to play with Apache Parquet and was surprised to find that there are two writer versions:

PARQUET_1_0 ("v1"),
PARQUET_2_0 ("v2");

Source: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L69

I tried to get the metadata/dump using parquet-tools to determine the version, but it did not include this info.

Currently I have a parquet file.

How do I determine the parquet write version used to write this file?

Upvotes: 0

Views: 1196

Answers (2)

Amaura

Reputation: 11

I've been struggling with this. If you install parquet-tools, you can run:

parquet-tools inspect <parquet file> --detail | head -n2

and you get the version, which is different from the format version:

FileMetaData
    version = 1

However, I'm not sure whether it is affected by the file writer version.
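
For anyone who wants to see where that number comes from without installing anything, here is a minimal standard-library sketch that reads the same FileMetaData version field straight out of the file footer. It assumes a plaintext (unencrypted) footer and a writer that serializes the Thrift fields in id order, so that version is the first field in the footer. The names footer_version and read_uvarint are made up for this illustration. As noted above, it is not certain that this field reflects parquet.writer.version.

```python
import io
import struct

MAGIC = b"PAR1"  # magic bytes at the end of a file with a plaintext footer


def read_uvarint(buf, pos):
    """Decode an unsigned varint (7 bits per byte, low bits first) at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def footer_version(f):
    """Return FileMetaData.version from a seekable binary file object."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size < 12:
        raise ValueError("file too small to be Parquet")
    # The file ends with: <footer bytes> <4-byte little-endian length> <magic>
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic (not Parquet, or encrypted footer)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len == 0 or footer_len > size - 12:
        raise ValueError("implausible footer length")
    f.seek(-8 - footer_len, io.SEEK_END)
    head = f.read(min(footer_len, 6))  # field header + at most 5 varint bytes
    # Thrift compact field header: high nibble = field-id delta, low nibble = type.
    # 0x15 is field id 1 with type i32, which is FileMetaData.version.
    if head[0] != 0x15:
        raise ValueError("footer does not start with the version field")
    raw, _ = read_uvarint(head, 1)
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            print("FileMetaData version =", footer_version(fh))
```

Saved as a script and run with the path to a Parquet file as its only argument, it should print the same number that parquet-tools shows under FileMetaData. Only the last few bytes of the file and the start of the footer are read, so it stays cheap on large files.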

Upvotes: 1

lgao

Reputation: 31

You can use pyarrow.parquet to view the writer version of the Parquet file:

import pyarrow.parquet as pq

# Open the file and print its footer metadata
parquet_file = pq.ParquetFile('sample.parquet')
print(parquet_file.metadata)

This would print something like:

<pyarrow._parquet.FileMetaData object at 0x7f72447fc530>
  created_by: parquet-mr version 1.12.2 (build d35ce51f56a2166b09164cc89d7c18ce346dc83f)
  num_columns: 14
  num_rows: 11464901
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 3277

And format_version is what you are looking for.

See https://arrow.apache.org/docs/python/parquet.html

Upvotes: 1
