Mallar Sen

Reputation: 13

How to uncompress a parquet file?

I have a test.parquet file that is around 60 MB in size. Using the script below, I found that the column compression for the Parquet file is GZIP.

import pyarrow.parquet as pq
parquet_file = pq.ParquetFile("C://Users/path/test.parquet")
print(parquet_file.metadata.row_group(0).column(0))

OUTPUT

<pyarrow._parquet.ColumnChunkMetaData object at 0x0000017E6AC9FBD8>
  file_offset: 4
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 173664
  path_in_schema: event-id
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x0000017E6AC9FE08>
      has_min_max: True
      min: 00004759-abeb-40fc-a9c6-1c79ab7c6726
      max: ffffe406-0a2f-42d9-a882-784e3527102d
      null_count: 0
      distinct_count: 0
      num_values: 173664
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: GZIP
  encodings: ('PLAIN', 'BIT_PACKED')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 3796510
  total_uncompressed_size: 6947287

I want to uncompress this Parquet file before processing it. Using Python, how can I uncompress this Parquet file, which has GZIP compression?

Upvotes: 1

Views: 8025

Answers (2)

Felix K Jose

Reputation: 882

You can use PySpark to achieve this.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ParquetReaderTesting").getOrCreate()

df = spark.read.parquet("data/")  # Reads all Parquet files in that directory; Spark handles decompression for you
# df = spark.read.parquet("data/<Specific parquet file>")
df.show()
df.printSchema()
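
If you actually want an uncompressed copy written back to disk rather than just reading the data, Spark can do that as well. A minimal sketch, where the output directory data_uncompressed/ is a hypothetical name of your choosing:

df.write.parquet("data_uncompressed/", compression="none")  # "none" writes the data without compression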

Upvotes: 1

Ben Schmidt

Reputation: 146

Compression can vary by column in Parquet, so you can't be sure the whole file is gzip-compressed, only this column. In general, the pyarrow Parquet reader handles decompression for you transparently; you can just do

pq.read_table('example.parquet')

or (for a pandas dataframe)

pq.read_table('example.parquet').to_pandas()

The lower-level pq.ParquetFile interface is useful if you want to stream data in batches to avoid reading it all into memory, but in that case you wouldn't be decompressing the whole file before proceeding anyway.
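
If you really do need an uncompressed copy of the file on disk before processing, a minimal sketch with pyarrow (the output name example_uncompressed.parquet is hypothetical):

import pyarrow.parquet as pq

table = pq.read_table('example.parquet')  # gzip data is decompressed transparently on read
# 'example_uncompressed.parquet' is a hypothetical output name
pq.write_table(table, 'example_uncompressed.parquet', compression='none')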
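
And for the streaming case mentioned above, a sketch using ParquetFile.iter_batches, which yields already-decompressed record batches (the batch size is just an example value):

pf = pq.ParquetFile('example.parquet')
for batch in pf.iter_batches(batch_size=65536):  # each batch is a decompressed pyarrow.RecordBatch
    print(batch.num_rows)  # replace with your per-batch processing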

Upvotes: 4
