Reputation: 13
I have a test.parquet file that is around 60 MB. Using the script below, I found that the column compression for the Parquet file is GZIP.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile("C://Users/path/test.parquet")
# Inspect metadata for the first column chunk of the first row group
print(parquet_file.metadata.row_group(0).column(0))
OUTPUT
<pyarrow._parquet.ColumnChunkMetaData object at 0x0000017E6AC9FBD8>
file_offset: 4
file_path:
physical_type: BYTE_ARRAY
num_values: 173664
path_in_schema: event-id
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x0000017E6AC9FE08>
has_min_max: True
min: 00004759-abeb-40fc-a9c6-1c79ab7c6726
max: ffffe406-0a2f-42d9-a882-784e3527102d
null_count: 0
distinct_count: 0
num_values: 173664
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP
encodings: ('PLAIN', 'BIT_PACKED')
has_dictionary_page: False
dictionary_page_offset: None
data_page_offset: 4
total_compressed_size: 3796510
total_uncompressed_size: 6947287
I want to decompress this Parquet file before processing it. How can I decompress this GZIP-compressed Parquet file using Python?
Upvotes: 1
Views: 8025
Reputation: 882
You can use pyspark to achieve this.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ParquetReaderTesting").getOrCreate()
df = spark.read.parquet("data/") # Reads all parquet files in that directory and Spark takes care of uncompress
# the data
# df = spark.read.parquet("data/<Specific parquet file>")
df.show()
df.printSchema()
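If the goal is to persist an uncompressed copy on disk rather than just read the data, Spark can also write it back out with the codec disabled. A minimal sketch (the output directory name is hypothetical):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ParquetReaderTesting").getOrCreate()
df = spark.read.parquet("data/")  # Spark decompresses the GZIP pages transparently on read
# "none" disables the Parquet compression codec on write
df.write.option("compression", "none").parquet("data_uncompressed/")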
Upvotes: 1
Reputation: 146
Compression can vary by column in Parquet, so you can't be sure the whole file is gzip-compressed, only this column. In general, the pyarrow Parquet reader handles decompression for you transparently; you can just do
pq.read_table('example.parquet')
or (for a pandas dataframe)
pq.read_table('example.parquet').to_pandas()
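If you want an uncompressed copy of the file on disk rather than just decompressed data in memory, you can read it and write it back out with the codec disabled. A minimal sketch (the output filename is hypothetical):
import pyarrow.parquet as pq
# read_table decompresses the GZIP-compressed column chunks transparently
table = pq.read_table('example.parquet')
# compression='none' writes the new file without any compression codec
pq.write_table(table, 'example_uncompressed.parquet', compression='none')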
The lower-level pq.ParquetFile interface is useful if you want to stream the data to avoid reading it all into memory, but in that case you wouldn't be decompressing the whole file before proceeding anyway.
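For example, a minimal streaming sketch, assuming a reasonably recent pyarrow that provides ParquetFile.iter_batches (the batch size is arbitrary and process is a placeholder for your own handling):
import pyarrow.parquet as pq
pf = pq.ParquetFile('example.parquet')
# Each RecordBatch is decompressed on the fly as it is read
for batch in pf.iter_batches(batch_size=65536):
    process(batch)  # hypothetical per-batch processing function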
Upvotes: 4