JBoy

Reputation: 5735

Reading gzipped Parquet files from Spark

I have a Parquet file that I am reading with Spark:

SparkSession.builder()
    .appName("test")
    .config("spark.sql.parquet.compression.codec", "gzip")
    .getOrCreate()
    .read().parquet(resourcePath)

This is the code snippet used to read the Parquet file.
When the file is not compressed everything works fine, but when I gzip it:

gzip fileName.parquet

Then I get a RuntimeException:

is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [44, 64, 91, 0]

But the gzip format should be supported (it is supported), so what am I doing wrong here?

Upvotes: 5

Views: 8093

Answers (1)

Oli

Reputation: 10406

Gzip is supported by Spark and by Parquet, but not like this.

Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. [...] It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

So Parquet is a file format that can use gzip as its compression algorithm, but if you compress a Parquet file with gzip yourself, it won't be a Parquet file anymore. For instance, in Spark you can do this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
    .config("spark.sql.parquet.compression.codec", "gzip")
    .getOrCreate
spark.range(10).write.parquet(".../test.parquet")

If I have a look at test.parquet, it is a directory containing gzip-compressed Parquet part files:

> ls test.parquet/
part-00000-890dc5e5-ccfe-4e60-877a-79585d444149-c000.gz.parquet
part-00001-890dc5e5-ccfe-4e60-877a-79585d444149-c000.gz.parquet
_SUCCESS
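
The compression is applied inside the Parquet files themselves, so Spark reads this directory back transparently; the codec is recorded in the Parquet metadata and no read-side configuration is needed. A minimal sketch, reusing the elided output path from the snippet above:

    // Reading back the gzip-compressed Parquet data written above.
    // The codec is stored in the file metadata, so no extra config is required.
    val df = spark.read.parquet(".../test.parquet")
    df.show()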

Spark also supports reading plain gzip files. So if I create a text file and gzip it myself like this:

> cat file.txt
ab
cd
> gzip file.txt

And read it with Spark:

scala> sc.textFile("hdfs:///tmp/file.txt.gz").collect
res6: Array[String] = Array(ab, cd)
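
So the fix for your case is to leave the Parquet file alone and let Parquet apply gzip internally when writing, or to gunzip the manually compressed file before reading it. A rough sketch, assuming the file names from your question (the `gunzip` step and helper names here are illustrative, not part of your code):

    // Hypothetical workaround: undo the manual gzip so the file ends with the
    // PAR1 magic bytes ([80, 65, 82, 49] is ASCII for "PAR1") that the reader expects.
    import scala.sys.process._
    "gunzip fileName.parquet.gz".!                   // restore the original Parquet file
    val df = spark.read.parquet("fileName.parquet")  // now readable as Parquet again
    df.show()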

Upvotes: 4
