Reputation: 5735
I have a Parquet file I am reading with Spark:
SparkSession.builder()
    .appName("test")
    .config("spark.sql.parquet.compression.codec", "gzip")
    .getOrCreate()   // the session must be built before calling read()
    .read()
    .parquet(resourcePath);
This is the code snippet used to read the Parquet file.
When the file is not compressed everything works fine, but when I gzip it:
gzip fileName.parquet
Then I get a RuntimeException:
is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [44, 64, 91, 0]
But the gzip format is supposed to be supported (and it is supported), so what am I doing wrong here?
Upvotes: 5
Views: 8093
Reputation: 10406
Gzip is supported by Spark and by Parquet, but not like this.
Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. [...] It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
So Parquet is a file format that can use gzip as its internal compression algorithm, but if you compress a Parquet file with gzip yourself, it is no longer a Parquet file. For instance, in Spark you can do this:
val spark = SparkSession.builder
  .config("spark.sql.parquet.compression.codec", "gzip")
  .getOrCreate
spark.range(10).write.parquet(".../test.parquet")
If I have a look at test.parquet, it is a directory containing gzip-compressed Parquet files:
> ls test.parquet/
part-00000-890dc5e5-ccfe-4e60-877a-79585d444149-c000.gz.parquet
part-00001-890dc5e5-ccfe-4e60-877a-79585d444149-c000.gz.parquet
_SUCCESS
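Note that nothing special is needed to read this back: the compression codec is recorded in the Parquet file metadata itself, so a plain read works. A minimal sketch, reusing the spark session and the elided path from above:
// No codec config needed on read; Parquet stores the compression
// codec in its own file metadata.
val df = spark.read.parquet(".../test.parquet")
df.show()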
Spark can also read plain gzip-compressed files directly. So if I create a text file and gzip it myself like this:
> cat file.txt
ab
cd
> gzip file.txt
And read it with Spark:
scala> sc.textFile("hdfs:///tmp/file.txt.gz").collect
res6: Array[String] = Array(ab, cd)
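Incidentally, the bytes in your error message make the problem visible: [80, 65, 82, 49] is just "PAR1" in ASCII, the magic number Parquet expects at the head and tail of every file. A minimal sketch of a hypothetical check (the helper name and the read-whole-file approach are my own, fine for small files only):
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical helper: a real Parquet file starts and ends with the
// 4-byte ASCII magic "PAR1" ([80, 65, 82, 49]).
def looksLikeParquet(path: String): Boolean = {
  // Reads the whole file into memory; good enough for a quick check.
  val bytes = Files.readAllBytes(Paths.get(path))
  val magic = "PAR1".getBytes(StandardCharsets.US_ASCII)
  bytes.length >= 8 &&
    bytes.take(4).sameElements(magic) &&
    bytes.takeRight(4).sameElements(magic)
}
Running this on your gzipped file returns false, because the gzip wrapper replaces those head and tail bytes, which is exactly what the exception reports.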
Upvotes: 4