Reputation: 405
I was writing data to Hadoop and Hive in Parquet format using Spark. I want to enable compression, but I can only find two types of compression being used most of the time: Snappy and Gzip. Does Parquet also support other codecs, such as Deflate and LZO?
Upvotes: 18
Views: 30860
Reputation: 8796
The supported compression types for Apache Parquet are specified in the parquet-format
repository:
/**
* Supported compression algorithms.
*
* Codecs added in 2.4 can be read by readers based on 2.4 and later.
* Codec support may vary between readers based on the format version and
* libraries available at runtime. Gzip, Snappy, and LZ4 codecs are
* widely available, while Zstd and Brotli require additional libraries.
*/
enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4;  // Added in 2.4
  LZ4 = 5;     // Added in 2.4
  ZSTD = 6;    // Added in 2.4
}
Snappy and Gzip are the most commonly used codecs and are supported by all implementations. LZ4 and ZSTD compress better than the former two, but they are a rather recent addition to the format, so they are only supported in newer versions of some implementations.
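For illustration, here is a minimal Scala sketch that writes the same DataFrame with several of these codecs so you can compare output sizes. It assumes a Spark build whose bundled Parquet libraries actually ship the codec you pick (the newer codecs such as zstd only became usable around Spark 2.4); the output paths are placeholders.

import org.apache.spark.sql.SparkSession

// Minimal sketch: write identical data with different codecs and compare
// the resulting directory sizes. Assumes the codec is available at runtime;
// /tmp/parquet-* are hypothetical output paths.
val spark = SparkSession.builder().appName("parquet-codecs").getOrCreate()
val df = spark.range(1000000L).toDF("id")

for (codec <- Seq("snappy", "gzip", "zstd")) {
  df.write
    .option("compression", codec)       // per-write codec selection
    .mode("overwrite")
    .parquet(s"/tmp/parquet-$codec")    // hypothetical output directory
}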
Upvotes: 24
Reputation: 9067
From the Spark source code, branch 2.1:
You can set the following Parquet-specific option(s) for writing Parquet files:

compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.
...
Overall, the supported compression codecs are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd.
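As a sketch of how the two settings quoted above interact (df and the paths are placeholders): the session-level config sets the default codec, and the per-write compression option overrides it for a single write.

// Session-wide default codec for all Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

df.write.parquet("/tmp/out-gzip")  // uses the session default (gzip)

// The per-write option overrides the session default
df.write.option("compression", "snappy").parquet("/tmp/out-snappy")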
Upvotes: 11